In this study, we used deep learning models to identify a reduction in LFB staining intensity in the top attention tiles from brain sections of donors with antemortem evidence of cognitive impairment. Because LFB staining in brain tissue is generally used to quantify the amount of myelin [24, 25], the signal that we identified is likely due to decreased myelin staining intensity. Our results cannot distinguish decreased myelin density with spared axons from axon injury with associated myelin loss. In many cases, diminished myelin density in aging is associated with cerebrovascular disease. This is consistent with the strong correlations we identified in this study between cerebrovascular pathology, the predicted probability of cognitive impairment, and decreased LFB staining in the top attention tiles. Even when accounting for age, there was still an association between decreased LFB staining in the top attention tiles and cognitive impairment. Treating cerebrovascular disease risk factors such as hypertension has been found to decrease white matter pathology and partially reverse age-related cognitive impairment. However, age-associated decreases in myelin density have numerous possible causes other than cerebrovascular disease, such as nearby AD cortical pathology [35, 36], a primary effect of aging [37, 38, 39], repetitive head impacts, or the accumulated effects of excessive alcohol use. It is unclear to what extent the decreased myelin density that we found to be associated with age-related cognitive impairment is explained solely by cerebrovascular pathology as opposed to these other possible etiologies, which warrants further investigation.
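To make the idea of "accounting for age" concrete, the sketch below computes a first-order partial correlation between two per-donor measures while controlling for a third. This is only one simple way to adjust for a covariate and is not necessarily the statistical procedure used in this study; the function names and the example values are hypothetical.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(
    x: Sequence[float], y: Sequence[float], z: Sequence[float]
) -> float:
    """Correlation of x and y after removing the linear effect of z from both."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))


# Hypothetical per-donor values, for illustration only (not study data).
age = [70, 75, 80, 85, 90, 95]
lfb_intensity = [0.62, 0.55, 0.58, 0.45, 0.47, 0.36]
impairment_prob = [0.20, 0.35, 0.30, 0.55, 0.50, 0.75]

adjusted_r = partial_correlation(lfb_intensity, impairment_prob, age)
```

A partial correlation that remains clearly different from zero would indicate that the two measures share variance that age does not explain. A regression model with age as a covariate would be an equivalent and more flexible alternative.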
While postmortem brain gene expression and pathoanatomical studies in aging and AD have often focused on grey matter, neuroimaging studies over the past several decades have frequently found alterations in the white matter to be strongly associated with cognitive impairment. Leukoaraiosis (leuko–white, araiosis–rarefaction) is a common neuroimaging abnormality of the white matter that can be found in periventricular or subcortical areas [43, 44]. On T2-weighted and FLAIR MRI, leukoaraiosis is frequently described as white matter hyperintensities. While leukoaraiosis is strongly associated with cerebrovascular disease, the precise etiology remains unclear [44, 45]. Clinically, leukoaraiosis is associated with cognitive deficits such as bradyphrenia. Histologically, leukoaraiosis has been suggested to be associated with decreased density of myelin sheaths. Our deep learning models identified a neurohistologic signal for cognitive impairment that was (a) focused in the white matter, (b) in some cases scattered in a non-uniform pattern across the tissue, and (c) associated with decreased myelin staining intensity. Although our data set lacks the associated in vivo neuroimaging data needed to support conclusive statements, one clear possibility is that the white matter histologic alterations identified by the deep learning models reflect an etiopathology similar to that of the neuroimaging finding of leukoaraiosis. We propose that diminished LFB staining intensity in particular areas identified by a deep learning model may be a quantitative way to assess for the presence of leukoaraiosis-associated neuropathology in postmortem brains.
It is important to consider the limitations of this study. First, compared to previously published weakly supervised learning studies in oncology (which often have n > 1000), the data set employed here (n = 716) is not as large [13, 14]. The absence of significant Aβ burden in this cohort also limits how representative the cohort is of the population at large. This adds to the numerous selection biases in brain donation-based autopsy cohorts in general. Second, the WSI data set analyzed only contains one stain, the LH&E stain. While LFB staining is ideal for detecting myelin, it is possible that it highlighted the white matter to a disproportionate degree and thereby affected the deep learning algorithm results. Third, we were unable to assess for comorbid TDP-43 pathology, which would have allowed us to screen for limbic-predominant age-related TDP-43 encephalopathy (LATE), a common TDP-43 proteinopathy associated with an amnestic dementia in elderly individuals. Additionally, because we only looked at two brain regions, our anatomical sampling is limited, which is problematic because cognitive impairment is determined by accumulated lesion burden across the brain.
Another potential limitation is the possibility of systematic variation in staining properties across WSIs. We did not perform WSI color stain normalization because the slides were all stained uniformly at the same center and with the same platform, minimizing systematic heterogeneity. Furthermore, as a tile-level normalization measure, we calculated the ratio of the blue color intensity in each tile, which yielded the same general result as when we only used the dark blue pixel range. It is still possible that a fixation or staining artifact may have affected the cognitive impairment probability estimates and/or attention signals. For example, deeper areas of the brain often had qualitatively higher attention signals. However, the attention signal appeared to follow anatomical compartments, such as the deep white matter, while sparing the subcortical U-fibers, regardless of the depth of these compartments. We therefore consider our results less likely to be due to artifactual variation in staining intensity. Taken together, addressing the issue of stain normalization without introducing other biases is a complex topic, especially in WSIs stained with multiple types of stains such as in our data, warranting further research.
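As a rough illustration of the two tile-level measures mentioned above, the following Python sketch computes a blue-channel ratio and a dark-blue positive pixel fraction for a single tile. The exact definitions and thresholds used in this study are not reproduced here: the function names, the ratio definition B / (R + G + B), and the dark-blue RGB bounds are all illustrative assumptions.

```python
from typing import Iterable, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each channel in 0-255


def blue_ratio(tile: Iterable[Pixel]) -> float:
    """Share of the summed RGB intensity carried by the blue channel.

    Assumed definition: sum(B) / sum(R + G + B) over all pixels in the tile.
    """
    total = 0
    blue = 0
    for r, g, b in tile:
        total += r + g + b
        blue += b
    return blue / total if total else 0.0


def dark_blue_fraction(
    tile: Iterable[Pixel],
    max_red: int = 100,    # hypothetical upper bound for the red channel
    max_green: int = 120,  # hypothetical upper bound for the green channel
    min_blue: int = 120,   # hypothetical lower bound for the blue channel
) -> float:
    """Positive pixel count: fraction of pixels inside a dark-blue RGB range."""
    pixels = list(tile)
    if not pixels:
        return 0.0
    hits = sum(
        1
        for r, g, b in pixels
        if r <= max_red and g <= max_green and b >= min_blue
    )
    return hits / len(pixels)


# Toy tile with two dark-blue pixels and two pale background pixels.
example_tile = [(30, 60, 180), (200, 200, 210), (250, 250, 250), (40, 80, 160)]
ratio = blue_ratio(example_tile)
fraction = dark_blue_fraction(example_tile)
```

Because a ratio is scaled by the overall brightness of the tile, it is less sensitive to global intensity shifts than a fixed pixel range, which is one reason the two measures are worth comparing.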
One of the concerns with contemporary deep learning models is that the basis of their predictions is challenging to understand. In this study, we provided one measure of interpretability, by leveraging the intrinsic attention mechanism of the model to quantify the degree of myelin pallor in the top attention tiles. However, this interpretability measure does not explain all of the model’s attention scores, nor does it fully explain the model’s predictions of cognitive impairment. For example, it is unclear why some of the top attention tiles from brain donors predicted to have cognitive impairment, especially those in the model trained on the hippocampus, still appear to have intact myelination. This suggests that other structural or cellular features are playing a role. Additionally, the probability of cognitive impairment prediction is more strongly correlated with Braak stage in the model from the hippocampus than the frontal cortex, but the features that underlie this difference remain to be determined. As a result, much of the variance in the models’ predictions remains opaque. Furthermore, our interpretability result was derived manually, by inspecting the results of the trained models, noting a qualitative difference, and then developing a metric to quantify this difference. If deep learning models are eventually going to be used safely and effectively in neuropathology research and clinical practice, then there is a critical need to make them understandable to humans. Improving the transparency of deep learning model predictions in a more automated way, for example by using more explainable architectures or distillation tools, is an essential research direction for the field.
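The interpretability step described above (select the tiles with the highest attention, then quantify a staining metric on those tiles) can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this study: it assumes that a per-tile attention score and a per-tile staining metric have already been computed, and the softmax normalization, the function names, and the choice of k are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Normalize raw attention scores so that they sum to 1 across tiles."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_attention_indices(attention: Sequence[float], k: int) -> List[int]:
    """Indices of the k tiles with the highest attention, highest first."""
    order = sorted(range(len(attention)), key=lambda i: attention[i], reverse=True)
    return order[:k]


def mean_metric_in_top_tiles(
    attention: Sequence[float], metric: Sequence[float], k: int
) -> float:
    """Average a per-tile metric (e.g. a staining intensity) over the top-k tiles."""
    if len(attention) != len(metric):
        raise ValueError("attention and metric must have one value per tile")
    indices = top_attention_indices(attention, k)
    if not indices:
        raise ValueError("no tiles selected")
    return sum(metric[i] for i in indices) / len(indices)


# Toy slide with four tiles (values are illustrative, not study data).
raw_scores = [0.1, 2.0, -1.0, 1.5]
weights = softmax(raw_scores)
staining = [0.9, 0.2, 0.8, 0.4]
top_two_mean = mean_metric_in_top_tiles(weights, staining, k=2)
```

Comparing this slide-level summary between donors with and without predicted cognitive impairment is what turns the attention signal into a testable quantity. It summarizes only the selected tiles, so it cannot account for attention that falls elsewhere on the slide.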
Although there are some additional limitations to our study, we expect that our methodology lays the groundwork for further probing of the histopathology of age-related cognitive impairment in future studies that will be able to address these limitations. While our current slide-level predictive accuracy is modest, as our annotated WSI data sets grow, we expect that our trained models will improve in stability, discriminative power, and ability to pinpoint morphological features associated with cognitive impairment. Related to this, while some of the clinicopathologic correlations were statistically significant, they occasionally had weak correlation strengths. Larger and more richly annotated data sets will help to further parse out the practical implications of these correlations. Because the accuracy of our trained models is limited, this study can be conceptualized as a proof of concept for further studies, and certainly not fully dispositive of the underlying pathophysiology of cognitive impairment or applicable to clinical practice. Additionally, while we only focused on a robust yet general approach to assessing myelin, i.e. LH&E stained tissue sections, future studies deploying additional modalities of assessing myelin injury, such as immunohistochemical staining for oligodendrocyte, axonal, vascular, inflammatory, and myelin markers, will help to further elucidate the pathogenesis of age-related cognitive impairment. Finally, because we only have WSI data available from two brain regions, we are limited in our ability to explain why the results from the two brain regions appeared to differ in some ways.
Our ability to interpret differences in deep learning-derived metrics across brain regions will improve with richer data sets containing WSIs from more brain regions. As compared to cancer pathology, which has heretofore been the main use case of weakly supervised deep learning in digital pathology, in studying the neuropathology of dementia, there is less of an emphasis on diagnosis and more of an emphasis on the inference of pathophysiology. This is in part because cancer can be more frequently associated with one causal type, whereas cognitive deficits in the brain are generally due to overlapping pathologies with complex patterns of comorbidity. The multifactorial nature of cognitive deficits lends itself well to multidimensional interpretation studies. First, it emphasizes the value of quantitative probability estimates of cognitive impairment instead of binary labels, which allow for more precise correlation analysis with other clinicopathologic features. Second, the relative focus on understanding pathophysiology in neuropathology also underscores the value of deterministic computer vision studies, such as positive pixel counting, as a downstream method for interrogating attention or other interpretability signals present in deep learning models. While the prediction capacity of deep learning models in digital pathology can be expected to continue to improve rapidly, our ability to understand what histopathologic features those models are focused on is lagging. Improving our suite of methods for the interpretation of deep learning models will allow us to best harness them and to understand how they may be flawed or biased. Because the study of the neuropathology of dementia remains driven by human ingenuity, more interpretable deep learning methods will be essential to accelerate its adoption across the field.
Our results also suggest several future directions that would illuminate additional aspects of the histopathology of cognitive impairment. One possible analysis would be to combine tiles from the hippocampal and frontal cortex regions into one unified data set prior to training the model. This analysis would potentially show similarities and differences in the pathology present in the frontal cortex and hippocampus, as well as allow an assessment of the relative contributions of each to the models’ predictions. Another future research direction would be to query the pixel-level features that are important in making the prediction, rather than tile-level summary statistics such as positive pixel counts. This will require approaches such as semantic segmentation that can parse a tile into distinct components. This would allow a dissection of which pixel-level features, such as vacuolization, nuclear shape, or fiber orientation, are important for making the cognitive impairment predictions. Finally, assessing the impact of preprocessing procedures, for example to determine the robustness of training deep learning models with different scanners, tile parcellation schemes, and color normalization methods, is a critical future research direction.
Predicting the presence or absence of cognitive impairment with the use of single histology sections on an individual level is an extremely challenging task. There are known barriers related to disease heterogeneity, variation in clinician practices, and cognitive reserve [8, 51]. In this study, we employed a deep learning classification model for inference of pathophysiology from histology slides with noisy labels of cognitive impairment, resulting in predictions with modest accuracy but significantly above chance level. Interpretation studies suggested that top performing models in the hippocampus and frontal cortex focused on similar aspects of white matter pathology. On a macroanatomic level, they had higher attention on white matter than gray matter; on a microanatomic level, the highest attention tiles showed differences in LFB staining intensity between slides from brain donors predicted to have cognitive impairment and those predicted not to have it. Both the probability estimates of cognitive impairment and the measure of LFB staining intensity in the top attention tiles were partially independent of several known pathoclinical features, suggesting that they may be identifying unexpected aspects of pathophysiology. On the other hand, the probability estimates of cognitive impairment were not completely explained by LFB intensity in the top attention tiles; for example, ARTAG positivity was significantly associated with the probability estimates of cognitive impairment from the deep learning models but not with LFB intensity in the top attention tiles. Our results demonstrate that weakly supervised deep learning is a promising approach to dissect pathoanatomic features associated with cognitive deficits in neurohistologic data sets in an unbiased manner.