Interpretable deep learning of myelin histopathology in age-related cognitive impairment

Age-related cognitive impairment is multifactorial, with numerous underlying and frequently co-morbid pathological correlates. Amyloid beta (Aβ) plays a major role in Alzheimer’s type age-related cognitive impairment, in addition to other etiopathologies such as Aβ-independent hyperphosphorylated tau, cerebrovascular disease, and myelin damage, which also warrant further investigation. Classical methods, even in the setting of the gold standard of postmortem brain assessment, involve semi-quantitative ordinal staging systems that often correlate poorly with clinical outcomes, due to imperfect cognitive measurements and preconceived notions regarding the neuropathologic features that should be chosen for study. Improved approaches are needed to identify histopathological changes correlated with cognition in an unbiased way. We used a weakly supervised multiple instance learning algorithm on whole slide images of human brain autopsy tissue sections from a group of elderly donors to predict the presence or absence of cognitive impairment (n = 367 with cognitive impairment, n = 349 without). Attention analysis allowed us to pinpoint the underlying subregional architecture and cellular features that the models used for the prediction in both brain regions studied, the medial temporal lobe and frontal cortex. Despite noisy labels of cognition, our trained models were able to predict the presence of cognitive impairment with a modest accuracy that was significantly greater than chance. Attention-based interpretation studies of the features most associated with cognitive impairment in the top performing models suggest that they identified myelin pallor in the white matter. Our results demonstrate a scalable platform with interpretable deep learning to identify unexpected aspects of pathology in cognitive impairment that can be translated to the study of other neurobiological disorders. 
Supplementary Information The online version contains supplementary material available at 10.1186/s40478-022-01425-5.


Introduction
Cognitive impairment is not an invariable part of the aging process and unimpaired cognition is a core feature of most criteria of successful aging [1]. While Alzheimer's disease (AD) type amyloid-beta peptide (Aβ) deposition in senile plaques may play a role in age-related cognitive impairment, it is clear that removing or ameliorating Aβ alone will not alleviate all cognitive impairment in aging [2]. The neuropathologic correlates of cognitive impairment are multifactorial, with mixed pathologies accounting for the majority of cases in community samples [3,4]. Data suggest that multiple forms of brain pathology can each be uniquely associated with risk of age-related cognitive impairment, including cerebrovascular disease, neuritic plaques and neurofibrillary tangles, Lewy body disease, TDP-43 pathology, and hippocampal sclerosis [5,6]. To pave the way towards better prevention and treatment options for age-related cognitive impairment, there is an urgent need to identify the structural features of brain microanatomy that are robustly associated with the condition using unbiased assessment protocols [7,8]. One approach to identifying structural correlates of cognitive impairment is to perform clinicopathologic correlation in postmortem human brains.
† Andrew T. McKenzie and Gabriel Marx have contributed equally to this work. *Correspondence: Kurt.farrell@mssm.edu; john.crary@mountsinai.org
Recent advances in digital pathology, namely whole slide image (WSI) scanning and analysis, provide an opportunity to address the question of clinicopathologic correlation in a way that is less biased towards established paradigms [9]. Studies have begun to apply computational analysis of WSI data using deep learning to answer neuropathologic questions [10][11][12]. However, the use of deep learning in neuropathology has often been limited by the need for intensive manual annotations. Moreover, deep learning analysis in neuropathology has often used supervised learning to study an existing domain of structural features, rather than the discovery of potentially unexpected features.
Weakly supervised deep learning offers a clear path towards WSI analysis in neuropathology with less bias and without the need for laborious manual annotations. In weakly supervised learning, the deep learning algorithm attempts to classify the WSI on the basis of a single slide-level diagnosis or label, rather than pixel-level inputs [13]. Weakly supervised learning approaches using multiple instance learning have had remarkable success thus far in digital pathology, especially in oncology [13,14]. However, unlike cancer pathology, where a gold standard diagnosis can be ascertained, the neuropathologic etiologies of cognitive impairment are poorly understood, graded rather than categorical, overlapping, and dynamically interacting [15][16][17]. Moreover, clinical measures of cognitive function available for clinicopathologic correlation in neuropathology are frequently imprecise, non-standardized, ephemeral, and collected at distant time points prior to death [18][19][20]. As a result, the use of weakly supervised learning to correlate age-related cognitive impairment with neuropathologic features using WSI data will be dependent on noisy labels of cognition.
To the best of our knowledge, no study has yet reported a weakly supervised deep learning approach on brain tissue WSI data to identify features associated with age-related cognitive impairment. It is uncertain to what degree deep learning models will be able to identify robust features for predicting whether an autopsy brain donor had antemortem cognitive impairment in the setting of noisy labels. In this study, we used WSI data stained with Luxol fast blue (LFB), hematoxylin, and eosin (LH&E), from the hippocampus and frontal cortex in a previously described cohort of elderly individuals with a spectrum of age-related pathologies [21][22][23][24][25]. We leveraged a published weakly supervised deep learning algorithm, clustering-constrained-attention multiple instance learning (CLAM) [14], on this histopathologic data to identify pathoanatomic features that are associated with cognition. Our approach re-purposes the classification procedure as a method for inferring pathoanatomical group differences between those found to have any aspect of cognitive impairment and those who were not. We explored the association between the deep learning model predictions on neuropathologic data and antemortem evidence of cognitive impairment. To interpret these results, we dissected the deep learning model's attention weights using additional machine vision techniques. Our study shows that weakly supervised deep histopathology is a promising platform to perform clinicopathologic correlation in neuropathology.

Description of the overall cohort and subset analyzed in this study
Our study used digital WSIs of stained formalin-fixed paraffin embedded (FFPE) tissue from the frontal cortex and hippocampus of a subset of individuals from a previously described collection [21][22][23]. The cohort is a convenience sample derived from our ongoing studies of brain aging, which was collected by eliciting samples from multiple institutions. Extensive neuropathological assessments were completed at the contributing institutions using standardized criteria. This assessment included CERAD neuritic plaque severity score and Braak stage [26]. This cohort contains individuals with varying degrees of primary age-related tauopathy (PART) pathologic change, including PART possible (mild amyloid plaques) and PART definite (amyloid plaque negative), among other age-related changes [27]. Neuropathological exclusion criteria consisted of other neurodegenerative diseases including Lewy body disease, progressive supranuclear palsy (PSP), corticobasal degeneration (CBD), chronic traumatic encephalopathy (CTE), Pick disease, Guam amyotrophic lateral sclerosis-parkinsonism-dementia, subacute sclerosing panencephalitis, globular glial tauopathy, and hippocampal sclerosis. There are also individuals that do not meet the neuropathologic criteria for PART (e.g., two cases with moderate amyloid), and therefore it should be considered an aging-related cognitive impairment cohort. Cerebrovascular pathology was defined in an inclusive manner based on clinical or gross pathoanatomic evidence of vascular disease in the brain in the provided records. In this cohort, ARTAG positivity or absence was assessed on matched phosphorylated tau immunohistochemical stains as previously described [22]. Inclusion criteria in the subset of this cohort analyzed in this paper were individuals who had antemortem clinical evidence of either normal cognition or cognitive impairment, while those without such data were excluded.
This led to a data set with WSI and matched pathoclinical data from a total of n = 716 donors (Table 1). For the definition of cognitive impairment, we used a hierarchical method based on the three metrics in the available clinical data to identify any evidence of cognitive impairment. First, if available, a clinical dementia rating (CDR) score ≥ 0.5 was used as the primary measure of cognitive impairment; if CDR was not available, then the presence of any clinical diagnosis suggestive of cognitive impairment was used as the secondary measure; and finally, if the first two more global metrics were not available, then a Mini-Mental State Examination (MMSE) score ≤ 24 was used as a measure of cognitive impairment [28]. To maximize the sample size available for the study, cognitive data was included even if the time of assessment relative to death was unknown. Brain donors with any evidence of cognitive impairment were considered a part of the cognitively impaired (CI) group, while donors with negative data in all the available categories were included in the non-cognitively impaired (NCI) group. CDR scores with dementia severity score greater than 3 were converted to a maximum of 3 for consistency across centers.
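The hierarchical labeling rule described above can be illustrated with a short sketch. This is not the authors' code: the function and argument names are hypothetical, and the sketch assumes a strict reading of the hierarchy in which a lower-priority metric is consulted only when the higher-priority one is missing. An "any evidence" reading that pools all three metrics would label some donors differently.

```python
from typing import Optional

def cap_cdr(cdr: float) -> float:
    """Cap CDR at 3, mirroring the harmonization across centers described in the text."""
    return min(float(cdr), 3.0)

def cognitive_label(cdr: Optional[float],
                    diagnosis_impaired: Optional[bool],
                    mmse: Optional[int]) -> Optional[str]:
    """Return 'CI', 'NCI', or None (no usable cognitive data; donor excluded).

    Assumed strict hierarchy: CDR first, then clinical diagnosis, then MMSE.
    """
    if cdr is not None:
        return "CI" if cap_cdr(cdr) >= 0.5 else "NCI"
    if diagnosis_impaired is not None:
        return "CI" if diagnosis_impaired else "NCI"
    if mmse is not None:
        return "CI" if mmse <= 24 else "NCI"
    return None
```

For example, a donor with no CDR and no diagnosis but an MMSE of 24 would be labeled CI under this sketch, while a donor with none of the three metrics would be excluded.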

Slide preparation
For the hippocampus, the WSIs represented the entire hippocampal tissue block, which variably included adjacent structures (e.g., parahippocampal gyrus, temporal horn of the lateral ventricle). Luxol fast blue, hematoxylin, and eosin (LH&E) stains were performed on 4-µm-thick FFPE sections as previously described [23]. Sections mounted on positively charged slides were dried overnight. For each batch of slides stained, a known severe AD case was included as a positive staining control. WSI were scanned using an Aperio CS2 (Leica Biosystems, Buffalo Grove, IL) digital slide scanner at 20 × magnification (0.5 microns per pixel).

Weakly supervised learning pipeline
We used Python (v. 3.7.7), PyTorch (v. 1.3.1), and CLAM to perform deep learning on WSIs [14]. Models were trained using 4 NVIDIA V100 GPUs available on Minerva, a high-performance computing cluster at the Icahn School of Medicine at Mount Sinai. LH&E WSIs were segmented into non-overlapping tiles of 256 × 256 pixels using the default automated segmentation settings in CLAM. After segmentation, there was a median of 16,310 tiles per WSI (minimum of 2546, maximum of 32,036) in the hippocampus data set and 19,878 tiles per WSI (minimum of 2886, maximum of 31,535) in the frontal cortex data set. All the tissue in the WSI was included in the segmentation and downstream analysis. For example, for WSIs generated from blocks with two tissue sections on the slide, both tissue sections were automatically segmented and used in downstream analyses.
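To make the tiling step concrete, the following minimal sketch enumerates the top-left coordinates of non-overlapping 256 × 256 tiles over a rectangular region. It is only an illustration under simplifying assumptions: the function name is hypothetical, partial edge tiles are dropped, and the tissue segmentation that CLAM performs before tiling is not reproduced.

```python
def tile_origins(width, height, tile=256):
    """Top-left (x, y) pixel coordinates of non-overlapping full tiles.

    Assumption: tiles that would extend past the image edge are discarded.
    """
    return [(x, y)
            for y in range(0, height - tile + 1, tile)
            for x in range(0, width - tile + 1, tile)]
```

A 1024 × 512 pixel region yields 8 tiles under this scheme; in practice only tiles overlapping segmented tissue would be kept.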
To perform feature extraction, for each tile, the first three blocks of a ResNet50 model pre-trained on ImageNet were used to convert each 256 × 256-pixel tile into a 1024-dimensional feature vector. Training in CLAM uses attention-based pooling to leverage tile-level feature [...]

Table 1 Description of cohort subset and whole slide image dataset used in this study. This table describes the pathoclinical characteristics of the subset of brain donors employed in this study. The significance of differences in categorical variables between the non-cognitively impaired and cognitively impaired groups was assessed with a two-proportions z-test, while the significance of differences in numerical variables was assessed with a t-test. WSI = Whole slide image; SEM = Standard error of the mean; ARTAG = Aging-Related Tau Astrogliopathy; CERAD = Consortium to Establish a Registry for Alzheimer's Disease
Columns: Category; Non-cognitively impaired [...]

We used R (v. 4.0.1) and ggplot2 (v. 3.3.5) to perform downstream analysis and visualization of results from the weakly supervised deep learning analysis. To evaluate the performance of the deep learning algorithm, we compared the performance of all 10 independently trained models to chance (i.e., an area under the receiver operating characteristic (ROC) curve, or AUC, of 0.5) using one-sample Wilcoxon signed rank tests with continuity correction and plotted the average ROC curves using vertical averaging and linear interpolation [29]. For the analysis from each of the two brain regions studied, we used the best-performing model, as measured by the arithmetic mean of the area under the curve and the balanced accuracy on the test set, for subsequent analyses. To perform differential rank correlation analysis between groups, we used the DGCA package [30], with 10,000 permutations of the data used to generate empirical p-values.
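The model-selection criterion described above (arithmetic mean of AUC and balanced accuracy on the test set) can be sketched in a few lines. The analysis in the paper was run in R; this Python version is only an illustration, and all function names are hypothetical. AUC is computed here through its rank interpretation, the probability that a randomly chosen positive slide scores higher than a randomly chosen negative one.

```python
def auc(labels, scores):
    """AUC as P(positive score > negative score); ties count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def balanced_accuracy(labels, preds):
    """Mean of sensitivity and specificity for binary labels coded 0/1."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    tn = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 0)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)

def best_fold(fold_metrics):
    """Index of the fold with the highest mean of (AUC, balanced accuracy)."""
    return max(range(len(fold_metrics)),
               key=lambda i: sum(fold_metrics[i]) / 2)
```

With fold metrics such as `[(0.60, 0.55), (0.70, 0.50), (0.65, 0.62)]`, the third fold is selected because its mean of the two metrics is highest.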

Attention interpretation analysis
For each WSI, we used CLAM to perform tissue-level attention analysis of the top performing trained models. In these heatmaps, the red colors represent regions assigned relatively higher attention by the model and blue colors represent regions assigned relatively lower attention, normalized to the attention values in the rest of the slide.
To evaluate the macrostructural features most associated with cognitive impairment, we used V7 to annotate the macroscopic tissue types in a randomly chosen subset of WSIs from both the hippocampus and frontal cortex. One trained researcher (M.S.) created the annotations, and an expert neuropathologist (J.F.C.) reviewed them to ensure accuracy. The V7 annotations were converted to the same tile-level space as the tile-level attention score output from CLAM. We z-transformed the attention scores and we then calculated the median tile-level attention score for each tissue region within each slide. We compared the median attention scores across tissue types with paired t-tests.
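The region-level summary described above can be sketched as follows. The sketch assumes that the z-transformation is applied within each slide and uses the population standard deviation; neither detail is stated in the text, and the function name and region labels are hypothetical.

```python
from collections import defaultdict
from statistics import mean, median, pstdev

def region_median_attention(scores, regions):
    """Median z-scored attention per annotated tissue region for one slide.

    scores:  tile-level attention scores for the slide
    regions: tissue-type label for each tile, in the same order
    Assumption: z-scores are computed within the slide.
    """
    mu, sd = mean(scores), pstdev(scores)
    by_region = defaultdict(list)
    for score, region in zip(scores, regions):
        by_region[region].append((score - mu) / sd)
    return {region: median(z) for region, z in by_region.items()}
```

The per-slide medians for white and grey matter obtained this way are the paired quantities that a paired t-test would then compare across slides.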
To evaluate the microstructural features most associated with cognitive impairment, we examined the 100 tiles with the highest attention scores from each WSI. To quantify the amount of dark blue staining in LH&E-stained tiles, we used the positive pixel counting function in the Python package HistomicsTK (v 0.1.0) [31]. This function converts RGB color space images to HSI (hue, saturation, intensity) color space and calculates the number of pixels in a user-defined hue range. The parameters used to count the positive pixels were chosen by manually identifying the appropriate dark blue hue range in HSI space (Additional file 1: Fig. S1). As a normalization measure, we also measured the ratio of the dark blue to light blue color stain in each tile. Outlier tiles with zero positive pixels were removed from further analysis. The same positive pixel counting analysis pipeline was applied to each of the top 100 attention tiles identified from each WSI in the data sets. To minimize the impact of outliers, the median of the results was found for each slide. The slide-level median values between different groups were then compared with t-tests. The multivariate combination of the total dark blue pixel counts and the ratio of dark to light blue pixel counts between groups predicted to be cognitively impaired or not were compared with two-dimensional kernel density estimation using the MASS R package (v. 7.3-51.6) and visualized with contour lines.
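The idea behind hue-range pixel counting can be illustrated without HistomicsTK. The sketch below is not the HistomicsTK function: it substitutes the standard-library HSV conversion for the HSI conversion named in the text, and the hue bounds and saturation floor are hypothetical placeholders rather than the parameters used in the study.

```python
import colorsys
from statistics import median

def count_hue_pixels(pixels, hue_min, hue_max, min_saturation=0.2):
    """Count RGB pixels whose hue (0-1 scale) lies in [hue_min, hue_max].

    Assumptions: HSV stands in for HSI; min_saturation is a hypothetical
    floor that keeps near-white background out of the count.
    """
    count = 0
    for r, g, b in pixels:
        h, s, _ = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
        if hue_min <= h <= hue_max and s >= min_saturation:
            count += 1
    return count

def slide_median(tile_counts):
    """Slide-level median over top tiles, dropping tiles with zero positive pixels."""
    nonzero = [c for c in tile_counts if c > 0]
    return median(nonzero) if nonzero else None
```

Each top-attention tile would be passed through `count_hue_pixels`, and the resulting per-tile counts summarized with `slide_median` before group comparison.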

Association of deep learning predictions with pathoclinical traits
We used rank correlation analysis to compare slide-level probability estimates of cognitive impairment and slide-level averages of dark blue color density with other pathoclinical traits in the data set, namely age, cerebrovascular pathology, hippocampal aging-related tau astrogliopathy (ARTAG) positivity, and Braak score provided by the brain bank of origin. To further dissect the relationships between age, clinical labels of the presence or absence of cognitive impairment, and the slide-level median values of dark blue pixel counts, we used asymptotic chi-square conditional independence tests from the R package bnlearn (v. 4.6.1) [32].
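As an illustration of the rank correlation used here, the following sketch computes Spearman's ρ as the Pearson correlation of average ranks, which handles ties such as those in binary traits coded 0/1. The paper's analysis was run in R; this Python version and its function names are hypothetical and do not compute p-values.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    return cov / (vx * vy) ** 0.5
```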

Code availability
We used the publicly available software tool CLAM [14] to perform deep learning on WSIs and the publicly available software tool HistomicsTK [31] to perform positive pixel counting of the top attention tiles. Scripts used to perform key custom parts of the downstream data analysis are available at the following URL: https://github.com/andymckenzie/deep_histopathology_manuscript.

Prediction of cognitive impairment using weakly supervised deep learning
We re-purposed a weakly supervised deep learning algorithm previously used for classification in the setting of a known gold standard label as a method for inference of pathophysiology in the setting of noisy cognitive labels (Fig. 1) [14]. We ran this analysis pipeline on an existing collection of WSIs and trained the model to classify brain tissue sections as coming from the subset of individuals with evidence of antemortem cognitive impairment or not (Table 1). We trained two sets of models across 10 folds of cross-validation, one for the hippocampus and the other for the frontal cortex. In the hippocampus, across the held-out 10% test subsets of each of the 10 folds of cross-validation, we found a mean AUC of 0.63 (one-sample Wilcoxon signed rank test p-value = 0.006; Fig. 2b, c) and a mean balanced accuracy of 0.59 (p = 0.013, Fig. 2c). In the frontal cortex, we found a mean AUC of 0.67 (p = 0.002, Fig. 2b, c) and a mean balanced accuracy of 0.58 (p = 0.009; Fig. 2b). While the models have modest accuracy as a pure classification task, in both brain regions the classification accuracy was significantly greater than chance, suggesting that the models have utility for the inference of pathophysiology. We next evaluated slide-level predictions of the probability of cognitive impairment using the highest performing models in each brain region, parsing out the sub-components of the cognitive impairment classifications (Fig. 2d, e). In the hippocampus data, we found that the probability of cognitive impairment estimate was significantly associated with the diagnostic category (57% in the CI group vs 33% in the NCI group, t-test p-value < 2.2e−16), MMSE (ρ = − 0.32, p = 1.1e−7), and CDR (ρ = 0.5, p < 2.2e−16). In the frontal cortex data, we found that the probability of cognitive impairment estimate was also significantly associated with the diagnostic category (49% in the CI group vs 39% in the NCI group, p = 9.8e−11), MMSE (ρ = − 0.30, p = 1.2e−4), and CDR (ρ = 0.52, p = 3.9e−12). While these strong associations with the cognitive labels are expected because they are what the models were trained on, they show that the model has not overly anchored on any one of the three cognitive labels employed. The correlation of the probability estimates of the models from the hippocampus and frontal cortex was highly significant and of moderate strength (ρ = 0.41, p = 1.5e−14; Additional file 1: Fig. S2), suggesting that the models trained on the two different brain regions are identifying partially independent signals for cognitive impairment.

Fig. 1 Workflow for performing weakly supervised deep learning of age-related cognitive impairment. a: Generation of digital neuropathology whole slide images (WSI) with associated cognitive labels. Human brain sections were stained with Luxol fast blue (LFB) and counterstained with hematoxylin & eosin (LH&E). Cognitive labels were generated based on clinical diagnosis, clinical dementia rating (CDR) scores, and/or mini-mental state exam (MMSE) scores. b: WSI were segmented into tiles and passed through a convolutional neural network for feature extraction. The resulting tile-level feature vectors were passed through an attention network. Each feature vector was multiplied by its associated attention score and a weighted summation operation was performed to create slide-level feature vectors. The slide-level feature vectors were then passed through a classification network. The attention and classification networks were trained via backpropagation. c: For interpretation analysis, attention heatmaps were created by mapping the attention scores at their associated tile locations in the original WSI. Among the top attention tiles, a dark blue hue range associated with LFB staining was counted and quantified to calculate a slide-level median staining intensity value.
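The attention-weighted pooling step described in the Fig. 1 caption can be sketched in plain Python. This is a conceptual illustration rather than CLAM's implementation: the function name is hypothetical, the raw attention scores are taken as given instead of being produced by a trained attention network, and the softmax normalization across tiles is an assumption of this sketch.

```python
import math

def attention_pool(features, raw_scores):
    """Combine tile-level feature vectors into one slide-level vector.

    features:   list of equal-length tile feature vectors
    raw_scores: one unnormalized attention score per tile
    Assumption: scores are softmax-normalized across tiles before weighting.
    """
    top = max(raw_scores)
    exps = [math.exp(s - top) for s in raw_scores]  # shift for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(features[0])
    pooled = [sum(w * f[d] for w, f in zip(weights, features))
              for d in range(dim)]
    return pooled, weights
```

The returned weights are the per-tile quantities that an attention heatmap would display, while the pooled vector is what a slide-level classifier would consume.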
To explore the reasons for the imperfect classification accuracy we identified, we examined the correlation of the probability estimates of cognitive impairment with age across groups (Fig. 3a, b). In the hippocampus, there was a significant correlation between age and the estimated probability of cognitive impairment in the non-cognitively impaired group (ρ = 0.37, p = 1.2e−12), a weaker but still significant correlation in the cognitively impaired group (ρ = 0.18, p = 9.0e−4), and a significant difference in correlation (z-score for difference = − 2.6; empirical p-value = 0.014). In the frontal cortex, there was a significant correlation between age and the estimated probability of cognitive impairment in the non-cognitively impaired group (ρ = 0.45, p = 1.1e−9), no significant correlation in the cognitively impaired group (ρ = − 0.12, p = 0.11), and a significant difference in correlation (z-score = − 5.3; empirical p-value = 1e−4). Longitudinal biomarker imaging data has shown that there is a substantial time lag between the development of AD pathophysiology in the brain and the emergence of cognitive impairment, which may be mediated by differences in cognitive reserve [33]. These differential correlation results with age suggest that there may be a differential association between aging and the delay between development of brain pathology and phenotypic expression of that pathology as cognitive impairment. Another possible reason for these observed differential correlations may be mislabeling of brain donors with more advanced age who did have cognitive impairment as not cognitively impaired.
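A z-score for the difference between two correlations is commonly obtained from the Fisher z-transformation. The sketch below shows that textbook calculation only; it is not the DGCA implementation, it omits the permutation step used for the empirical p-values, and the function name and any sample sizes passed to it are hypothetical.

```python
import math

def diff_corr_z(r1, n1, r2, n2):
    """z-score for the difference r1 - r2 between two independent correlations.

    Uses the Fisher transformation atanh(r) with variance 1 / (n - 3).
    Assumption: the same variance formula is applied to rank correlations.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return (z1 - z2) / se
```

A negative value indicates that the first group's correlation is lower than the second's, which matches the sign convention of the z-scores reported above when the cognitively impaired group is passed first.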

Attention-based interpretation identifies an association of white matter pathology with cognitive impairment
To explore the underlying anatomical features used as evidence by the deep learning algorithm, we performed attention-based interpretation analysis using the highest performing models from each brain region. In the hippocampus, on a macro-anatomic scale, the model was found to have qualitatively higher attention in white matter regions as opposed to grey matter (Fig. 4a, b). On a microanatomic scale, the hippocampal top attention tiles from the cases labeled with cognitive impairment were qualitatively found to have a lower level of LFB staining intensity (Fig. 4c). To quantify sub-regional differences of the attention signal in the hippocampus, we manually annotated tissue types in a randomly chosen subset of WSIs and used these annotations to measure the region-specific attention scores produced by the model. Quantitative attention scores were found to be significantly higher in the white matter (average attention z-score = 0.62) than in the grey matter (average attention z-score = − 0.41; paired t-test for difference p-value = 9.4e−8). The same trend of higher attention scores in the white matter was found across cognitive status labels (Fig. 4d), suggesting that this result is not due to confounding by cognitive impairment label but instead to properties of the models.
To quantify the microanatomic scale results from the hippocampus, we used positive pixel counting on the 100 tiles with the highest attention scores as measured by the model, henceforth called the "top tiles." LFB stains CNS myelin sheaths dark blue [24] and our chosen pixel range was designed to capture this LFB staining intensity (Additional file 1: Fig. S1). In the hippocampus, the WSIs predicted to be from brain donors with cognitive impairment had a significantly lower LFB staining intensity in the top tiles (t-test difference p-value = 7.6e−7, Fig. 4e). To normalize for possible variation in staining intensity across slides, we measured the ratio of dark blue staining to light blue staining in the top 100 attention tiles. We found that there was a significantly lower ratio of dark blue staining to light blue staining (t-test p = 4.1e−5; Fig. 4f). The LFB staining intensity and the ratio of dark blue to light blue staining intensity in the top tiles are correlated (ρ = 0.14, p = 2.8e−4) and jointly distinguish donors predicted by the model to be cognitively impaired or not (Fig. 4g).
We next performed the same analysis in the frontal cortex data set, where the results largely echoed those of the hippocampus, with generally stronger effect sizes. Qualitatively, the frontal cortex model was also found to have higher attention in white matter regions (Fig. 5a, b) and a lower level of LFB staining in the top tiles (Fig. 5c). Quantitatively, attention scores were found to be significantly higher in the white matter (average attention z-score = 1.03) than in the grey matter (median attention z-score = − 0.85, t-test p-value < 2.2e−16; Fig. 5d). The group labeled as cognitively impaired had a significantly lower LFB intensity in the top tiles (t-test p = 7.3e−6, Fig. 5e) and there was a significantly lower ratio of dark blue staining to light blue staining (t-test p < 2.2e−16, Fig. 5f). As with the hippocampus data set, these two measures are correlated and jointly distinguish between brain donors with and without labels of cognitive impairment (Fig. 5g).

Deep histopathological findings are partially independent of several known pathoanatomic features
We compared the deep learning model results with previously established clinicopathologic features, namely age, Braak stage, cerebrovascular pathology, and hippocampal ARTAG. This association analysis was focused on the hippocampal data set because it has a substantially higher sample size and is therefore better powered to detect correlations (Fig. 6a). We found that there was a significant rank correlation of the model's cognitive impairment probability estimates with age (ρ = 0.32, p < 2.2e−16), Braak stage (ρ = 0.13, p = 6.2e−4), ARTAG positivity (ρ = 0.15, p = 1.5e−4), and cerebrovascular pathology (ρ = 0.29, p = 9.0e−5). We also found that there was a significant association of LFB staining intensity in the top attention tiles with age (ρ = − 0.18, p = 1.0e−6) and cerebrovascular pathology (ρ = − 0.24, p = 0.0014), but not with Braak stage (ρ = − 0.05, p = 0.18) or with the presence of ARTAG pathology (ρ = − 0.05, p = 0.24). This result suggests that the deep learning algorithm has identified a signal for cognitive impairment that is associated with some aspects of known pathophysiology. We further dissected the association between age, LFB staining intensity in the top tiles, and cognitive impairment labels in the hippocampal data set with conditional independence tests. We found that the label of cognitive impairment was not conditionally independent of age when accounting for LFB staining intensity in the top tiles (mutual information = 42.3, p = 3.4e−9 by asymptotic chi-square test). Additionally, the label of cognitive impairment was not conditionally independent of LFB staining intensity in the top tiles when accounting for age (mutual information = 21.8, p = 7.1e−5). These results suggest that while these three variables are all significantly associated with one another, chronological age does not fully explain the association of LFB staining intensity in the top tiles with cognitive impairment, nor vice versa.
We next performed correlation analysis on the frontal cortex data set (Additional file 1: Fig. S3), omitting cerebrovascular pathology as a variable because the intersected sample size was too low for reliable estimates in the frontal cortex data set. While the results between brain regions were predominantly similar, one difference is that there was not a significant correlation identified between the model's cognitive impairment probability estimates and Braak stage in the frontal cortex, although it trended towards significance (ρ = 0.11, p = 0.06). Because the hippocampus has a larger sample size than the frontal cortex, it is better powered to detect a significant correlation between Braak stage and probability of cognitive impairment. To address the possibility that this difference in sample size affected any differences in correlation between the regions, we filtered the sample to select only those cases containing data from both the hippocampus and frontal cortex and tested for a differential correlation. In this subset of the data, we found a higher rank correlation between Braak stage and the probability of cognitive impairment derived from the hippocampus (ρ = 0.29, p = 7.4e−8) than in the frontal cortex (ρ = 0.11, p = 0.06), which was a significant difference in correlation (z-score for difference = − 2.4, empirical p-value = 0.02; Fig. 6b). In order to query the robustness of this result, we employed data on positive pixel counts for AT8 staining in the medial temporal lobe (MTL), a measure of tau burden that has been previously described in this cohort [22]. 
We found that there was a significant rank correlation between AT8 staining burden in the MTL and the probability of cognitive impairment derived from the hippocampus (ρ = 0.37, p = 9.2e−12), a weaker but still significant correlation with the probability of cognitive impairment derived from the frontal cortex (ρ = 0.12, p = 0.029), and that there was a significantly higher correlation between these two measures in the hippocampus (z-score for difference = 3.2, empirical p-value = 0.001; Fig. 6c). One way to interpret these findings is that the contributions of different types of histopathology to the deep learning-derived predicted probability of cognitive impairment may differ by brain region.

Discussion
In this study, we used deep learning models to identify a reduction in LFB staining intensity in the top attention tiles from brain sections of donors with antemortem evidence of cognitive impairment. Because LFB staining in brain tissue is generally used to quantify the amount of myelin [24,25], the signal that we identified is likely due to decreased myelin staining intensity. Our results cannot distinguish decreased myelin density with spared axons from axon injury with associated myelin loss. In many cases, diminished myelin density in aging is associated with cerebrovascular disease [34]. This is consistent with the strong correlations we identified in this study between cerebrovascular pathology, the predicted probability of cognitive impairment, and decreased LFB staining in the top attention tiles. Even when accounting for age, there was still an association between decreased LFB staining in the top attention tiles and cognitive impairment. Treating cerebrovascular disease risk factors such as hypertension has been found to decrease white matter pathology and partially reverse age-related cognitive impairment [34]. However, age-associated decreases […]. The extent to which the decreased myelin density we found to be associated with age-related cognitive impairment is explained solely by cerebrovascular pathology, as opposed to these other possible etiologies, is unclear and warrants further investigation.

Fig. 6 Deep histopathology features are partially associated with several known clinicopathologic features and partially independent. a Correlation analysis of deep histopathology results and clinicopathologic features: age, Braak score, evidence of cerebrovascular pathology (coded as 0 = not present and 1 = present), ARTAG positivity in the hippocampus (coded as 0 = not present and 1 = present), cognitive label (coded as 0 = not cognitively impaired and 1 = cognitively impaired), probability of cognitive impairment as predicted by the top-performing model trained on the hippocampal data, and median LFB staining intensity in the top attention tiles in the hippocampus data set. Upper right: rank correlation values and associated p-values (*p < 0.05, **p < 0.01, ***p < 0.001). Diagonal: histograms of variables. Lower left: scatterplots with linear model trend lines for the variable pairs (red lines) and 95% confidence intervals (blue envelopes). This plot was made using the R package GGally (v. …)

While postmortem brain gene expression and pathoanatomical studies in aging and AD have often focused on grey matter, neuroimaging studies over the past several decades have frequently found alterations in the white matter to be strongly associated with cognitive impairment [42]. Leukoaraiosis (leuko-, white; araiosis, rarefaction) is a common neuroimaging abnormality of the white matter that can be found in periventricular or subcortical areas [43,44]. On T2-weighted and FLAIR MRI, leukoaraiosis is frequently described as white matter hyperintensities [45]. While leukoaraiosis is strongly associated with cerebrovascular disease, the precise etiology remains unclear [44,45]. Clinically, leukoaraiosis is associated with cognitive deficits such as bradyphrenia [34]. Histologically, leukoaraiosis has been suggested to be associated with decreased density of myelin sheaths [46]. Our deep learning models identified a neurohistologic signal for cognitive impairment that was (a) focused in the white matter, (b) in some cases scattered in a nonuniform pattern across the tissue, and (c) associated with decreased myelin staining intensity.
Although our data set lacks the associated in vivo neuroimaging data needed to draw conclusive statements, one clear possibility is that the white matter histologic alterations identified by the deep learning models reflect an etiopathology similar to that of the neuroimaging finding of leukoaraiosis. We propose that diminished LFB staining intensity in particular areas identified by a deep learning model may be a quantitative way to assess for the presence of leukoaraiosis-associated neuropathology in postmortem brains.
It is important to consider the limitations of this study. First, the data set employed here (n = 716) is smaller than those in previously published weakly supervised learning studies in oncology, which often have n > 1000 [13,14]. The absence of significant Aβ burden in this cohort also limits how well it represents the population at large. This adds to the numerous selection biases in brain donation-based autopsy cohorts in general [47]. Second, the WSI data set analyzed contains only one stain, the LH&E stain. While LFB staining is ideal for detecting myelin, it is possible that it highlighted the white matter to a disproportionate degree that affected the deep learning algorithm results. Third, we were unable to assess for comorbid TDP-43 pathology, which would have allowed us to screen for limbic-predominant age-related TDP-43 encephalopathy (LATE), a common TDP-43 proteinopathy associated with an amnestic dementia in elderly individuals [48]. Additionally, because we only looked at two brain regions, our anatomical sampling is limited, which is a concern because cognitive impairment is determined by accumulated lesion burden across the brain.
Another potential limitation is the possibility of systematic variation in staining properties across WSIs. We did not perform WSI color stain normalization because the slides were all stained uniformly at the same center and with the same platform, minimizing systematic heterogeneity. Furthermore, as a tile-level normalization measure, we calculated the ratio of the blue color intensity in each tile, which yielded the same general result as when we only used the dark blue pixel range. It is still possible that a fixation or staining artifact may have affected the cognitive impairment probability estimates and/or attention signals. For example, deeper areas of the brain often had qualitatively higher attention signals. However, the attention signal appeared to follow anatomical compartments, such as the deep white matter, while sparing the subcortical U-fibers, regardless of the depth of these compartments. Our results are therefore less likely to be due to artifactual variation in staining intensity. Taken together, addressing the issue of stain normalization without introducing other biases is a complex topic, especially in WSIs stained with multiple types of stains such as in our data, and warrants further research [49].
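The two tile-level measures mentioned above, a blue intensity ratio and a count restricted to a dark blue pixel range, can be sketched as follows. This is a minimal illustration under stated assumptions: the exact definition of the ratio and the RGB bounds of the dark blue range are not given in the text, so the formula (blue channel divided by the summed RGB intensity) and the default bounds are hypothetical. Tiles are represented as plain lists of `(r, g, b)` tuples.

```python
def blue_ratio(pixels):
    """Share of total RGB intensity carried by the blue channel.

    Assumed definition: sum(B) / sum(R + G + B) over the tile. Dividing
    by total intensity makes the measure less sensitive to overall
    brightness differences between tiles.
    """
    total = sum(r + g + b for r, g, b in pixels)
    if total == 0:
        return 0.0
    return sum(b for _, _, b in pixels) / total


def dark_blue_fraction(pixels, lo=(0, 0, 100), hi=(100, 100, 255)):
    """Fraction of pixels inside a hypothetical dark blue RGB box.

    This is a simple form of positive pixel counting; the bounds `lo`
    and `hi` are placeholders, not values from the study.
    """
    if not pixels:
        return 0.0
    hits = sum(
        all(low <= c <= high for c, low, high in zip(px, lo, hi))
        for px in pixels
    )
    return hits / len(pixels)
```

For a tile made of one saturated blue pixel and one white pixel, `dark_blue_fraction` counts only the former, whereas `blue_ratio` reflects the blue share of both.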
One of the concerns with contemporary deep learning models is that the basis of their predictions is challenging to understand. In this study, we provided one measure of interpretability, by leveraging the intrinsic attention mechanism of the model to quantify the degree of myelin pallor in the top attention tiles. However, this interpretability measure does not explain all of the model's attention scores, nor does it fully explain the model's predictions of cognitive impairment. For example, it is unclear why some of the top attention tiles from brain donors predicted to have cognitive impairment, especially those in the model trained on the hippocampus, still appear to have intact myelination. This suggests that other structural or cellular features are playing a role. Additionally, the predicted probability of cognitive impairment is more strongly correlated with Braak stage in the model trained on the hippocampus than in the model trained on the frontal cortex, but the features that underlie this difference remain to be determined. As a result, much of the variance in the models' predictions remains opaque. Furthermore, our interpretability result was derived manually, by inspecting the results of the trained models, noting a qualitative difference, and then developing a metric to quantify this difference. If deep learning models are eventually going to be used safely and effectively in neuropathology research and clinical practice, then there is a critical need to make them understandable to humans. Improving the transparency of deep learning model predictions in a more automated way, for example by using more explainable architectures or distillation tools, is an essential research direction for the field [50].
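The interpretability measure described above, a stain metric summarized over the highest-attention tiles of a slide, can be sketched in a few lines. This is an illustration only: the number of top tiles `k`, the softmax normalization of raw attention scores, and the function names are assumptions rather than details from the study, and `tile_metric` stands for any per-tile staining measure.

```python
import math
from statistics import median


def softmax(scores):
    """Normalize raw attention scores so that they sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_attention_median(attn_scores, tile_metric, k=10):
    """Median of a per-tile metric over the k highest-attention tiles.

    Returns the median and the indices of the selected tiles, ordered
    from highest to lowest attention.
    """
    if len(attn_scores) != len(tile_metric):
        raise ValueError("one attention score is required per tile")
    weights = softmax(attn_scores)
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    top = ranked[:k]
    return median(tile_metric[i] for i in top), top
```

Because softmax is monotonic, the ranking is identical for raw and normalized scores; the normalization only makes attention weights comparable across slides with different numbers of tiles.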
Although there are some additional limitations to our study, we expect that our methodology lays the groundwork for further probing of the histopathology of age-related cognitive impairment in future studies that will be able to address these limitations. While our current slide-level predictive accuracy is modest, as our annotated WSI data sets grow, we expect that our trained models will improve in stability, discriminative power, and ability to pinpoint morphological features associated with cognitive impairment. Related to this, while some of the clinicopathologic correlations were statistically significant, several had weak correlation strengths. Larger and more richly annotated data sets will help to further parse out the practical implications of these correlations. Because the accuracy of our trained models is limited, this study should be regarded as a proof of concept for further studies, and certainly not as fully dispositive of the underlying pathophysiology of cognitive impairment or as applicable to clinical practice. Additionally, while we focused only on a robust yet general approach to assessing myelin, i.e. LH&E-stained tissue sections, future studies deploying additional modalities of assessing myelin injury, such as immunohistochemical staining for oligodendrocyte, axonal, vascular, inflammatory, and myelin markers, will help to further elucidate the pathogenesis of age-related cognitive impairment. Finally, because we only have WSI data available from two brain regions, we are limited in our ability to explain why the results from the two brain regions appeared to differ in some ways.
Our ability to interpret differences in deep learning-derived metrics across brain regions will improve with richer data sets containing WSIs from more brain regions. Compared with cancer pathology, which has heretofore been the main use case of weakly supervised deep learning in digital pathology, the study of the neuropathology of dementia places less emphasis on diagnosis and more on the inference of pathophysiology. This is in part because cancer can be more frequently associated with one causal type, whereas cognitive deficits in the brain are generally due to overlapping pathologies with complex patterns of comorbidity. The multifactorial nature of cognitive deficits lends itself well to multidimensional interpretation studies. First, it emphasizes the value of quantitative probability estimates of cognitive impairment instead of binary labels, which allow for more precise correlation analysis with other clinicopathologic features. Second, the relative focus on understanding pathophysiology in neuropathology also underscores the value of deterministic computer vision studies, such as positive pixel counting, as a downstream method for interrogating attention or other interpretability signals present in deep learning models. While the prediction capacity of deep learning models in digital pathology can be expected to continue to improve rapidly, our ability to understand what histopathologic features those models are focused on is lagging. Improving our suite of methods for the interpretation of deep learning models will allow us to best harness them and to understand how they may be flawed or biased. Because the study of the neuropathology of dementia remains driven by human ingenuity, more interpretable deep learning methods will be essential to accelerate the adoption of deep learning across the field.
Our results also suggest several future directions that would illuminate additional aspects of the histopathology of cognitive impairment. One possible analysis would be to combine tiles from the hippocampal and frontal cortex regions into one unified data set prior to training the model. This analysis would potentially show similarities and differences in the pathology present in the frontal cortex and hippocampus, as well as allow an assessment of the relative contributions of each to the models' predictions. Another future research direction would be to query the pixel-level features that are important in making the prediction, rather than tile-level summary statistics such as positive pixel counts. This will require approaches such as semantic segmentation that can parse a tile into distinct, pixel-labeled components. This would allow a dissection of which pixel-level features, such as vacuolization, nuclei shape, or fiber orientation, are important for making the cognitive impairment predictions. Finally, assessing the impact of preprocessing procedures, for example to determine the robustness of training deep learning models with different scanners, tile parcellation schemes, and color normalization methods, is a critical future research direction.
Predicting the presence or absence of cognitive impairment with the use of single histology sections on an individual level is an extremely challenging task. There are known barriers related to disease heterogeneity, variation in clinician practices, and cognitive reserve [8,51]. In this study, we employed a deep learning classification model for inference of pathophysiology from histology slides with noisy labels of cognitive impairment, resulting in predictions with modest accuracy but significantly above chance level. Interpretation studies suggested that top performing models in the hippocampus and frontal cortex focused on similar aspects of white matter pathology. On a macroanatomic level, they had higher attention on white matter than gray matter; on a microanatomic level, the highest attention tiles showed differences in LFB staining intensity between slides from brain donors predicted to have cognitive impairment and those predicted not to. Both the probability estimates of cognitive impairment and the measure of LFB staining intensity in the top attention tiles were partially independent of several known clinicopathologic features, suggesting that they may be identifying unexpected aspects of pathophysiology. On the other hand, the probability estimates of cognitive impairment were not completely explained by LFB intensity in the top attention tiles; for example, ARTAG positivity was significantly associated with the probability estimates of cognitive impairment from the deep learning models but not with LFB intensity in the top attention tiles. Our results demonstrate that weakly supervised deep learning is a promising approach to dissect pathoanatomic features associated with cognitive deficits in neurohistologic data sets in an unbiased manner.