Neuropathologic evaluation, immunohistochemistry and whole slide scanning
Forty-nine autopsy brain cases were retrieved from the UT Southwestern Neuropathology archives, including 16 cases diagnosed as AD, 13 cases as CBD, and 20 cases as PSP (Additional file 1: Table S1). Involvement of the frontal cortex and white matter is associated with advanced disease for all three tauopathies, and therefore the presence of tau aggregates in the frontal gray and white matter was the main inclusion criterion. All AD cases featured Braak neurofibrillary stages V and VI, and all CBD and PSP cases featured some degree of frontal cortical and white matter involvement. Since the major focus of this study was the tau pathology in the neocortex and subjacent white matter, CBD and PSP cases with coexistent low levels of Alzheimer type neuropathological change (Braak neurofibrillary stages I and II, where tau pathology is essentially restricted to the entorhinal cortex) were not excluded.
All sections were cut from formalin-fixed, paraffin-embedded tissue blocks at 5 micron thickness and stained with anti-AT8 antibody (Thermo Scientific MN1020, 1:200 dilution) using a Leica Bond III automated immunostaining platform. Monoclonal antibody AT8, first reported in 1992 by Mercken et al., was developed against phosphorylated tau isolated from human Alzheimer disease brain. We selected this anti-phospho-tau antibody because it has been used in laboratories worldwide for decades for the immunohistochemical detection of disease-associated phospho-tau in many different neurodegenerative conditions. In 2003, Arai et al. investigated the differences in immunoreactivity of 13 antibodies, with epitopes spanning the entire length of the tau molecule, against phospho-tau lesions in autopsy brain tissue from subjects with Alzheimer disease, Pick disease, progressive supranuclear palsy, and corticobasal degeneration. While antibodies directed at epitopes within the microtubule-binding domain of tau showed different levels of immunoreactivity among these tauopathies, antibodies to the middle region of tau, including AT8, showed similar immunoreactivity across all 4 tauopathies. Importantly, we also exposed tissue sections to concentrated formic acid before epitope retrieval and immunostaining, as this has been shown to maximize the specific immunoreactivity of tau lesions for use in quantitative image analysis applications.
AT8-stained slides were digitized using an Aperio ScanScope CS2 (Leica Biosystems, Buffalo Grove, IL) whole slide scanner at 20X magnification.
We note that only frontal regions were used for the analysis of cortical and white matter tau aggregation patterns. However, to provide additional data to improve the accuracy of our classifier that distinguishes cortex from white matter, we additionally incorporated hippocampal, temporal, parietal and occipital regions from each case. All DL models described below were trained on a single Tesla P40 or V100 graphics processing unit (GPU) using the open-source package TensorFlow.
Model training for region (cortex vs white matter) identification
For region identification (as opposed to disease/aggregate analysis), a total of 37 WSIs (16 AD, 11 PSP, 10 CBD) imaged at 20X magnification (0.5 microns per pixel) were used for training and testing. Among these WSIs, 21 were taken from the middle frontal gyrus, the focus of our study, while the remaining were taken from the hippocampus, superior temporal gyrus, lateral parietal cortex, and calcarine cortex. WSIs were first annotated by a trained neuropathologist, using the QuPath bioimage analysis software, to indicate cortex and white matter regions as well as background areas. From each annotated WSI, we generated an average of ≈ 12,000 image patches (128 × 128 microns = 256 × 256 pixels at 20× magnification), giving a total of ≈ 430,000 patches to be used as input for the DL model. WSIs (and correspondingly the patches derived from them) were split using threefold cross validation, with random assignment of slides to folds while ensuring that each fold contained the same proportion of slides from each disease.
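The slide-level split described above can be sketched as follows; the function name and the round-robin assignment scheme are illustrative assumptions, not the authors' code, but the round-robin step guarantees that each disease's slides are spread across folds in near-equal proportions.

```python
import random

def stratified_folds(slides, n_folds=3, seed=0):
    """Assign WSIs to cross-validation folds so that each fold holds
    roughly the same proportion of slides from each disease.
    `slides` is a list of (slide_id, disease) pairs."""
    rng = random.Random(seed)
    by_disease = {}
    for slide_id, disease in slides:
        by_disease.setdefault(disease, []).append(slide_id)
    folds = [[] for _ in range(n_folds)]
    for disease, ids in by_disease.items():
        rng.shuffle(ids)                         # random assignment within a disease
        for i, slide_id in enumerate(ids):
            folds[i % n_folds].append(slide_id)  # round-robin balances proportions
    return folds
```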
We used a fully convolutional network (FCN) with a custom architecture (Additional file 1: Table S2), which takes as input an image patch and predicts whether the center pixel of the patch was in the cortex, white matter or background. We trained a different model for each fold, giving us three models in all.
During training, images were transformed with the following augmentations from the Imgaug library (https://github.com/aleju/imgaug): random horizontal flip, random vertical flip, and random additions to hue and saturation. We used a categorical cross-entropy loss, with a different weight for each class based on its frequency in the data to account for class imbalance. We used an Adam optimizer with a learning rate of 0.0001. Model training was performed for 20 epochs, after which each model achieved an average classification accuracy of at least 92% on training data and 85% on testing data.
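The frequency-based class weighting can be made concrete with a minimal numpy sketch. The inverse-frequency scheme shown here is one common choice and an assumption on our part; the text states only that weights were derived from class frequencies.

```python
import numpy as np

def inverse_freq_weights(labels, n_classes):
    """Per-class weights inversely proportional to class frequency."""
    counts = np.bincount(labels, minlength=n_classes).astype(float)
    return counts.sum() / (n_classes * counts)

def weighted_cce(y_true, y_pred, class_weights, eps=1e-7):
    """Class-weighted categorical cross-entropy over one-hot targets.
    y_true, y_pred: arrays of shape (N, C); class_weights: shape (C,)."""
    per_sample = -np.sum(y_true * np.log(y_pred + eps), axis=-1)
    weights = y_true @ class_weights  # weight of each sample's true class
    return float(np.mean(weights * per_sample))
```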
For the final region identification, we applied all three models at the whole slide level in fully convolutional fashion and took their consensus prediction to achieve the most accurate result. Specifically, for each model we took the values at the activation layer (i.e., the class responses for WM/CTX/background), averaged them across models, and then selected the class with the maximum average response as the final classification. Lastly, post-processing steps using the Python library scikit-image (functions remove_small_objects and remove_small_holes, area threshold of 1000 pixels) were applied to further refine region identification. Consensus predictions were then evaluated by pathologists.
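The consensus step reduces to averaging the three models' per-pixel class responses and taking the argmax; a minimal sketch (array shapes and class ordering are assumptions):

```python
import numpy as np

def consensus_labels(model_activations):
    """Average per-model class responses pixel-wise and pick the class
    with the maximum mean response.
    `model_activations`: list of (H, W, 3) arrays, one per model."""
    mean_response = np.mean(np.stack(model_activations, axis=0), axis=0)
    return np.argmax(mean_response, axis=-1)
```

Small-object and small-hole removal would then be applied to the resulting label map, as described in the text.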
Model training for aggregate identification
A total of 18 WSIs (6 AD, 6 PSP, 6 CBD) taken from the middle frontal gyrus and imaged at 20X magnification (0.5 microns per pixel) were used for training and testing. Aggregates were identified under pathologist supervision using the QuPath software as follows. First, to standardize staining levels across slides, tissue samples were pre-processed using the Estimate Stain Vectors function in QuPath. Next, using the rectangle tool, 4–6 regions of interest (ROIs) were selected across the white matter areas of the sample. Finally, manual thresholding of the DAB signal was performed using the positive pixel count algorithm to identify AT8-positive areas. These areas were manually inspected and then exported to a format readable by our machine learning pipelines, such that each pixel in the analyzed rectangular area was labeled as belonging to an aggregate (positive DAB staining), background (no DAB staining), or edge (at the border between aggregates and background) to serve as ground truth for training our models. In all, 74 rectangular regions with sizes between 160,000 and 4,000,000 microns² (~1 to 17 million pixels) were generated. For 3 of the 18 WSIs (one from each disease), we additionally generated ROIs from the cortex in the same manner to test the ability of our models to detect tau aggregates in this region.
To get single-pixel resolution in our aggregate prediction, we employed a convolutional neural network with a UNet architecture. The model takes as input immunohistochemistry (IHC) image patches and outputs patches of the same size, matching the pathologist-generated ground truth in which each pixel is classified as aggregate, background, or edge. WSIs were split using twofold cross validation, such that each fold had an equal number of slides from each disease. We trained a separate model for each fold.
During training, images were transformed with the following augmentations from the Imgaug library: random horizontal flip, random vertical flip, and random additions to hue and saturation. The models were trained on 400 × 400 pixel tiles randomly sampled from the rectangular regions identified above. We used a categorical cross-entropy loss (measuring the difference between the predicted and true pixel classes) with a different weight for each class based on its frequency in the data to account for class imbalance. Additionally, to prevent over- or under-splitting of aggregates, we added an image-level loss that encourages the prediction to produce the same overall amount of edges as the ground truth. We used a Stochastic Gradient Descent optimizer with a learning rate of 0.001 and momentum of 0.5. Model training was performed for 50 epochs, after which our model achieved an average classification accuracy of at least 89% on training data and 82% on testing data.
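The text does not give the exact form of the image-level edge loss; one plausible realization, shown here purely as an assumption, penalizes the absolute difference between the predicted and ground-truth total edge mass, normalized by image size.

```python
import numpy as np

def edge_amount_loss(y_true, y_pred, edge_channel=2):
    """Image-level penalty encouraging the predicted total amount of
    'edge' pixels to match the ground truth, discouraging over- or
    under-splitting of aggregates. The absolute-difference form and the
    edge channel index are assumptions, not the authors' exact loss."""
    true_total = np.sum(y_true[..., edge_channel])
    pred_total = np.sum(y_pred[..., edge_channel])
    return float(np.abs(pred_total - true_total)) / y_true[..., edge_channel].size
```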
From our two trained models, we selected the one with a higher cross-validation accuracy. For each WSI, we then applied our aggregate classifier across the entire slide, to generate an output image “mask” of the same size as the input slide with each pixel classified as background, aggregate or edge. During classification, we implemented a response threshold of 0.5, such that aggregate and edge predictions with a lower class response (i.e., low classification confidence) were classified as background instead. WSIs and their aggregate masks were evaluated side-by-side by pathologists.
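The 0.5 response threshold amounts to reassigning low-confidence aggregate/edge pixels to background; a minimal sketch (class indices are assumptions):

```python
import numpy as np

def confidence_threshold(responses, threshold=0.5, background_class=0):
    """Pixels whose winning class response falls below `threshold`
    are reassigned to background.
    `responses`: (H, W, 3) per-class responses from the UNet."""
    labels = np.argmax(responses, axis=-1)
    confidence = np.max(responses, axis=-1)
    labels[confidence < threshold] = background_class
    return labels
```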
Calculation of tau burden
We used output region masks from our region classifier to calculate the total area of a region, either cortex or white matter, in a WSI. We then quantified the tau burden of that region in terms of the fraction of area occupied by aggregates (i.e., pixels classified as Aggregate by the aggregate classifier).
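The burden computation above is a simple area fraction over the two masks; a sketch assuming boolean masks of equal shape:

```python
import numpy as np

def tau_burden(region_mask, aggregate_mask):
    """Fraction of a region's area occupied by aggregate pixels.
    Both inputs are boolean arrays of the same shape."""
    region_area = np.count_nonzero(region_mask)
    aggregate_area = np.count_nonzero(region_mask & aggregate_mask)
    return aggregate_area / region_area
```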
Hand-crafted feature analysis
From our aggregate masks, as described above, we first identified individual aggregates in the white matter region (as determined by the region classifier). Specifically, aggregates were identified by locating connected islands of pixels assigned to the aggregate class, with distinct aggregates separated by pixels classified as either background or edge.
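Connected-island extraction of this kind is standard connected-component labeling; a sketch using scipy (class indices are assumptions):

```python
import numpy as np
from scipy import ndimage

def find_aggregates(mask, aggregate_class=1):
    """Label connected islands of aggregate pixels; background and edge
    pixels separate distinct aggregates.
    Returns (labeled_image, number_of_aggregates)."""
    binary = (mask == aggregate_class)
    labeled, n_aggregates = ndimage.label(binary)  # 4-connectivity by default
    return labeled, n_aggregates
```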
For each individual aggregate, we then used in-house software to calculate our set of hand-crafted features (Additional file 1: Table S3).
Aggregate quality control
Based on the inspection of a pathologist, we observed two types of artifacts among aggregate objects. The first was objects deemed too small to be considered an aggregate; these were removed using the scikit-image function remove_small_objects with a minimum area threshold of 30 pixels. The second, rarer type was instances of AT8 signal localized within nuclei but not attributable to a mature aggregate. To remove these artifacts from further analysis, we first tested whether our handcrafted features, with the addition of texture features (Additional file 1: Table S4), could distinguish them. To this end, we took a subset of 200 aggregates from each WSI, extracted features for each, and clustered them using Uniform Manifold Approximation and Projection (UMAP). After verifying that these artifacts formed a unique cluster apart from other aggregates, we trained a random forest classifier with these data to classify objects as either an aggregate or an artifact, and tested performance on a held-out portion of the data. For all subsequent analysis, we applied the trained random forest model to filter out these artifacts.
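The random forest artifact filter can be sketched with scikit-learn; hyperparameters, the held-out fraction, and synthetic features here are illustrative assumptions, not the authors' settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

def train_artifact_filter(features, labels, seed=0):
    """Fit a random forest separating true aggregates (label 0) from
    nuclear AT8 artifacts (label 1); report held-out accuracy."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.25, random_state=seed, stratify=labels)
    clf = RandomForestClassifier(n_estimators=100, random_state=seed)
    clf.fit(X_tr, y_tr)
    return clf, clf.score(X_te, y_te)
```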
After removing the artifacts (~4% of all initially detected aggregates) that failed our quality control procedure, we calculated, for each feature, its median value across all aggregates in the white matter region of each WSI. We performed unsupervised clustering of WSIs based on these median feature values using the Clustermap function from the Seaborn toolbox with the metric 'correlation' and average linkage. For each feature, we used a Mann–Whitney U test, with Bonferroni correction for multiple hypothesis testing across disease comparisons, to test whether feature values differed across the 3 diseases.
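The pairwise testing scheme can be sketched with scipy; the dictionary-based interface is an illustrative assumption, but the Bonferroni step (dividing alpha by the number of pairwise comparisons) follows the text.

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

def pairwise_feature_tests(values_by_disease, alpha=0.05):
    """Mann-Whitney U test on one feature for each disease pair, with a
    Bonferroni-corrected significance threshold across the comparisons.
    `values_by_disease`: dict mapping disease name -> per-WSI medians."""
    pairs = list(combinations(sorted(values_by_disease), 2))
    corrected_alpha = alpha / len(pairs)   # Bonferroni correction
    results = {}
    for a, b in pairs:
        _, p = mannwhitneyu(values_by_disease[a], values_by_disease[b],
                            alternative="two-sided")
        results[(a, b)] = (p, p < corrected_alpha)
    return results
```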
Automated disease classification using deep learning
We trained separate DL models on the cortex and white matter to predict disease (AD/CBD/PSP) directly from AT8 stained images (i.e., relevant image features are learned by the model without any need for manual curation of features as above).
A total of 49 WSIs (16 AD, 20 PSP, 13 CBD) taken from the middle frontal gyrus and imaged at 20X magnification (0.5 microns per pixel) were used for training and testing. From each WSI, we generated 10,000 image patches (224 × 224 pixels, 112 × 112 microns) from either the cortex or white matter regions to be used as input for the DL model.
For our DL model, we used a multiple instance learning (MIL) network in order to mitigate the influence of uninformative patches on model training . An MIL framework makes predictions on a group, or batch, of patches rather than an individual patch (Additional file 1: Figure S1), thereby better accommodating patches which lack any disease-specific information (e.g., those without aggregates).
WSIs were split using threefold cross validation, keeping the distribution of data from each disease consistent across folds. We trained two models for each fold, one using data from the cortex regions of the WSIs and one using data from the white matter regions, to assign AT8-stained image patches to their corresponding diseases. We used a binary cross-entropy bag loss with a different weight for each class based on its frequency in the data to account for class imbalance. We used a Stochastic Gradient Descent optimizer with a learning rate of 0.0001 and a momentum of 0.5. We trained each model for 3 epochs, after which our cortex and white matter models achieved average patch-level classification accuracies of at least 93% and 90% on testing data, respectively.
Each of the 3 cross-validation models was applied solely to data in its corresponding testing set (i.e., had not been used in training). For validation, 1000 patches were extracted from each slide. Individual patches were classified into the 3 diseases to measure patch level performance. To determine disease classification at the slide level, we calculated the disease class as the majority class from those 1000 patches.
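The slide-level call described above is a plain majority vote over patch-level predictions; a minimal sketch:

```python
from collections import Counter

def slide_level_call(patch_predictions):
    """Slide-level disease call: the majority class among the slide's
    patch-level predictions (1000 patches per slide in the text)."""
    return Counter(patch_predictions).most_common(1)[0][0]
```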
Interpretation of deep learning classification features
We used 2000 image patches from the testing set of a single fold from our Disease Classification DL model. From each image patch, we extracted 1024 features using the second to last layer of our DL model (i.e., layer before disease classification). We then used the output feature matrix (2000 × 1024) as input for the UMAP function with an n_neighbors value of 5 and a min_dist of 1.
Handcrafted features calculation
For each image patch described above, we applied our aggregate classifier to segment aggregates and removed artifacts as described in the Aggregate Quality Control section. We next calculated the feature values for Area, Eccentricity, and Minor Axis length as described in Aggregate Features section and represented each patch by the mean feature values of the aggregates they contained.