The use of semi-quantitative approaches has been the standard of practice in neuropathology for decades. The introduction of methods such as CERAD, almost 30 years ago, provided much-needed consensus criteria for assessing pathological samples for diagnosis [7, 8, 34]. Since then, the limitations and downsides of these methods have been widely discussed in the literature, and many have pursued more robust methods to enhance and improve the current standard [11, 12, 36]. The advent of digital slide scanning technologies and advances in computer vision, driven by improvements in machine learning, can potentially help overcome the limitations of current scoring systems.
Computational approaches based on machine learning are powerful because of their ability to provide highly accurate results on complicated imaging tasks; the availability of large, well-annotated imaging data sets has been essential to this success. However, the application of these technologies in the medical imaging domain is hampered by the small pool of people qualified to provide expert labels for training data. Unlike widely known imaging datasets such as ImageNet [37], which contain everyday classes of images such as cats and dogs, large pathologically annotated datasets are difficult to generate, and this can limit the use of machine learning in the field. The work of Tang et al. [17] was notable for its creation of a large annotated dataset used to classify pathologies at high resolution in WSIs.
A well-trained neuropathologist can automatically adjust for differences in brain region, staining intensity, the presence of artifacts (tears, shearing), and aging or fading of slides during the evaluation process. While it is theoretically possible to “teach” a machine learning model to adjust for such variation, factors that are not represented in the training data used for model generation can cause a model to produce erroneous results. These variations are exacerbated when comparing images across institutions that might not use identical protocols for tissue preparation and staining. Online databanks containing WSIs are still in their infancy but will alleviate some of the variation seen in pathology imaging data, as slides can be digitized proximal to staining and thus artifacts due to slide age will be minimized [10, 28, 38]. Other variations amongst cohorts will remain a challenge, such as stain color variation, cohort inclusion/exclusion criteria, and disease heterogeneity. If the aim is to develop computational pipelines to replace or support current methods, they must be clearly shown to be robust to these variations.
In this work we validated a previously published CNN pipeline [17] and were able not only to reproduce the original results on the original data set, but also to apply the model directly to a new cohort: without retraining the model, we produced quantitative scores in the Emory data set that strongly correlated with independent CERAD-like scores. Even though the two cohorts showed differences upon high-level visual inspection (Additional file 2: Figure S10-S14), the pipeline tested in this work retained its previously published performance when applied to the new cohort. Indeed, performance between the two cohorts was comparable for all three pathologies of interest (Fig. 2). Surprisingly, this was true even though the model used to generate the quantitative scores had been trained solely on annotated data from another institution. We noticed that the Emory cohort slides showed considerable fading, as they had been stained years earlier. Considering that machine learning models perform poorly when the training data poorly represent the population data, this model is evidently robust enough to account for common pathology slide variations [39]. Of interest in future work would be to train a new model independently on newly stained and annotated Emory cohort images and compare its performance to the original model, as well as to extend this work to other cohorts at different institutions and other anatomic areas, and to have images annotated by multiple experts.
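As an illustration of this type of comparison, a minimal sketch of correlating per-case CNN-derived scores with ordinal CERAD-like categories is shown below; the values are hypothetical and the choice of Spearman rank correlation is purely illustrative, not a description of the exact statistics applied in this work.

```python
# Minimal sketch: correlating per-case CNN-derived scores with ordinal
# CERAD-like categories. All values are hypothetical, and Spearman rank
# correlation is used only as an illustration.
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-case values for a single pathology (e.g., cored plaques)
cnn_scores = np.array([0.02, 0.15, 0.40, 0.05, 0.33, 0.58])
cerad_like = np.array([0, 1, 3, 0, 2, 3])  # none/sparse/moderate/frequent coded 0-3

rho, p_value = spearmanr(cnn_scores, cerad_like)
print(f"Spearman rho = {rho:.2f}, p = {p_value:.3g}")
```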
We were also interested in dissecting this pipeline beyond the original investigation, using an Emory cohort selected to contain additional variance. When selecting the Emory cohort we focused on two factors: (1) cases showing a wide range of the three Aβ pathologies of cored plaques, diffuse plaques, and CAA; and (2) cases displaying varied pathological diagnoses, including concomitant diagnoses. Various neuropathologies often occur together, and it is still poorly understood how some of these markers of pathology may interact and whether there is a clear cause and effect between them [2, 22, 40]. Most of these neuropathological diagnoses have clear criteria, at least within the same institutions, and are often defined by pathologies within select neuroanatomic locations. For example, AD is clearly identified by Aβ and tau pathologies present in immunostained tissue; TDP-43 inclusions are identified on TDP-43 immunohistochemistry and may be localized to limbic and/or cortical regions; and Lewy body disease is characterized by the presence and distribution of Lewy bodies identified on alpha-synuclein immunostained tissue and can involve brainstem, limbic, and/or cortical regions [3, 6, 9, 20, 21, 41, 42]. Our new cohort contained cases in various categories of concomitant diagnosis (AD + TDP-43, AD + LBD, AD + LBD + TDP-43) as well as cases that showed only AD pathologies and normal control subjects. We want to reiterate that we only evaluated temporal lobe staining for Aβ. LBD and TDP-43 pathology are defined by the presence of different pathologies (Lewy bodies and TDP-43 inclusions); while these inclusions may be present in the temporal lobe in some cases, they are best assessed using staining protocols other than Aβ immunohistochemistry. When we grouped AD with concomitant pathologies separately to assess differences between the concomitant groups and the control group, these groups were clearly distinguishable from each other (Fig. 3). Surprisingly, the concomitant diagnosis group of AD + TDP-43 showed a significantly greater CNN score for cored plaques than the AD group. Recent studies have demonstrated associations between AD pathologies and TDP-43 deposition, and more research is needed to further determine the significance of this finding [43].
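A minimal sketch of such a between-group comparison of CNN scores is given below; the score values are hypothetical, and the Mann-Whitney U test is shown only as an illustrative non-parametric comparison, not as the exact test applied in this study.

```python
# Minimal sketch of a between-group comparison of CNN scores (e.g., cored plaque
# scores in AD-only vs. AD + TDP-43 cases). Values are hypothetical, and the
# Mann-Whitney U test is used purely as an illustration.
import numpy as np
from scipy.stats import mannwhitneyu

ad_only = np.array([0.12, 0.18, 0.25, 0.20, 0.15])   # hypothetical per-case scores
ad_tdp43 = np.array([0.30, 0.41, 0.28, 0.37, 0.45])  # hypothetical per-case scores

stat, p_value = mannwhitneyu(ad_only, ad_tdp43, alternative="two-sided")
print(f"U = {stat:.1f}, p = {p_value:.3g}")
```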
Another aspect we investigated in this work was comparing pathologies within the gray matter versus the entire tissue section. Most Aβ deposits are located in the neuron-rich gray matter, with little seen in the white matter [26]. This notion was borne out in the confidence heatmaps in Tang et al. [17]. Because of this distribution, one might anticipate that variations in the white matter-to-gray matter ratio between images would introduce inherent noise into the CNN scores. Upon restriction of the analysis to gray matter regions, CNN scores remained correlated with CERAD-like categories, Reagan scores, and pathological diagnosis, and the statistical comparisons amongst disease groups were unaltered (Table 2 and Additional file 2: Figure S6-S8). Cored and diffuse plaque CNN scores increased when focusing on the gray matter only, with average percent changes of 23% for cored and 29.3% for diffuse plaques. In contrast, CAA CNN scores decreased by an average of 22.9%. We hypothesize that this limited impact is mostly due to similar white-to-gray matter ratios across the imaging cohorts, but also to the low amount of pathology that does occur in the white matter.
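As a simplified illustration of this gray matter restriction, the sketch below computes a score over the whole tissue and over a gray matter mask only, then reports the percent change; the heatmap, the mask, and the scoring rule (mean CNN confidence over the masked region) are hypothetical simplifications of the actual pipeline.

```python
# Minimal sketch of restricting a CNN confidence heatmap to gray matter and
# computing the percent change in the resulting score. The heatmap, the binary
# gray matter mask, and the scoring rule are hypothetical simplifications.
import numpy as np

rng = np.random.default_rng(0)
heatmap = rng.random((100, 100))          # per-tile CNN confidences (toy data)
tissue_mask = np.ones((100, 100), bool)   # entire tissue section
gray_mask = np.zeros((100, 100), bool)
gray_mask[:, :60] = True                  # pretend the left 60% of the tissue is gray matter

def score(conf, mask):
    """Score = mean CNN confidence over the masked region."""
    return conf[mask].mean()

whole = score(heatmap, tissue_mask)
gray_only = score(heatmap, gray_mask)
pct_change = 100.0 * (gray_only - whole) / whole
print(f"whole-tissue: {whole:.3f}, gray-only: {gray_only:.3f}, change: {pct_change:+.1f}%")
```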
In human scoring schemes, the use of a small field of view, usually the highest-density region for the CERAD criteria, can improve human consistency and reliability [7, 44]. Computationally, we could take a similar approach and score the images only by their highest-density regions. However, we find that using a larger area to calculate the scores results in better agreement with human semi-quantitative scores. This is promising, as a benefit of computational approaches is the ability to reliably analyze large regions of images at a scale that is simply not feasible for humans. The real potential strength of this capability, however, is not displayed by this simple analysis, as ultimately it must still correlate to categories defined by only one observer. Additional work with multiple annotators is warranted. Analyses focusing on whole-tissue distributions of pathologies, not just a single score per image, might shed new light on pathologically unique groups.
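To make the contrast concrete, the sketch below compares a whole-tissue score with a CERAD-style "highest-density field" score computed over a sliding window; the density map and window size are hypothetical and are not parameters of the published pipeline.

```python
# Minimal sketch contrasting a whole-tissue score with a "highest-density field"
# score computed over a sliding window. The toy density map and the 20x20 window
# size are hypothetical.
import numpy as np
from scipy.ndimage import uniform_filter

rng = np.random.default_rng(1)
density = rng.poisson(0.3, size=(200, 200)).astype(float)  # toy plaque counts per tile

whole_tissue_score = density.mean()

# Mean density within each 20x20-tile window; taking the maximum mimics scoring
# only the densest field of view.
window_means = uniform_filter(density, size=20, mode="constant")
highest_field_score = window_means.max()

print(f"whole-tissue mean: {whole_tissue_score:.3f}")
print(f"highest-density field mean: {highest_field_score:.3f}")
```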
Together, the work presented here shows strong evidence of a neuropathology imaging machine learning pipeline that is robust to cohort variations; however, some limitations exist. Although the model displayed strong performance on the new cohort, significant variations were seen in select variables. Specifically for diffuse plaques, the most abundant pathology, we saw large standard deviations between the cohorts and even within cohorts (Fig. 2). Upon close inspection, the CNN algorithm was grouping very dense regions of pathology together and counting them as one, reflecting the inherent nature of diffuse plaques. This variation was unexpected, since we used the same trained model as the previously published work. Further investigation revealed that color preprocessing had created variations between our re-creation and the original published work due to differing computer package versions, including Python language and operating system versioning. Since the pipeline involves some user-defined parameters, variations in preprocessing can result in unforeseen differences. For better reproducibility, we developed and have made available a Docker container that bundles the specific versions of Python and system packages used in this work (https://hub.docker.com/repository/docker/jvizcar/ab_plaque_box). Future work leveraging this containerized environment could expand on the methods used in this pipeline. Of interest would be individual models that focus on different pathologies of interest, such as TDP-43 and LBD. This would allow deeper phenotyping of cases by analyzing multiple stains (only 4G8 was used here) and uniquely stratifying concomitant pathologies. As stated previously, additional studies examining other brain regions and staining modalities, and using datasets annotated by multiple experts, are warranted.
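As a small illustration of the reproducibility concern, the sketch below records the Python and package versions of the current environment; the package names listed are illustrative, and the released Docker container pins its own specific set of dependencies.

```python
# Minimal sketch of recording the package versions of the current environment,
# in the spirit of the Docker container released with this work. The package
# names are illustrative only.
import sys
from importlib import metadata

packages = ["numpy", "scipy", "scikit-image", "opencv-python"]

print(f"python {sys.version.split()[0]}")
for name in packages:
    try:
        print(f"{name}=={metadata.version(name)}")
    except metadata.PackageNotFoundError:
        print(f"{name}: not installed")
```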
Another route to improving the CNN scoring might be to recast the analysis as a segmentation problem, which would result in a tighter delineation of the pathologies. The biggest hurdle would be generating sufficient training data to achieve high accuracy with a segmentation machine learning model. However, the benefits would be vast, as segmentation would allow even deeper phenotyping of pathology, moving from simple burden scores to spatial distributions and morphological subtypes within the pathologies. Ultimately, it would allow machine learning models such as the one used in this work to provide not just a re-creation of neuropathology assessment but also a means to investigate complex patterns not feasible in purely human-based analysis. We encourage the use of this pipeline and all the provided tools (the Docker container, Emory cohort, and code used in this project are fully available; see Data Availability) to further investigate the benefits it could have in common neuropathology practice. Furthermore, we hope this work inspires other research groups to establish collaborations with other institutions to validate machine learning models in pathology on larger and more diverse cohorts.