Fig. 4 | Acta Neuropathologica Communications

Fig. 4

From: Deep learning from multiple experts improves identification of amyloid neuropathologies

Consensus models performed better than individual-expert models across all benchmarks. a Four evaluation benchmark schemes to compare consensus models with individual-expert models. The row indicates the model and the column indicates the benchmark. For each evaluation scheme, the average AUPRC of the blue region (individual-expert models) is compared with the average AUPRC of the gold region (consensus models) over the hold-out test set. The consensus-of-two model is shown in dark gold for emphasis. The “self benchmarks” scheme was the most internally consistent: it evaluated each individual-expert model according to the labels of its annotator (i.e., its own benchmark). For consensus models, the self benchmark corresponded to labels derived from the matching consensus-of-n strategy. The “consensus benchmarks” scheme independently evaluated each model on every consensus-of-n annotation set from n = 1 to n = 5. The “individual benchmarks” scheme independently evaluated each model on each of the five individual-expert benchmarks. The “all benchmarks” scheme evaluated each model on its average performance across all benchmarks. b Performance gains of consensus models over individual-expert models. Values are reported as the absolute AUPRC difference. We calculated p-values of the comparisons using a two-sample Z-test (Methods). P-values for the self benchmark are not included because the sample size (n = 20 comparisons) is not large enough to assign significance. 95% confidence intervals are shown in parentheses. The row indicates the type of benchmark considered when evaluating the model performance differentials, while the column shows the Aβ class being evaluated. The highest performance differential for each Aβ class is in bold. c Heatmap as in b, for only the consensus-of-two model versus the individual-expert models. For this consensus-of-two model evaluation, only the dark-gold regions in a, corresponding to the consensus-of-two model, are compared with the blue region.
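
The comparison described in this caption can be illustrated with a minimal sketch. It is not the authors' code and rests on two assumptions the caption does not state: that a consensus-of-n label is positive when at least n of the experts marked the example positive, and that the two-sample Z-test uses the unpooled standard error of the two groups of AUPRC values. The function names `consensus_labels` and `two_sample_z` and all numbers are hypothetical, chosen only for illustration.

```python
import math
from statistics import mean, variance


def consensus_labels(expert_labels, n):
    """Build consensus-of-n labels from per-expert binary labels.

    Assumption (not stated in the caption): an example is positive
    when at least n experts labelled it positive.
    expert_labels is a list with one equal-length 0/1 list per expert.
    """
    return [1 if sum(votes) >= n else 0 for votes in zip(*expert_labels)]


def two_sample_z(group_a, group_b):
    """Return (mean difference, z statistic, two-sided p-value).

    Assumption: unpooled standard error from the sample variances,
    with the normal distribution as the reference distribution.
    """
    diff = mean(group_a) - mean(group_b)
    se = math.sqrt(variance(group_a) / len(group_a)
                   + variance(group_b) / len(group_b))
    z = diff / se
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)).
    p = math.erfc(abs(z) / math.sqrt(2))
    return diff, z, p


# Toy labels from three hypothetical experts on three examples.
toy_experts = [[1, 0, 1], [1, 1, 0], [0, 0, 1]]
consensus_of_two = consensus_labels(toy_experts, 2)

# Made-up AUPRC values, for illustration only (not results from the paper).
toy_consensus_auprc = [0.8, 0.9, 0.7]
toy_individual_auprc = [0.5, 0.6, 0.4]
gain, z_stat, p_value = two_sample_z(toy_consensus_auprc, toy_individual_auprc)
```

In this sketch `gain` plays the role of the absolute AUPRC difference reported in panel b, computed as the mean over the gold-region values minus the mean over the blue-region values.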
