Toward a generalizable machine learning workflow for neurodegenerative disease staging with focus on neurofibrillary tangles

Table 3 Model performance on the Emory holdout dataset for model-assisted-labeling models

Models	Pre-NFT Precision	Pre-NFT Recall	Pre-NFT F1-score	iNFT Precision	iNFT Recall	iNFT F1-score	Macro F1-score
iter. 1	0.36 ± 0.03	0.40 ± 0.03	0.38 ± 0.03	0.86 ± 0.01	0.57 ± 0.00	0.69 ± 0.00	0.53 ± 0.01
iter. 2	0.37 ± 0.02	0.45 ± 0.01	0.41 ± 0.01	0.84 ± 0.02	0.63 ± 0.01	0.72 ± 0.01	0.56 ± 0.01
iter. 3	0.29 ± 0.02	0.46 ± 0.01	0.36 ± 0.02	0.82 ± 0.01	0.71 ± 0.02	0.76 ± 0.01	0.56 ± 0.01
iter. 4	0.31 ± 0.02	0.47 ± 0.01	0.37 ± 0.02	0.79 ± 0.01	0.74 ± 0.02	0.77 ± 0.01	0.57 ± 0.00
iter. 5	0.31 ± 0.03	0.51 ± 0.00	0.38 ± 0.02	0.78 ± 0.01	0.76 ± 0.02	0.77 ± 0.01	0.58 ± 0.02
iter. 6	0.30 ± 0.04	0.53 ± 0.02	0.38 ± 0.03	0.75 ± 0.01	0.78 ± 0.02	0.77 ± 0.01	0.57 ± 0.02
iter. 7	0.29 ± 0.01	0.53 ± 0.02	0.38 ± 0.02	0.74 ± 0.00	0.81 ± 0.01	0.77 ± 0.00	0.58 ± 0.01
iter. 8	0.26 ± 0.01	0.54 ± 0.04	0.35 ± 0.02	0.73 ± 0.02	0.80 ± 0.02	0.76 ± 0.02	0.56 ± 0.02
amygdala	0.46 ± 0.06	0.52 ± 0.08	0.48 ± 0.00	0.73 ± 0.03	0.86 ± 0.02	0.79 ± 0.03	0.64 ± 0.02
hippocampus	0.27 ± 0.04	0.44 ± 0.08	0.33 ± 0.04	0.68 ± 0.03	0.78 ± 0.01	0.73 ± 0.02	0.53 ± 0.03
temporal	0.14 ± 0.06	0.20 ± 0.10	0.16 ± 0.07	0.76 ± 0.06	0.67 ± 0.05	0.71 ± 0.04	0.44 ± 0.06
occipital	0.04 ± 0.03	0.22 ± 0.19	0.06 ± 0.05	0.68 ± 0.09	0.76 ± 0.09	0.71 ± 0.04	0.39 ± 0.05
QC ROIs	0.41 ± 0.04	0.45 ± 0.01	0.43 ± 0.03	0.78 ± 0.01	0.85 ± 0.03	0.81 ± 0.01	0.62 ± 0.02
best consensus	0.36 ± 0.04	0.53 ± 0.03	0.43 ± 0.04	0.82 ± 0.01	0.70 ± 0.01	0.76 ± 0.01	0.59 ± 0.02

Additional models trained on modified datasets are also shown. iter.: iteration of model-assisted labeling; amygdala/hippocampus/temporal/occipital: models trained only on ROIs from the specified brain region (temporal and occipital refer to the temporal and occipital cortex); QC ROIs: models trained only on ROIs whose labels were curated during model-assisted labeling; best consensus: consensus model with n = 4 (Additional file 3: Fig. S4). Values are the mean ± standard deviation across the three-fold cross-validation models. The bold score marks the best-performing model trained on the dataset from all brain regions
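
As a reading aid for the table, the sketch below shows one way the reported quantities could be derived: a per-class F1-score from precision and recall, a macro F1-score as the unweighted mean of the Pre-NFT and iNFT F1-scores, and a mean ± standard deviation across three cross-validation folds. This is an illustration and not the authors' code. The function names are hypothetical, the per-fold numbers are placeholders rather than values from the study, and the use of the sample standard deviation is an assumption, since the source does not state which estimator was used. The macro definition does appear to agree with the table: for iter. 1, (0.38 + 0.69) / 2 = 0.535, close to the reported 0.53 ± 0.01.

```python
from statistics import mean, stdev


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall; defined as 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(per_class_f1: list[float]) -> float:
    """Unweighted mean of the per-class F1-scores (each class counts equally)."""
    return mean(per_class_f1)


def summarize(values: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across folds.

    Assumption: sample (n - 1) standard deviation; the source does not specify.
    """
    return mean(values), stdev(values)


# Placeholder (precision, recall) pairs per fold; NOT data from the study.
folds = [
    {"Pre-NFT": (0.35, 0.40), "iNFT": (0.85, 0.57)},
    {"Pre-NFT": (0.38, 0.42), "iNFT": (0.86, 0.58)},
    {"Pre-NFT": (0.34, 0.39), "iNFT": (0.87, 0.56)},
]

# One macro F1-score per fold, then mean and standard deviation over the folds.
macro_per_fold = [
    macro_f1([f1_score(p, r) for (p, r) in fold.values()]) for fold in folds
]
avg, sd = summarize(macro_per_fold)
print(f"Macro F1-score: {avg:.2f} ± {sd:.2f}")
```

Because the macro average weights both classes equally, a model with a low Pre-NFT F1-score is penalized even when its iNFT F1-score is high, which is one reason the regional models with weak Pre-NFT performance (temporal, occipital) score lowest in the last column.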

ISSN: 2051-5960