C9orf72 intermediate expansions of 24–30 repeats are associated with ALS

The expansion of the hexanucleotide repeat GGGGCC in C9orf72 is the most common known cause of ALS, accounting for ~40% of familial cases and ~7% of sporadic cases in the European population. In most people, the repeat length is 2, but in people with ALS, hundreds to thousands of repeats may be observed. A small proportion of people have an intermediate expansion, of the order of 20 to 30 repeats in size, and it remains unknown whether intermediate expansions confer risk of ALS in the same way that massive expansions do. We investigated the association of this intermediate repeat with ALS by performing a meta-analysis of four previously published studies and a new British/Alzheimer’s Disease Neuroimaging Initiative dataset of 1295 cases and 613 controls. The final dataset comprised 5071 cases and 3747 controls. Our meta-analysis showed association between ALS and intermediate C9orf72 repeats of 24 to 30 repeats in size (random-effects model OR = 4.2, 95% CI = 1.23–14.35, p-value = 0.02). Furthermore, we showed a different frequency of the repeat between the northern and southern European populations (Fisher’s exact test p-value = 5 × 10⁻³). Our findings provide evidence for the association between intermediate repeats and ALS (p-value = 2 × 10⁻⁴) with direct relevance for research and clinical practice by showing that an expansion of 24 or more repeats should be considered pathogenic. Electronic supplementary material: The online version of this article (10.1186/s40478-019-0724-4) contains supplementary material, which is available to authorized users.


Introduction
Amyotrophic lateral sclerosis (ALS) is a neurodegenerative disease, primarily affecting upper and lower motor neurons, resulting in progressive weakness and culminating in death from neuromuscular respiratory failure, typically 2-5 years after diagnosis [1]. The incidence is 1-2 per 100,000 person-years, point prevalence is 3 to 5 per 100,000 people in Europe and the United States, and the lifetime risk is 1 in 300 [2][3][4].
Following initial linkage and association studies, the massive expansion of a hexanucleotide repeat in the C9orf72 gene was found to be the most frequent cause of ALS [5][6][7][8][9][10]. In most people, the repeat length is 2, but in people with ALS, hundreds to thousands of repeats may be observed [11]. A small proportion of people have an intermediate expansion, of the order of 20 to 30 repeats in size, and several studies suggest that a threshold of 20 or 23 repeats could be used to discriminate between pathogenic and neutral expansions [7,11,12,13,14,15,16]. However, the rarity of intermediate repeats, and the consequent lack of statistical power, has limited the validity of these results, and it remains unknown whether intermediate expansions confer risk of ALS in the same way that massive expansions do. At present, a repeat length of > 30 is typically used as the threshold to distinguish between neutral and pathogenic expansions [5].
One meta-analysis based on four studies suggested an association of intermediate expansions between 24 and 30 repeats in size with sporadic ALS (chi-squared p-value = 0.03, fixed-effect model) [17]. However, while this meta-analysis was rigorous, it had a few limitations which prevented the results from translating into research and medical practice. First, several studies were excluded because data on repeats of length < 30 were not available [17]. Second, the fixed-effect model used in the meta-analysis assumes that the genetic effects are the same across the combined investigations, and that all differences are due to chance [18,19]. While this assumption appears to be supported by the lack of heterogeneity across the involved studies (Q-test p-value = 0.61), heterogeneity may be masked. Genetic effects may vary across different populations for various reasons, including both genuine differences and differential biases and errors across studies [20,21], and the C9orf72 repeat expansion exemplifies this variation [13,15,16,22,23]. Moreover, the Q-test and I-squared do not describe heterogeneity well when the number of studies is small [24,25], and the datasets did not provide sufficient power to properly test the meta-analysis studies for heterogeneity. Third, other studies of intermediate repeat sizes have been inconclusive [5,7,14].
Taking these considerations into account, we analysed a new group of 1295 people with ALS and 613 controls, all sized for C9orf72 repeats, and included the findings in a meta-analysis of studies for which there are data on intermediate repeats of size 24 or greater. We did not investigate intermediate repeats of size 20 to 23, since several studies have observed such repeats in both ALS patients and controls with no differences in allele distribution [5,16,23], as was also the case in our new cohort (Fig. 1 and Additional file 1: Table S1).

Whole-genome sequencing samples
Whole-genome sequencing (WGS) data were obtained from blood samples of 1908 individuals, comprising 1295 people with ALS (all apparently sporadic cases) and 613 unaffected controls. The samples came from two sources: a UK set of 1295 ALS cases and 340 matched controls [26], and 273 controls from the Alzheimer's Disease Neuroimaging Initiative (ADNI) WGS dataset (http://adni.loni.usc.edu). The UK dataset was generated as part of Project MinE [26,27] on the Illumina HiSeq X platform (150 bp paired-end reads), and the ADNI dataset on the Illumina 2000 platform (100 bp paired-end reads).

Genotyping
We used ExpansionHunter [28] to size the C9orf72 GGGGCC repeat in the WGS data. ExpansionHunter has been previously validated for C9orf72 repeat sizing in 5787 WGS samples from Project MinE that were also genotyped using repeat-primed polymerase chain reaction (PCR) and fluorescence PCR. Using the repeat-primed PCR calls as the gold standard, ExpansionHunter showed an expansion detection accuracy > 99% [28]. ExpansionHunter also showed high concordance with PCR and Sanger sequencing for repeats whose total length did not exceed the read length [28,29], while providing a confidence interval for longer repeats. We used the repeat length called by ExpansionHunter when the repeat was shorter than the read length, that is, < 25 repeats for the UK dataset and < 17 repeats for the ADNI dataset, to account for the difference in sequencing platform, and we used the
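The two cut-offs quoted above (25 and 17 repeats) are consistent with dividing each platform's read length by the 6 bp length of the GGGGCC motif and rounding up. This is an interpretation of the numbers given in the text, not a documented ExpansionHunter setting, and the function name `read_length_cutoff` is hypothetical. A minimal sketch of that arithmetic:

```python
import math

MOTIF_BP = 6  # length of the GGGGCC motif in base pairs


def read_length_cutoff(read_length_bp: int, motif_bp: int = MOTIF_BP) -> int:
    """Smallest number of repeats whose total length reaches one full read.

    Assumption: repeats strictly below this count fit inside a single read.
    """
    return math.ceil(read_length_bp / motif_bp)


# Read lengths as stated for the two sequencing platforms.
uk_cutoff = read_length_cutoff(150)    # UK dataset, 150 bp reads
adni_cutoff = read_length_cutoff(100)  # ADNI dataset, 100 bp reads
print(uk_cutoff, adni_cutoff)
```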

Meta-analysis and statistics
For meta-analysis, we selected studies for which data were available for repeats of size 24 or more [17]. Association between the number of repeats and disease status was analysed by chi-square and Fisher's exact tests, with the samples split into two categories: samples with an intermediate expansion of 24-30 repeats and the remainder, including those with repeat length > 30. Heterogeneity across studies was examined using the I-squared statistic and chi-square-based Q-tests. The Q-test power was calculated using the Jackson method [25]. Both fixed-effect and random-effects models were used. Sensitivity analyses were conducted to evaluate the influence of individual results on the pooled estimate after sequential removal of each study. Potential publication bias was assessed by both Begg and Egger tests. The Kolmogorov-Smirnov test was used to assess the difference in the repeat length distributions between cases and controls. To test for phenotype differences between our intermediate repeat carriers (> 23 and ≤ 30), the expansion carriers (> 30) and the non-expanded patients (< 24), we used the logrank test to compare the survival distributions, Pearson's chi-squared test for gender and site of onset, and the Kolmogorov-Smirnov test for the age at onset. The Q-test power calculation was made using Stata 15 (Stata Corp, College Station, TX, USA). All other analyses were done in R 3.4.4 using the metabin function of the meta package [30]. The p-value threshold used in this study to indicate statistical significance was 0.05.
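As an illustration of how fixed-effect and random-effects pooling, the Q-test statistic and I-squared relate to one another, the following Python sketch pools log odds ratios from 2x2 tables. It is not the analysis code used in the study, which relied on metabin in R. It swaps in inverse-variance weighting and the DerSimonian-Laird estimator of between-study variance, applies a 0.5 continuity correction to tables with a zero cell, and uses purely hypothetical counts.

```python
import math


def log_or_and_var(a, b, c, d):
    """Log odds ratio and its variance for one 2x2 table.

    a, b: carriers / non-carriers among cases; c, d: the same among controls.
    Adds 0.5 to every cell when any cell is zero (continuity correction).
    """
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return math.log((a * d) / (b * c)), 1 / a + 1 / b + 1 / c + 1 / d


def pool(tables):
    """Inverse-variance fixed-effect and DerSimonian-Laird random-effects pooling."""
    effects = [log_or_and_var(*t) for t in tables]
    y = [e[0] for e in effects]
    v = [e[1] for e in effects]
    w = [1 / vi for vi in v]
    fixed = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, y))
    df = len(tables) - 1
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    w_re = [1 / (vi + tau2) for vi in v]
    random_ = sum(wi * yi for wi, yi in zip(w_re, y)) / sum(w_re)
    return {
        "or_fixed": math.exp(fixed),
        "or_random": math.exp(random_),
        "Q": q,
        "I2": i2,
        "tau2": tau2,
    }


# Hypothetical counts for illustration only (not the studies in this article).
example_tables = [(4, 996, 1, 799), (3, 597, 0, 500), (6, 1194, 2, 898)]
print(pool(example_tables))
```

When the estimated between-study variance `tau2` is zero, the two models give the same pooled odds ratio; they diverge only when Q exceeds its degrees of freedom.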

Alzheimer's Disease Neuroimaging Initiative
Data used in the preparation of this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). The ADNI was launched in 2003 as a public-private partnership, led by Principal Investigator Michael W. Weiner, MD. The primary goal of ADNI has been to test whether serial magnetic resonance imaging (MRI), positron emission tomography (PET), other biological markers, and clinical and neuropsychological assessment can be combined to measure the progression of mild cognitive impairment (MCI) and early Alzheimer's disease (AD).

C9orf72 repeat expansion distribution in the new cohort
For the British dataset, the median age at ALS onset was 62.4 years and the median age of controls was 60.1 years, with a male-female ratio of 62:38. In the ADNI dataset the median age was 73.2 years and the male-female ratio was 49:51.
We identified no difference in the distributions of repeat size between ALS cases and controls in the 1-23 repeat interval (Kolmogorov-Smirnov test, p-value = 0.96). Nine cases and one control were found to carry 24-30 repeats. Five controls were found to carry expansions whose length is rarely seen in unaffected controls and among which a few were hundreds of repeats long (Fig. 1 and Additional file 1: Table S1). Specifically, five British controls carried large expansions up to hundreds of repeats (sizes 60, 357, 406, 413 and 484) and one British control carried an intermediate repeat expansion of 26 repeats. 6.6% of cases had a repeat of size > 30.
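The carrier counts above define a 2x2 table for the new cohort: 9 of 1295 cases and 1 of 613 controls carry 24-30 repeats. The sketch below computes the crude odds ratio and a two-sided Fisher's exact p-value for that table using only the Python standard library. It assumes every sample was successfully sized, so that non-carrier counts are simply the cohort totals minus the carriers, and it illustrates the test rather than reproducing a result reported in the text.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, row1)

    def prob(k):
        # Hypergeometric probability of k carriers falling in the first row.
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    k_min, k_max = max(0, row1 - (n - col1)), min(col1, row1)
    p_obs = prob(a)
    # Sum every table that is no more likely than the observed one.
    total = sum(p for p in map(prob, range(k_min, k_max + 1))
                if p <= p_obs * (1 + 1e-7))
    return min(1.0, total)


# Counts from the new British/ADNI cohort; non-carriers are an assumption
# (cohort size minus carriers).
carriers_cases, n_cases = 9, 1295
carriers_controls, n_controls = 1, 613
a, b = carriers_cases, n_cases - carriers_cases
c, d = carriers_controls, n_controls - carriers_controls

odds_ratio = (a * d) / (b * c)
p_value = fisher_exact_two_sided(a, b, c, d)
print(f"crude OR = {odds_ratio:.2f}, Fisher two-sided p = {p_value:.3f}")
```

With only ten carriers in total, a single cohort of this size gives limited power, which is the motivation for pooling across studies.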

Meta-analysis of the 24-30 repeat expansion
A literature review [17] identified four studies which provided original data on the number of patients carrying 24-30 repeats: one study from North America [23], one from Italy [16], one from France [15], and one from Spain [13]. We extended the review to all articles published up to January 2019 and identified one further study [28] providing data suitable for our meta-analysis. However, most of the ALS cases in this additional study overlapped with our British dataset, so we excluded it from our analysis. After reproducing the previously published meta-analysis, we performed two new meta-analyses. In the first we added our British samples (1295 cases and 340 controls) and in the second we added our British/ADNI dataset (1295 cases and 613 controls). The final dataset used in this study comprised 5071 cases and 3747 controls. We were able to reproduce the association between the C9orf72 24-30 repeat expansion and ALS using the previously used four datasets (fixed-effect model odds ratio = 5.13, 95% confidence interval = 1.28-22.24, p-value = 0.03) (Table 1A). The association was replicated after including our cohorts, both when the British cohort was added (fixed-effect model odds ratio = 4.01, 95% confidence interval = 1.21-13.26, p-value = 0.02) (Table 1B) and when the British/ADNI dataset was added (fixed-effect model odds ratio = 4.82, 95% confidence interval = 1.45-15.96, p-value = 0.01) (Table 1C). The use of the fixed-effect model was suggested by the chi-square Q-test, which did not show heterogeneity (p-value = 0.61), but this test was underpowered to detect heterogeneity (power = 12.07%), suggesting that a random-effects model is more appropriate. Under a random-effects model, we did not show association using only the four studies from the previous meta-analysis (p-value = 0.07).
After adding the UK/ADNI dataset, we were also able to show this association under the random-effects model (odds ratio = 4.2, 95% confidence interval = 1.23-14.35, p-value = 0.02) (Table 1C).
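The reported random-effects odds ratio, confidence interval and p-value can be checked for internal consistency. The sketch below assumes the interval is a symmetric Wald interval on the log odds ratio scale with a critical value of 1.96, which is an assumption about how the interval was constructed rather than something stated in the text. Under that assumption, the standard error is recovered from the interval width and the two-sided p-value follows from the normal distribution.

```python
import math


def wald_from_ci(odds_ratio, ci_low, ci_high, z_crit=1.96):
    """Recover SE, z and two-sided p from an OR and its 95% CI (log scale)."""
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * z_crit)
    z = math.log(odds_ratio) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal p-value
    return se, z, p


# Values reported for the random-effects model with the UK/ADNI dataset.
se, z, p = wald_from_ci(4.2, 1.23, 14.35)
ci_midpoint = math.sqrt(1.23 * 14.35)  # geometric midpoint of the interval
print(f"SE(log OR) = {se:.3f}, z = {z:.2f}, p = {p:.3f}, midpoint = {ci_midpoint:.2f}")
```

The recovered p-value rounds to 0.02 and the geometric midpoint of the interval is close to 4.2, in line with the reported figures.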
No evidence of publication bias was observed from either the Egger (p-value = 0.69) or Begg (p-value = 0.14) test (Additional file 1: Figure S1). Sensitivity testing, in which the pooled odds ratios were recalculated omitting one study per iteration (Additional file 1: Table S2), showed that results were consistent, without major fluctuations.
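The leave-one-out procedure can be sketched as follows. For brevity this example uses an inverse-variance fixed-effect pooled odds ratio rather than the models fitted in the study, and the study names and counts are hypothetical placeholders, not the data in Table S2.

```python
import math


def pooled_or(tables):
    """Inverse-variance fixed-effect pooled odds ratio over 2x2 tables."""
    num = den = 0.0
    for a, b, c, d in tables:
        if 0 in (a, b, c, d):
            # Continuity correction for tables with an empty cell.
            a, b, c, d = (x + 0.5 for x in (a, b, c, d))
        weight = 1 / (1 / a + 1 / b + 1 / c + 1 / d)
        num += weight * math.log((a * d) / (b * c))
        den += weight
    return math.exp(num / den)


def leave_one_out(studies):
    """Pooled OR recomputed with each named study omitted in turn."""
    return {
        omitted: pooled_or([t for name, t in studies.items() if name != omitted])
        for omitted in studies
    }


# Hypothetical studies: (carriers, non-carriers) in cases, then in controls.
example_studies = {
    "study_A": (4, 996, 1, 799),
    "study_B": (3, 597, 0, 500),
    "study_C": (6, 1194, 2, 898),
}
for omitted, value in leave_one_out(example_studies).items():
    print(f"without {omitted}: pooled OR = {value:.2f}")
```

If the pooled estimates stay close to one another across iterations, no single study is driving the result.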

Association of C9orf72 intermediate expansions with ALS in the aggregated dataset
We also investigated the association between the intermediate repeats and ALS in a final dataset generated from the aggregation of the 5 datasets used for the  [5,17]. Expansions of 20-23 repeats have contradictory evidence of association with ALS. This suggests that if such an association exists, their contribution to disease risk is weaker than for longer expansions, and we need larger datasets to investigate it. Due to the very low frequency of intermediate repeats, further investigation and larger sample sizes are needed to consolidate our results and determine whether to lower the threshold further. International initiatives such as Project MinE, whose aim is to collect genetic data (including the length of the C9orf72 repeat) from over 10,000 people with ALS, will be crucial to this end. Interestingly, we also observed a few controls with very large expansions that are only rarely observed in non-affected individuals. We have openly released the repeat frequency in our cohort for a wide range of lengths, to allow further analysis when new cohorts become available (Additional file 1: Table S1). Finally, we looked at age at onset, survival time, gender prevalence and site of onset of the intermediate repeat carriers, and we found no significant differences from the other patients.
However, this analysis was limited by the small number of intermediate repeat carriers (9), which does not allow us to draw any conclusions about the clinical characteristics of this sub-group of patients.

Additional file
Additional file 1: Figure S1. Bias detection in the meta-analysis: A) Egger test; B) Begg test; C) Funnel plot of the 5 studies used in our meta-analysis. Table S1. C9orf72 expansion analysis results obtained with ExpansionHunter on the British/ADNI dataset.