Tumor samples
A total of 36 craniopharyngioma tumor samples, as diagnosed by surgical pathology, were included in this study (Table S1). Twenty-seven were acquired during surgery for pediatric patients (age at surgery < 18 years; mean = 8.4 years, median = 8.0 years), and 9 from adults (age at surgery ≥18 years; mean = 47.9 years, median = 44.0 years). The pediatric specimens were obtained from patients who underwent surgery at member institutions of the Advancing Treatment for Pediatric Craniopharyngioma (ATPC) consortium. Adult specimens were obtained from University of Colorado Hospital and from the University of Alabama, Birmingham. Tumor samples were snap-frozen in liquid nitrogen in the operating room and subsequently stored in freezers at − 80 °C. For transport between institutions, samples were packaged on dry ice, shipped via overnight courier, and immediately placed in a freezer at − 80 °C.
RNA extraction and sequencing
RNA was extracted from snap-frozen samples using the Allprep DNA/RNA Kit (QIAGEN®, Maryland, USA). The quality of the isolated mRNA was determined via DNA analysis ScreenTape (Aligent Technologies). Samples that passed quality control were used to generate cDNA libraries using the Illumina TruSeq Stranded mRNA Sample Prep Kit. RNA sequencing was carried out using the Illumina HiSeq4000 platform with paired-end reads (2 × 151). On average, 40 million reads were collected for each sample and outputted to FASTQ files. Sequencing reads were subjected to adapter-trimming and quality control using the Trimmomatic package [6]. Files were mapped to the GRCh38 genome (v33) and subsequently sorted to yield BAM files using the standard STAR pipeline [7]. BAM files were converted to feature counts using the R Bioconductor package RSubread [8]. Normal pituitary RNA sequencing data was obtained from the GTEx portal (www.gtexportal.org).
Analysis methodology
Due to the possibility that adult and pediatric transcription patterns would be distinguished by only subtle expression differences, we chose a multimodal bioinformatic approach. This was designed to assess for both linear and non-linear relationships, but also to insulate the statistical analysis from potentially false assumptions.
Canonical differential expression analysis
Feature counts for transcriptomes of adult and pediatric ACP samples were analytically processed using a standard DESeq2 protocol [9]. As the experiment intended to rule out differences in gene expression between the 2 age groups, an Independent Hypothesis Weighted (IHW) p-value threshold of 0.1 was utilized to filter transcripts identified as differentially expressed. Results were visualized using a Mass Action (MA) plot and a Euclidean Distance Matrix was employed to serve as a qualitative foundation from which to make statistical methodology arguments downstream.
Divisive, agglomerative, and probabilistic clustering was performed using the DIANA, AGNES, and FANNY functions in the cluster R package [10], respectively. AGNES was implemented with the complete-linkage method. Complete-linkage clustering groups variables by their most dissimilar members and also avoids combining clusters with highly similar elements (as opposed to single-linkage). As shown below in Fig. 1b, we expect no linear dissimilarity between transcriptomes relative to age group, making complete-linkage an appropriate clustering strategy. As subtle differences in the gene expression patterns could be expected, complete-linkage was employed due to its superiority over a single-linkage approach in handling data groups that closely neighbor one another. DIANA and AGNES calculate metrics for cluster organization known as the Divisive Coefficient (DC) and Agglomerative Coefficient (AC), respectively. These coefficients range between 0 and 1, and indicate the precision with which the dendrogram describes the data, with 1 being the most ideal structure and 0 being the worst. As FANNY is not a hierarchical algorithm, and is visualized in principal component space, the percent of variance explained serves as a surrogate value for AC and DC (i.e., a larger percent variance explained value indicates how well a datasets information is represented in the decomposed space).
Although cluster analysis is a primarily descriptive mathematical technique, we calculated the silhouette width as a means to make a quantitative comparison of the differential expression data. Silhouette widths range between − 1 and 1. A value of 1 indicates that the cluster members are closely related to the other members of the cluster, and not to members outside the cluster. A value of − 1 suggests the cluster member is an outlier within the cluster and is actually more representative of other clusters. The mean Silhouette Value was calculated for a range of cluster numbers for the DIANA, AGNES, and FANNY algorithms (Fig. 1f, g, h, respectively).
Information theory analysis
In order to extend the analysis beyond the linear methodology utilized in the canonical analysis, we calculated the Kullback-Leibler (KL) divergence (DKL) and Maximal Information Coefficient (MIC). This was necessary because non-linear transcriptional relationships (i.e., dynamic networks of transcripts) between age groups could exist and are likely abundant, as seen by the development of numerous pathway-based analytical tools (e.g., GSEA, GO-terms, Reactome, etc.). While these types of modules are interesting they often require predefined lists. However, utilizing information theory allows us to inspect potential networks in the absence of prior known networks, and also reduces analytical bias.
DKL is a mathematical approach to identify differences between data distributions through non-linear integration. While this is most commonly used to compare machine learning outputs to ground truth distributions, we employed it by treating one distribution (e.g., pediatric) as the predicted distribution and the other distribution (e.g., adult) as the target distribution (e.g. “ground truth”). This scenario effectively asks the following question: given the expression distribution of the adult cohort, how well can the distribution of the pediatric cohort be predicted. To calculate DKL, raw feature counts generated by RSubread were first class-balanced using the Synthetic Minority Oversampling Technique (SMOTE) [11] via the python imblearn package [12]. DKL values were then calculated for the synthetically balanced dataset using the python scipy stats module.
MIC is used to measure the strength of any linear and/or non-linear relationship between two variables. This is accomplished by binning continuous variables and iteratively calculating mutual information (i.e., the dependence between variables) to identify the maxima. By plotting MIC against Pearson’s R the magnitude of the relationship between the variables can be visualized (MIC = 0 indicates no relationship; MIC = 1 indicates highly related), as can the nature of the relationship (linear vs. non-linear). MIC was calculated using the python minepy module [13].
HD spot analysis
Raw feature counts derived from RSubread were submitted to the HD Spot algorithm [14]. Due to the substantial imbalance of data classes, the HD Spot algorithm was optimized with respect to maximizing the area under the precision-recall (AUPR) curve. HD Spot developed a classifier that achieved an average AUPR value of 1.00 over 5-fold cross-validation and subsequently determined the mean absolute Shapley value for each transcript. Shapley values can conceptually be understood as importance scores. In this context, a higher Shapley value means a transcript is more important in determining the age group from which the sample was taken. The top 50 transcripts ranked by Shapley value and the list of 20 previously identified therapeutic targets were then submitted for Metascape [15] express analysis to explore potential ontologic connections between gene sets using GO terms.