The primary analytical challenge that we encountered in this project was that of efficiently clustering microbial genes based on co-abundance. This general approach has been proposed and implemented previously [13, 16], but existing implementations do not perform exhaustive searches for co-abundant genes because performing all pairwise comparisons of millions of genes in large microbiome datasets [17] is computationally intractable. To overcome this obstacle, we took advantage of the Approximate Nearest Neighbor (ANN) heuristic, which is able to robustly identify candidate subsets of co-abundant genes without having to perform all pairwise comparisons [18, 19]. We implemented a Python package (“ann_linkage_clustering”) to perform exhaustive average linkage clustering using the cosine distance metric on any dataset containing gene abundance data across a set of samples. While this method is relatively computationally intensive, we were able to execute it in a reasonable amount of time using commodity “cloud” computational resources (e.g., 17 h for a set of five million genes across 199 samples with a 256 GB RAM node). While this clustering procedure is not expected to be deterministic, our experience has been that clusters are generally reproduced across replicates and we are actively studying the generalizability of gene clustering as a function of input data and clustering thresholds. In the ideal case, this approach improves the precision of estimating gene-level abundance by combining data from multiple correlated observations, as well as reducing the number of hypotheses to test in an association study, while maintaining the interpretation advantages of distinct genetic elements (core genome, plasmid, virus, etc.).
We applied this novel approach to gene-level metagenomics to test for an association of the gut microbiome across two distinct human diseases: IBD and CRC (Table 1). We selected these diseases because each has been studied by multiple groups who have collected stool samples and performed metagenomic whole-genome “shotgun” (WGS) sequencing (Table 1) [2,3,4,5,6,7,8]. Because each of these previous studies used slightly different protocols for selecting patients, collecting samples, and performing sequencing, an integrated analysis of these datasets should serve to identify those signals in the microbiome which are most robust to the methodological and experimental confounders.
The Co-Abundant Gene groups (CAGs) identified in this project contained 2–23,856 genes, with the majority of genes found in CAGs ranging between 10 and 2000 genes in size and containing the range of metabolic functions expected from complete and partial microbial genomes (Additional file 1: Figure S1). Visual inspection of the genes making up these CAGs also demonstrated the highly consistent patterns of abundance displayed by the genes which were ultimately grouped into these CAGs (Fig. 1). We also analyzed a published single-cell sequencing dataset from the stool microbiome [20] and found that genes from the same CAG were found in the same physical cell at three to nine times the rate expected by chance (Additional file 2: Figure S2). The size, functional content, and clear pattern of co-abundance displayed by the genes in this analysis suggest that the CAGs used for statistical analysis represent biological units that are meaningful reflections of the composition of the microbiome across multiple independent datasets.
Our approach to the bioinformatic and statistical analysis was to select a single study for each disease as the “discovery” cohort and to use that dataset to build a de novo catalog of microbial genes and identify CAGs. That gene catalog and CAG grouping generated from the discovery cohort was subsequently used to analyze the additional validation cohorts. Our statistical model was relatively straightforward and used random effects modeling to estimate the difference in the centered log-ratio of the relative abundance of each CAG in the samples from people with and without the disease state (accounting for multiple sampling of some individuals with random effects models). We chose to group together all participants with any form of the disease state, as the criteria for disease classification was not consistent across studies. In this discovery-validation approach, those CAGs which had a q value of < 0.2 in the discovery cohort were subsequently tested in an additional “validation” cohort, and those CAGs which also had a q value < 0.2 in that second step and the same direction of effect were considered to be associated with disease.
We found with this approach that the estimated coefficient of disease status in the set of CAGs associated with disease in the discovery cohort was significantly associated with the estimated coefficient in the validation cohort (Fig. 2a, b; CRC—r = 0.36, p < 2E−16; IBD—r = 0.30, p < 2E−16). Within the set of CAGs that were associated with disease in the discovery dataset, 44.0% and 97.2% were significantly associated in the validation dataset for CRC and IBD, respectively. When performing the same analysis with unclustered gene-level abundances (a single gene randomly selected from each CAG), we found a roughly 20–40% lower correlation between the estimated coefficient of disease status (Fig. 2c, d) and a much lower validation rate of 9.8% and 76.0%, respectively. We believe that this evidence supports the proposed utility of CAGs for detecting reproducible biological associations of the microbiome with host disease. Furthermore, 24,502/36,871 CRC-associated CAGs had the same sign of the estimated coefficient in the validation cohort as in the discovery cohort (p < 1E−200, see the “Methods” section), and 28,629/31,895 IBD-associated CAGs had the same signed estimated coefficient (p < 1E−200). We further demonstrated the extent of this association by displaying the abundance of the most strongly associated CAGs across a total of three (IBD) or four (CRC) cohorts (Fig. 2e, f), suggesting that this association is not limited to the cohorts selected for discovery and validation. Over and above the claim that the microbiome is associated with disease in both cohorts, we believe that these results indicate that a substantial number of elements of the microbiome that are associated with disease in a given discovery cohort will also be associated with disease in a corresponding validation cohort.
The pattern of association for the IBD datasets was dominated by the 98.5% of CAGs which had a positive coefficient, indicating that they were more abundant in participants without IBD (Fig. 2b) Additional file 5: Table S2. We therefore investigated the gene-level richness, finding a lower level of gene richness observed in IBD samples compared to healthy controls (Additional file 3: Figure S3) [21], corroborating previous observations of lower alpha diversity in IBD [22, 23]. Without our use of the centered log-ratio to adjust for the compositional nature of these datasets, the decreased abundance of a large fraction of the microbiome may have resulted in a spurious finding that the remainder had increased in abundance [24], but in fact, we found that very few CAGs were consistently increased in abundance in IBD relative to the geometric mean of each sample. In addition to the decrease of overall gene richness, the lower number of CAGs found to be consistently enriched in IBD may also be due to an overall heterogeneity or ‘dispersion’ in the organisms which are positively associated with IBD across different people at a given point in time [14, 25]. However, there was a subset of CAGs which were consistently found to be more abundant in IBD, which may represent those bacteria which are able to thrive in the environment of the inflamed gut. Indeed, the taxonomic annotation of the genes in these CAGs is enriched for organisms which have been implicated in some previous studies of IBD and gut pathogens, including Enterobacteriaceae such as Escherichia/Shigella and Salmonella [3, 22, 23] which may exhibit some growth advantage in the context of either the increased oxygen content of the inflamed intestine or the antibiotics used in IBD treatment [9, 10]. Other organisms, such as Ruminococcus gnavus, were only enriched in IBD for a subset of genes (n = 77), supporting the previous hypothesis of a strain-specific association with IBD [4]. There was also a set of KEGG annotations that were weakly but consistently enriched in this set of IBD-associated genes related to colonization and pathogenesis, such as fimbriae genes fimA (K07345) and fimD (K07347), iron transport (K02010), and putrescine transport (K02052; K11072; K11076).
The pattern of association for the CRC datasets was generally balanced between CAGs that were more abundant in healthy participants and those that were more abundant in disease (Fig. 2a) Additional file 4: Table S1. Of the largest CAGs that were reproducibly associated with disease, those which were more abundant in healthy participants tended to be classified as Clostridia (via alignment to NCBI RefSeq), while those which were more abundant in participants with CRC were more taxonomically diverse (Fig. 3a, b). Moreover, we found the functional annotations of the genes in those CAGs to be particularly interesting. There were four KEGG annotations that were significantly enriched in the set of CAGs found to be more abundant in CRC samples (Fisher’s exact test, Holm-Sidak alpha = 0.01): (1) grdA (K10670) is involved in metabolism of glycine/sarcosine/betaine, and higher levels of glycine is a recognized hallmark of cancer cells [26, 27]; (2) oxyR (K04761) is a transcriptional regulator which regulates genes protecting from the biochemical damage induced by reactive oxygen species, of which markedly higher levels are associated with progressive tumors [28, 29]; (3) abgT (K12942) is a transporter responsible for uptake of p-aminobenzoyl-glutamate and may also import other dipeptides [30]; and (4) afuA/fbpA (K02012) are transporters responsible for importing iron [31], which is likely to be more abundant in the gastrointestinal lumen of individuals with CRC due to bleeding. Three of these four annotated functions have clear links to the altered environment of the gut microbiome expected during CRC and likely promote the growth of these organisms in that setting. It remains to be seen whether those organisms which are able to thrive in the CRC gut microbiome also contribute to the progression of disease (Additional file 4).
One advantage of a gene-based approach to metagenomic analysis is that any CAG of interest can be directly compared with the genomes of bacterial isolates in order to identify strains containing each gene. Of the set of genes that we identified as consistently associated with CRC and IBD, we found a number of strains containing large fractions of these genes (Fig. 3c, d). We furthermore propose that this approach of aligning disease-associated genes to whole microbial genomes may be used to identify the members of any culture collection which are likely to have the largest effect in an experimental model of these human diseases (Additional file 5).