Our analysis compiled 747 samples from five studies encompassing 484 genera. We included 126 samples from Arumugam et al. [14] (41 mixed European and Asian samples and 85 European samples), 24 Native African and African American samples [10], 96 urban and rural Russian samples [12], 210 samples of healthy American adults from the Human Microbiome Project [15], and 291 adult samples from Malawi, Venezuela, and America [11].
Multivariate analysis
The “enterotype” hypothesis was initially proposed based on clustering results seen in a principal coordinates analysis. In our initial ordination analysis, we found that all three populations from the Yatsunenko study clustered separately from the remaining studies (Additional file 1: Figure S1). We therefore performed our primary multivariate analyses without these samples but then confirmed our findings with an independent analysis of them. Combined samples from the four remaining datasets do not cluster discretely with either metric or non-metric multidimensional scaling (PCoA or NMDS), using Bray-Curtis or Morisita-Horn distances, but instead show a gradient across both dominant taxa and study populations (Fig. 1, Additional file 2: Figure S10). Coloring by dominant taxa (Fig. 1a) and by the Prevotella ratio (the ratio of Prevotella to the sum of Prevotella and Bacteroides) (Fig. 1b) illustrates the relationship of axes 1 and 2 to Bacteroides (toward the upper right) and Prevotella (toward the lower left) in the samples. When colored by population, we see that the second axis separates the samples within a population, as does the first axis but to a lesser extent (Fig. 1c). In the PCoA of the Malawi, Venezuela, and US populations from the Yatsunenko study (Additional file 3: Figure S7), the first axis appears to correspond primarily to the Prevotella ratio. The three countries are spread along the second axis with the US samples separating from the other two, but the Malawian and Venezuelan samples are intermingled. In general, these data do segregate along the first axis, with fewer intermediate samples than in the combined studies, but distinct clusters are not evident. These ordination results show that the primary axes continue to correlate with Bacteroides and Prevotella when the studies are combined.
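As an illustration of the two dissimilarities named above, the following minimal Python sketch computes Bray-Curtis and Morisita-Horn distances between two abundance vectors. The function names and the toy counts are assumptions introduced here for illustration and are not taken from the analysis code in the additional files; the formulas are the standard textbook definitions.

```python
# Illustrative sketch only: function names and toy counts are hypothetical.

def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum|x_i - y_i| / sum(x_i + y_i)."""
    numerator = sum(abs(a - b) for a, b in zip(x, y))
    denominator = sum(a + b for a, b in zip(x, y))
    return numerator / denominator


def morisita_horn(x, y):
    """Morisita-Horn dissimilarity (1 minus the Morisita-Horn overlap)."""
    total_x, total_y = sum(x), sum(y)
    # Simpson-type dominance term for each sample
    dom_x = sum(a * a for a in x) / total_x ** 2
    dom_y = sum(b * b for b in y) / total_y ** 2
    cross = sum(a * b for a, b in zip(x, y))
    overlap = 2 * cross / ((dom_x + dom_y) * total_x * total_y)
    return 1 - overlap


# Toy genus counts for two samples (three hypothetical genera)
sample_a = [10, 0, 5]
sample_b = [0, 10, 5]

print(bray_curtis(sample_a, sample_b))
print(morisita_horn(sample_a, sample_b))
```

Applying either function to every pair of samples would give the distance matrix that an ordination such as PCoA or NMDS takes as input. Morisita-Horn weights abundant taxa more heavily than Bray-Curtis does, which is one reason to check that a result holds under both.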
To assess the persistent clustering of the microbial communities independent of the relative abundances of Bacteroides and Prevotella, we removed these two taxa from all the samples and reran the PCoA analysis (Fig. 1d, Additional file 3: Figure S7B). Without Prevotella and Bacteroides, the original microbial community classifications (“enterotypes”) no longer segregate the samples. The ellipses in Figs. 1a and 1d were constructed as confidence regions for each of the groups; these regions would represent 95 % of the points if the bivariate data were Gaussian. This demonstrates the effect on the clustering of removing the two dominant taxa. When the dominant taxa are included, the ellipses abut without overlapping. When the dominant taxa are removed, two of the ellipses are almost completely within the third, and all but three of the Prevotella-dominated samples are contained within the ellipse delineating the central Bacteroides samples. Separation of the original clusters is no longer present, and the remaining taxa now account for only 33 % of the variability along axes 1 and 2, as opposed to 54 %. The ordinations of the full samples appear to be driven primarily by the relative quantities of these two highly abundant taxa and not by the presence of distinct microbial communities associated with each group. Interestingly, there is still clustering structure and differentiation of samples visible in Fig. 1d, as one would expect when comparing samples from patients in different geographic regions having different lifestyles and diets. These data structures, however, no longer correlate clearly with Bacteroides and Prevotella.
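The confidence regions described above can be sketched as follows, assuming the textbook construction for bivariate Gaussian data: the ellipse is centered on the group mean, its axes follow the eigenvectors of the sample covariance matrix, and its size is set by the chi-square quantile with two degrees of freedom. The function name and the toy coordinates are hypothetical, and the plotting routine actually used for Fig. 1 may differ in detail.

```python
import math

# Illustrative sketch only: name and inputs are hypothetical.

def gaussian_confidence_ellipse(points, level=0.95):
    """Return (center, semi_major, semi_minor, angle) for 2-D points.

    Needs at least two points. The angle is the orientation of the
    major axis in radians.
    """
    n = len(points)
    mean_x = sum(p[0] for p in points) / n
    mean_y = sum(p[1] for p in points) / n
    # Unbiased sample covariance entries
    sxx = sum((p[0] - mean_x) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - mean_y) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mean_x) * (p[1] - mean_y) for p in points) / (n - 1)
    # Closed-form eigenvalues of the symmetric 2x2 covariance matrix
    half_trace = (sxx + syy) / 2
    spread = math.sqrt(((sxx - syy) / 2) ** 2 + sxy ** 2)
    lam_major = half_trace + spread
    lam_minor = max(half_trace - spread, 0.0)
    # Chi-square quantile with 2 degrees of freedom has a closed form
    scale = -2 * math.log(1 - level)
    angle = 0.5 * math.atan2(2 * sxy, sxx - syy)
    semi_major = math.sqrt(scale * lam_major)
    semi_minor = math.sqrt(scale * lam_minor)
    return (mean_x, mean_y), semi_major, semi_minor, angle


# Toy ordination coordinates for one group of samples
group = [(0.0, 0.0), (4.0, 0.0), (0.0, 1.0), (4.0, 1.0)]
center, a, b, theta = gaussian_confidence_ellipse(group)
print(center, a, b, theta)
```

Computing one ellipse per group and comparing their overlap before and after removing the two dominant taxa gives a rough numerical counterpart to the visual comparison of Figs. 1a and 1d.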
Taxonomic components of microbial communities
To identify taxa integral to a broader definition of microbial community structures, we looked for taxa that correlated with the dominant genus (Prevotella or Bacteroides) across studies (Additional file 4: Table S1). Several researchers reported correlations within the Bacteroides-dominated samples of their own studies, including Acidaminococcus, Roseburia, Faecalibacterium, Anaerostipes, Parabacteroides, and Clostridiales [14], Alistipes only [16], Escherichia (Enterobacteriaceae) and Acinetobacter [10], and Faecalibacterium and Enterobacteriaceae [8]. None of these associations, however, were apparent across all studies. In fact, only Faecalibacterium and Enterobacteriaceae appear in even two of the four studies that reported correlation analyses. No correlations were found within the Prevotella-dominated samples in more than one previous study. Prevotella was associated with Streptococcus, Enterococcus, Desulfovibrio, and Lachnospiraceae [14], Succinivibrio and Oscillospira [10], and Xylanibacter and Butyrivibrio [8]. The majority of taxa found in the Russian communities [12, 13] were not found in previously reported studies. Tyakht et al. used triplets of the three most abundant members of their various samples to compare with these other studies and found that 43 % of their samples were dominated by triplets not found in the non-Russian groups.
To confirm this finding of no consistent multi-study taxa associated with either dominant taxon, and to increase our statistical power of detection, we combined the samples from all of the studies and reanalyzed them. We used the Spearman coefficient (a nonparametric method used in most of the previous studies) as well as SparCC [17]. Using Spearman on all the data and the DESeq2 [18] negative binomial test on the HMP data, we found no taxa in the combined set of Prevotella-dominated samples that correlated with Prevotella above our threshold for statistical and biological significance (Additional file 4: Table S1). SparCC did find two Clostridiales genera that correlated with Prevotella in the combined dataset, Ruminococcus (R² = 0.77) and Dialister (R² = 0.73), but neither of these had a significant correlation in any single study. In the combined Bacteroides-dominated samples, we did not find any significant correlations using the Spearman correlation. A few Clostridiales taxa and one Bacteroidales genus in individual studies did seem correlated with Bacteroides, but none of these were found in more than two of the studies. Using SparCC, Subdoligranulum (R² = 0.75) and Faecalibacterium (R² = 0.73) correlated with Bacteroides but only in the combined set, not in any of the individual datasets. No taxa were significant in more than one of the studies when analyzed separately. These data do highlight that members of both Bacteroidales and Clostridiales are prominent members of the human gut microbiome of Americans and Western Europeans.
To explore additional taxonomic differences between groups defined in the Yatsunenko and HMP studies as Prevotella-rich and Bacteroides-rich, we used the package DESeq2 and the tests based on variance transformed data using the negative binomial model (DESeq function with default arguments) as explained in [18]. This provided a ranking of the most differentially abundant taxa. The analysis of the Yatsunenko data showed Parabacteroides, Alistipes, and Subdoligranulum to be more prevalent in Bacteroides-dominated samples, but these results were not reflected in the HMP data. The DESeq2 analysis confirmed the previous analyses, showing no taxa consistently associating with the Bacteroides- and Prevotella-dominated samples.
Recent work by Lovell et al. [19] has emphasized the importance of proportionality analysis when working with relative abundance data. Applying their proposed algorithm to our compiled data, we still found no significant proportionalities across studies. The taxa that had proportionality statistics less than 0.01 were either very rare or possibly artifacts of the sequencing process. Two of these taxa were found in only one sample, and all of the others were identified only in populations from one study (the European samples from the Arumugam study). They are not prevalent in either the Prevotella- or Bacteroides-dominated samples and are in fact absent from the majority of both. These taxa are clearly not important functional components or biomarkers of a human gut community. The analysis itself can be found in Additional file 5, which contains the Rmd commands, data, and output. This file contains all the code the reader could require to follow the same workflow on a different data set.
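For readers who want a language-neutral view of the rank-correlation screen described above, the following Python sketch computes the Spearman coefficient between a dominant genus and every other taxon, then keeps those whose absolute correlation exceeds a cutoff. The helper names, the toy abundances, and the cutoff of 0.7 are assumptions made for illustration; the analysis itself was run in R, and its actual thresholds are given in the additional files.

```python
import math

# Illustrative sketch only: names, data, and cutoff are hypothetical.

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i..j, 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (raises) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman coefficient: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlated_taxa(table, dominant, cutoff=0.7):
    """Taxa whose |Spearman rho| with the dominant genus exceeds cutoff."""
    result = {}
    for taxon, values in table.items():
        if taxon == dominant:
            continue
        rho = spearman(table[dominant], values)
        if abs(rho) > cutoff:
            result[taxon] = rho
    return result


# Toy relative abundances across five samples
table = {
    "Prevotella": [0.60, 0.45, 0.70, 0.30, 0.55],
    "GenusA": [0.10, 0.08, 0.12, 0.05, 0.09],
    "GenusB": [0.05, 0.20, 0.02, 0.30, 0.10],
}
print(correlated_taxa(table, "Prevotella"))
```

Running the same screen once per study and once on the pooled samples, then intersecting the resulting taxon sets, mirrors the comparison reported in this section. Note that rank correlations on relative abundances are subject to compositional effects, which is the motivation for SparCC and the proportionality analysis.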
To further understand groupings based on the abundance of Prevotella, we compared the Prevotella ratio (Prevotella/[Bacteroides + Prevotella]) for all samples in each study. Figure 2 illustrates the clear presence of samples across the full spectrum of relative abundances of Prevotella and Bacteroides, although different studies have varying numbers of samples in the intermediate ranges. Americans and Europeans, who make up the largest number of subjects, have fewer samples with intermediate Prevotella ratios, while studies containing more rural subjects have a greater number of intermediate samples. Bacteroides tends to be higher in Western populations, with a more even distribution of values across samples (Additional file 1: Figure S1), while Prevotella tends to have fewer intermediate values in the Western samples and more in Russian and rural samples (Additional file 6: Figure S2).
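The Prevotella ratio defined above, and the way its distribution can be summarized, can be sketched in a few lines of Python. The function names, the equal-width binning, and the toy values are assumptions for illustration only.

```python
# Illustrative sketch only: names, bin count, and data are hypothetical.

def prevotella_ratio(prevotella, bacteroides):
    """Prevotella / (Bacteroides + Prevotella); None if both are absent."""
    total = prevotella + bacteroides
    if total == 0:
        return None
    return prevotella / total


def bin_ratios(ratios, n_bins=5):
    """Count ratios in n_bins equal-width bins spanning [0, 1]."""
    counts = [0] * n_bins
    for r in ratios:
        # A ratio of exactly 1.0 is placed in the last bin
        index = min(int(r * n_bins), n_bins - 1)
        counts[index] += 1
    return counts


# Toy (Prevotella, Bacteroides) abundances for four samples
samples = [(30, 10), (0, 50), (25, 25), (40, 0)]
ratios = [prevotella_ratio(p, b) for p, b in samples]
print(ratios)
print(bin_ratios([r for r in ratios if r is not None]))
```

A strictly bimodal population would leave the middle bins empty, whereas a gradient fills them. Comparing the bin counts per study is a simple numerical check on the pattern shown in Figure 2.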
We compared the abundance distributions of Prevotella and Bacteroides with other common genera in the combined studies (Fig. 3; Additional file 7: Figure S11 shows the same boxplot without the Yatsunenko study). After Bacteroides and Prevotella, Lachnospiraceae, Faecalibacterium, Roseburia, Ruminococcus, Alistipes, and Coprococcus are among the next most common genera across all studies. Bacteroides, Prevotella, and Alistipes are members of the order Bacteroidales, while the remaining taxa are all members of Clostridiales. In approximately 42 % of samples, Bacteroides has a relative abundance between 10 and 40 %, with more than 80 % of samples having a relative abundance greater than 5 %. Prevotella has a relative abundance between 10 and 40 % in 16 % of samples, but 75 % of samples have a relative abundance less than 5 %. While Bacteroides exhibits a higher proportion of intermediate abundances overall than does Prevotella, Prevotella shows a spectrum of abundances rather than a bimodal distribution as some have assumed, with more samples between 40 and 80 % than any taxon other than Bacteroides.
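The percentages quoted above are fractions of samples whose relative abundance falls in a given range. A minimal Python sketch of that tally follows; the function names and the toy abundances are hypothetical, and whether the interval endpoints are inclusive is an assumption made here.

```python
# Illustrative sketch only: names and data are hypothetical.

def fraction_in_range(abundances, low, high):
    """Fraction of samples with low <= abundance <= high (inclusive)."""
    hits = sum(1 for a in abundances if low <= a <= high)
    return hits / len(abundances)


def fraction_above(abundances, threshold):
    """Fraction of samples with abundance strictly above threshold."""
    hits = sum(1 for a in abundances if a > threshold)
    return hits / len(abundances)


# Toy relative abundances of one genus across five samples
genus = [0.02, 0.12, 0.35, 0.50, 0.08]
print(fraction_in_range(genus, 0.10, 0.40))
print(fraction_above(genus, 0.05))
```

Applying the same two functions to each genus in turn would allow the intermediate-abundance fractions of Bacteroides, Prevotella, and the other common genera to be compared on equal terms.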