Phylogenetic approaches to microbial community classification
© Ning and Beiko. 2015
Received: 4 February 2015
Accepted: 28 September 2015
Published: 5 October 2015
The microbiota from different body sites are dominated by different major groups of microbes, but the variations within a body site such as the mouth can be more subtle. Accurate predictive models can serve as useful tools for distinguishing sub-sites and understanding key organisms and their roles and can highlight deviations from expected distributions of microbes. Good classification depends on choosing the right combination of classifier, feature representation, and learning model. Machine-learning procedures have been used in the past for supervised classification, but increased attention to feature representation and selection may produce better models and predictions.
We focused our attention on the classification of nine oral sites and dental plaque in particular, using data collected from the Human Microbiome Project. A key focus of our representations was the use of phylogenetic information, both as the basis for custom kernels and as a way to represent sets of microbes to the classifier. We also used the PICRUSt software, which draws on phylogenetic relationships to predict molecular functions and to generate additional features for the classifier. Custom kernels based on the UniFrac measure of community dissimilarity did not improve performance. However, feature representation was vital to classification accuracy, with microbial clade and function representations providing useful information to the classifier; combining the two types of features did not yield increased prediction accuracy. Many of the best-performing clades and functions had clear associations with oral microflora.
The classification of oral microbiota remains a challenging problem; our best accuracy on the plaque dataset was approximately 81 %. Perfect accuracy may be unattainable due to the close proximity of the sites and intra-individual variation. However, further exploration of the space of both classifiers and feature representations is likely to increase the accuracy of predictive models.
Marker-gene profiles of human microbiota can provide a detailed view of microbial diversity across many body sites. Body sites typically show very distinctive profiles; for example, healthy human gut samples are dominated by Bacteroidetes and Firmicutes, while skin samples tend to be much richer in Actinobacteria and other groups [2–4]. Clustering and ordination approaches such as principal coordinates analysis (PCoA) can illustrate the differences among different classes of body site [5, 6]. Similarly, many medical conditions are associated with dysbiosis, which is readily detectable through changes in the diversity or composition of human-associated microbes [4, 7–10]. Distinguishing samples within a site, such as on the skin (e.g., volar forearm, plantar foot) or in the oral cavity (e.g., plaque, throat, saliva), is often more difficult [11–14]. Understanding these finer-grained degrees of variation is critical for building models of healthy microbiota. Models that conflate different sites, or fail to distinguish successional patterns, may not be as sensitive in the detection of, for example, the transition from a healthy to diseased state.
The similarity of sites can be understood in a metacommunity framework as a combination of selective factors and proximity. From a selective point of view, similar environmental conditions such as site pH, oxygen availability, or adhesion potential may support the growth of taxonomically similar sets of bacteria [16, 17]. Geographic proximity can support mass-effect models where microbes from one site are transferred to another via migration processes. Examples of these processes include skin-to-skin contact within or between individuals and the transfer of microbes within the oral cavity due to direct contact and salivary mixing [17, 19]. The oral microbiome provides a particularly interesting test case for the classification of biodiversity, for several reasons. First, many ecologically distinct sites including different types of plaque, different surfaces, and saliva are found in close proximity [12, 20]. The oral habitat is highly variable with frequent inputs of nutrients, often followed by mechanical removal of the biofilm (e.g., via tooth brushing). The oral microbiome is also subject to well-characterized successional patterns and frequently transitions to a diseased state [7, 21].
Unsupervised approaches such as ordination and clustering build associations from the most salient patterns of variation in a dataset; these primary patterns may or may not correlate with the features of interest. By contrast, supervised classification approaches use knowledge of features to train models that can draw on any pattern of co-variation in the data [22, 23] and may perform better than unsupervised approaches when between-sub-site variation is small. Supervised approaches have previously been used to classify human microbiota [22, 24–26], using species or operational taxonomic units (OTUs) to distinguish different types of samples. However, microbiome data are typically high-dimensional, with potentially thousands of OTUs observed in each sample. Feature selection aims to identify a subset of all features that are most promising for classification, thereby eliminating uninformative features and decreasing the running time for the classifier. Even when the accuracy of a classifier is not substantially improved, feature selection can still reveal key species or molecular functions of particular biological interest, because only the set of features that are most useful to classification (typically a very small subset of all features) is retained.
Supervised methods are effective for many classification problems; however, many previous studies took all oral samples as one class and tried to distinguish them from microbes from body sites such as skin or gut [22, 24]. One area of potential improvement is the augmentation of generic machine-learning techniques with biological and evolutionary insights. For example, support vector machines (SVMs) can base their classifications on customized similarity values between samples from the same or different body sites; distances such as UniFrac [28–30] can be informed by phylogenetic relationships among species or OTUs. Similarly, the use of OTUs in classification builds on an assumption that groups of closely related organisms can be treated as functional or ecologically cohesive units. This assumption may be violated by strain-level variation and conversely may apply to aggregations of clades that are broader than a single OTU, which again suggests a phylogenetic approach. Finally, the recently developed PICRUSt algorithm can map taxonomic samples to functional profiles, based on known gene repertoires of closely related organisms: although less informative than shotgun metagenomic sequencing, such functional-prediction approaches may be more informative than taxonomic ones if key functions are decoupled from phylogenetic similarity due to processes such as lateral gene transfer and convergence. Transferred functions can become characteristic traits of phylogenetically distinct lineages, and PICRUSt can potentially identify sets of clades whose similarities are functional rather than phylogenetic. Some of these approaches yield significant increases in classification accuracy, while feature selection highlights key phylogenetic and functional features. We have implemented these ideas in a machine-learning framework and used oral microbiome samples from the Human Microbiome Project as a challenging test case.
Dataset and sequence preprocessing
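Table 1 Details of human oral cavity samples from HMP, with associated abbreviations
Site | Seqs/sample (mean ± sd) | OTUs/sample (mean ± sd)
(site name missing) | 8596 ± 6034 | 521 ± 183
Attached keratinized gingiva | 8998 ± 5756 | 313 ± 105
(site name missing) | 9465 ± 10,268 | 447 ± 166
(site name missing) | 8935 ± 6575 | 441 ± 154
(site name missing) | 9586 ± 7247 | 448 ± 146
(site name missing) | 9053 ± 7233 | 422 ± 147
(site name missing) | 10,351 ± 10,450 | 398 ± 129
(site name missing) | 9877 ± 5926 | 495 ± 147
(site name missing) | 10,413 ± 6564 | 497 ± 152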
The pre-processed dataset with barcode and primer sequences removed was obtained directly from the HMP DACC (see Table 1 for summary statistics). All samples were processed using QIIME version 1.8.0. Sequences were clustered into OTUs at 97 % similarity through UCLUST version 1.2.22q, using a closed-reference OTU-picking strategy with GreenGenes (gg_13_08) as our reference database. The resulting table contained the counts of each identified OTU in each sample. To account for disparities in OTU counts in different samples, we converted raw counts to proportions for each sample. Each of the nine oral cavity sites was used as the attribute label to be predicted, with relative OTU abundance as the potential predictors or features. The relative abundance was scaled such that the largest value in each sample was set to 1.0.
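The two normalization steps described above (counts to proportions, then rescaling so the largest value in each sample is 1.0) can be sketched as follows. This is a minimal illustration in NumPy rather than the QIIME-based processing used in the study; the function name `normalize_otu_table` and the toy counts are hypothetical.

```python
import numpy as np

def normalize_otu_table(counts):
    """Convert raw OTU counts (samples x OTUs) to per-sample proportions,
    then rescale each sample so its most abundant OTU has the value 1.0."""
    counts = np.asarray(counts, dtype=float)
    totals = counts.sum(axis=1, keepdims=True)
    totals[totals == 0] = 1.0  # guard: leave all-zero samples as zeros
    proportions = counts / totals
    row_max = proportions.max(axis=1, keepdims=True)
    row_max[row_max == 0] = 1.0
    return proportions / row_max

# Hypothetical toy table: 2 samples x 3 OTUs (not data from the study).
toy_counts = [[10, 30, 60], [5, 5, 0]]
scaled = normalize_otu_table(toy_counts)
```

After this transformation every sample has at least one feature equal to 1.0, so samples with very different sequencing depths are placed on a common scale before they reach the classifier.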
Phylogenetic beta diversity
Sequences were aligned using QIIME’s default alignment method PyNAST version 1.2.2q, which implements the NAST alignment algorithm in Python. Sequences with alignment length <150 nucleotides or <75 % identity with the reference dataset were removed. A phylogenetic tree of OTUs was generated from the sequence alignment using FastTree version 2.1.3q. Trees were visualized with ETE version 2.1.
We used 14 non-phylogenetic and phylogenetic beta-diversity metrics to calculate the distance between each pair of samples with QIIME. To visualize the dissimilarity of the samples, PCoA was performed to display clusters of samples in a low-dimensional space. We also used the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) approach to build hierarchical clusters that group the samples by their similarity.
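As an illustration of this workflow, the sketch below computes one non-phylogenetic measure (Bray-Curtis), projects the samples with classical PCoA, and builds a UPGMA tree. It uses SciPy and NumPy instead of QIIME, and the `pcoa` helper and the four-sample abundance table are assumptions made for the example; phylogenetic measures such as UniFrac would additionally require the OTU tree.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist, squareform

def pcoa(dist, n_axes=2):
    """Classical PCoA: double-centre the squared distances and keep the
    leading eigenvectors, each scaled by the square root of its eigenvalue."""
    d2 = np.asarray(dist, dtype=float) ** 2
    n = d2.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ d2 @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    order = np.argsort(eigvals)[::-1][:n_axes]
    # Non-Euclidean measures can give small negative eigenvalues; clip them.
    return eigvecs[:, order] * np.sqrt(np.clip(eigvals[order], 0.0, None))

# Hypothetical relative-abundance table: 4 samples x 3 OTUs.
abund = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1],
                  [0.1, 0.2, 0.7],
                  [0.1, 0.1, 0.8]])

condensed = pdist(abund, metric="braycurtis")       # pairwise distances
coords = pcoa(squareform(condensed))                # samples x 2 PCoA axes
upgma_tree = linkage(condensed, method="average")   # "average" linkage is UPGMA
groups = fcluster(upgma_tree, t=2, criterion="maxclust")
```

In this toy case the first two and last two samples fall into separate UPGMA clusters, which is the kind of grouping the PCoA plots and hierarchical clusters are meant to reveal.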
Feature representation and selection
In addition to phylogenetic features such as taxon and clade abundance, we also generated functional predictions to use as input features. PICRUSt is a tool that can predict the functional repertoire of genomes associated with environmental sequences by mapping the content of closely related sequenced genomes. We used PICRUSt 1.0.0 to predict functional trait abundance based on the previously constructed OTU tree. The Kyoto Encyclopedia of Genes and Genomes (KEGG) orthology group descriptions were used as the basis for functional predictions. As we did with OTUs, function counts were scaled to a maximum of 1.0 in each sample. We used the nearest sequenced taxon index (NSTI) of PICRUSt to estimate the reliability of functional predictions. NSTI sums all the branch lengths separating each OTU and its respective nearest sequenced genome, weighting for the relative abundance of each OTU in the community. Community members that have no close sequenced relatives will make a larger contribution to the NSTI score and have predictions that are in general less reliable.
Taxonomic and functional mapping generated thousands of features, many of which are likely to be uninformative and which, in aggregate, can reduce the speed and accuracy of SVM training. Although some species appear in only a small number of samples, rare features may nonetheless be useful for classification and should not be removed by default. We used feature selection to accelerate learning by removing uninformative features. Among the multitude of available feature selection techniques, we used two types of approach: filter methods, which consider the usefulness of features (often one at a time) based on their apparent relevance to the classification problem, and wrapper methods, which assess features by quantifying their effect on the accuracy of a trained model. The filter methods used were information gain, which ranks the features based on the amount of predictive information obtained from the presence or absence of a term, and the chi-square (χ²) statistic, which quantifies the extent of correspondence between a feature and the class label. Filter methods are fast and suitable for problems of high dimensionality but are independent of the classifier and often ignore interactions between features. We also considered random forest (RF) feature permutation as a wrapper method. In this approach, variables were ranked based on the effect of randomizing their values between the categories to be predicted. In the context of a trained RF classifier, randomizing a useful variable would lead to a significant drop in accuracy, whereas a similar procedure on an uninformative variable would have no effect. Although OTUs with strong marginal effects (i.e., those that have good predictive power independent of any other variables) should be identified by all three approaches, useful combinations of variables might be highlighted by the RF approach.
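The NSTI description above amounts to an abundance-weighted average of branch lengths. A minimal sketch of that arithmetic is shown below; it is not PICRUSt's own code, whose implementation details may differ, and the function name and the three-OTU example values are hypothetical.

```python
import numpy as np

def weighted_nsti(rel_abundance, nearest_genome_distance):
    """Abundance-weighted mean branch length between each OTU and its
    nearest sequenced genome (an illustrative reading of the NSTI)."""
    abundance = np.asarray(rel_abundance, dtype=float)
    distance = np.asarray(nearest_genome_distance, dtype=float)
    return float(np.sum(abundance * distance) / np.sum(abundance))

# Hypothetical community of three OTUs: relative abundances and the branch
# length from each OTU to its closest sequenced reference genome.
nsti_value = weighted_nsti([0.5, 0.3, 0.2], [0.01, 0.05, 0.20])
```

Here the rare OTU with a distant reference (branch length 0.20) contributes 0.04 of the total 0.06, which illustrates why community members without close sequenced relatives dominate the score.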
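The contrast between a filter score and permutation-based ranking can be sketched with scikit-learn. This is an illustrative example under stated assumptions, not the pipeline used in the study: the data are synthetic, the helper `permutation_drop` is hypothetical, and the permutation step is evaluated on a simple held-out half of the data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import chi2

rng = np.random.default_rng(0)

# Hypothetical data: 200 samples x 5 non-negative features. Only feature 0
# carries class information; the other four are noise.
n_samples = 200
y = np.tile([0, 1], n_samples // 2)
X = rng.random((n_samples, 5))
X[:, 0] = 0.1 * rng.random(n_samples) + 0.8 * y

X_train, y_train = X[:100], y[:100]
X_test, y_test = X[100:], y[100:]

# Filter method: chi-square statistic of each feature against the label.
chi2_scores, _ = chi2(X_train, y_train)
chi2_rank = np.argsort(chi2_scores)[::-1]

def permutation_drop(model, X_eval, y_eval, rng, n_repeats=5):
    """Mean drop in accuracy after shuffling each feature across samples."""
    baseline = model.score(X_eval, y_eval)
    drops = np.zeros(X_eval.shape[1])
    for j in range(X_eval.shape[1]):
        for _ in range(n_repeats):
            shuffled = X_eval.copy()
            shuffled[:, j] = rng.permutation(shuffled[:, j])
            drops[j] += baseline - model.score(shuffled, y_eval)
    return drops / n_repeats

# Permutation ranking in the context of a trained random forest.
forest = RandomForestClassifier(n_estimators=50, random_state=0)
forest.fit(X_train, y_train)
rf_rank = np.argsort(permutation_drop(forest, X_test, y_test, rng))[::-1]
```

Both rankings place the informative feature first in this toy setting because it has a strong marginal effect; the approaches would be expected to diverge mainly when features are useful only in combination.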
Classification using SVMs
SVMs have been widely used in various applications since their introduction by Cortes and Vapnik in 1995. SVMs are model-based classification methods that try to maximize the width of a decision boundary between categories. This decision boundary or hyperplane is typically defined by a small number of boundary cases (the support vectors) with relatively small distances to cases of the other type. A key attribute of SVMs is their ability to accept any similarity values that satisfy a set of constraints; the “kernel trick” allows mapping of cases into a higher-dimensional space where the linear SVM classifier can perform well.
Classification was performed using the LIBSVM package. We chose the generic one-parameter radial basis function (RBF) kernel for classification. To pick the best combination of kernel width γ and error penalty parameter C, we performed a grid search over every combination of C and γ drawn from finite sets of candidate values (log2 C from −5 to 15, log2 γ from −15 to 3). A fivefold cross-validation approach was adopted to evaluate the classification models. This cross-validation procedure was repeated 100 times for each trial, each time using a different random number seed, in order to generate distributions of accuracy scores.
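A grid search of this kind can be sketched with scikit-learn, whose `SVC` class wraps LIBSVM. The example below is an approximation under several assumptions: the exponent step of 2 follows common LIBSVM practice and is not stated in the text, the data are synthetic, only three repeats are run instead of 100, and the best grid-search score is reported directly, which is optimistic compared with a fully nested evaluation.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Candidate values: log2 C in [-5, 15], log2 gamma in [-15, 3].
# The step of 2 between exponents is an assumption.
C_grid = 2.0 ** np.arange(-5, 16, 2)
gamma_grid = 2.0 ** np.arange(-15, 4, 2)

# Hypothetical two-class data standing in for the OTU feature table.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (30, 4)), rng.normal(3.0, 1.0, (30, 4))])
y = np.repeat([0, 1], 30)

def repeated_grid_search(X, y, seeds):
    """Run a fivefold cross-validated grid search once per random seed and
    collect the best mean accuracy from each run."""
    scores = []
    for seed in seeds:
        folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
        search = GridSearchCV(SVC(kernel="rbf"),
                              {"C": C_grid, "gamma": gamma_grid}, cv=folds)
        search.fit(X, y)
        scores.append(search.best_score_)
    return np.array(scores)

accuracy_scores = repeated_grid_search(X, y, seeds=range(3))
```

Repeating the procedure with different seeds changes how samples are assigned to folds, which is what produces a distribution of accuracy scores instead of a single estimate.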
Generic polynomial and radial basis function kernels are widely used, but custom kernels that incorporate biological insights can be useful as well. For example, alignment-based kernels improved SVM performance in subcellular protein localization prediction [48, 49]. Since phylogenetic distance is an effective measure in the comparison of microbial communities, we developed a custom kernel based on the weighted and unweighted UniFrac distances. We also constructed several non-phylogenetic beta-diversity kernels including Bray-Curtis and Euclidean distance (Additional file 4). Since beta-diversity expresses the dissimilarity between each pair of samples, we subtracted each such value from 1.0 in order to generate similarity values for the SVM classifier. These similarity scores were combined with several different OTU table preprocessing approaches, including raw OTU count, relative abundance, rarefied counts from 500 to 3000 per sample, and cumulative sum scaling (CSS) normalization.
Although our focus was on SVMs, we also considered two other supervised classification methods, SourceTracker and RF. SourceTracker is a Bayesian approach that assigns probabilities that a given sample is derived from each of a set of environment types. By calculating the posterior probability of each source environment assignment with Gibbs sampling, SourceTracker gives probabilities of where a sink sample came from. We used SourceTracker version 0.9.5 software as implemented in QIIME with default settings. Analogous to fivefold cross-validation, the set of samples was divided into five subsets: one subset was used as sink samples while the other four were used as source samples. We repeated this process five times with different cross-validation subsampling. RFs, first introduced in 2001, are an ensemble method merging decision trees with voting schemes. Each decision tree is constructed based on M (mtry) randomly chosen features from the input dataset. The prediction of every sample is determined by the majority vote of all these decision trees. RF classifiers are popular both for feature selection and classification and were found by Knights et al. to perform well on several test datasets. RF classification was implemented with scikit-learn 0.15.
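The conversion from a beta-diversity measure to an SVM kernel can be sketched as follows, using Bray-Curtis because it needs no phylogenetic tree. This is an illustration with scikit-learn's precomputed-kernel interface and synthetic compositions, not the study's LIBSVM setup; the `similarity_kernel` helper is hypothetical. Note that one minus a dissimilarity is not guaranteed to be a valid (positive semidefinite) kernel for every measure.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.svm import SVC

def similarity_kernel(A, B, metric="braycurtis"):
    """Similarity between every row of A and every row of B, obtained by
    subtracting the pairwise dissimilarity from 1.0."""
    return 1.0 - cdist(A, B, metric=metric)

# Hypothetical compositions: class 0 is dominated by the first OTU,
# class 1 by the last OTU.
rng = np.random.default_rng(1)
X = np.vstack([rng.dirichlet([8, 2, 1], 20), rng.dirichlet([1, 2, 8], 20)])
y = np.repeat([0, 1], 20)
train, test = np.arange(0, 40, 2), np.arange(1, 40, 2)

K_train = similarity_kernel(X[train], X[train])   # (n_train, n_train)
K_test = similarity_kernel(X[test], X[train])     # (n_test, n_train)

model = SVC(kernel="precomputed").fit(K_train, y[train])
accuracy = float(np.mean(model.predict(K_test) == y[test]))
```

A precomputed kernel has no free width parameter, so only C remains available for tuning.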
Results and discussion
OTU diversity across nine oral cavity sites
A total of 2706 human oral cavity samples from nine oral sites were collected from the HMP database. The samples covered the V3–V5 region of the 16S rRNA gene (Table 1, Additional file 2). All sites had at least 281 associated samples. The total OTU richness across all samples of a given site varied from a minimum of 3741 (attached keratinized gingiva) to over 6000 (saliva and throat). The average number of sequences per sample ranged from approximately 8500 to 11,500, although the variation within each site was high.
Classification of all oral cavity sites
Recoloring sample points in the original PCoA plot to reflect the four groups (Fig. 2b) shows a clearer distinction among sites, albeit still with a substantial amount of overlap among all but the plaque group. The nine sites were recoded into their four constituent categories and once again classified using an SVM with the RBF kernel (Fig. 3b). The classification accuracy of plaque samples was 96.9 %, as compared with 73.2 % accuracy for saliva, 87.0 % for gums, and 92.3 % for the back of the mouth. Since the two types of plaque samples were well separated from the other groups but difficult to distinguish from each other based on the confusion matrix in Fig. 3a, we chose to focus on this two-class problem in order to try to improve the classification accuracy for a challenging subset of sites.
Distinguishing plaque samples using taxonomic information
Table: Accuracy of SVM classifiers trained with different combinations of input features. Columns: type of input features; cross-validation accuracy (number of features) without feature selection and with feature selection.
SVM with custom beta-diversity kernels
We developed custom kernels based on 14 different beta-diversity measures that express the similarity between all pairs of samples. The hypothesis underlying the use of these kernels is that similarity scores based on ecological similarity measures will outperform a naïve RBF kernel, especially when these measures are based on information not available to the classifier (for example, phylogenetic information in the case of UniFrac). The performance of SVMs with different custom kernels and OTU table preprocessing approaches is given in Additional file 4. Colors are consistent with previous work, which identified subsets of beta-diversity measures that gave highly correlated predictions. Phylogenetic measures did not yield improved accuracy relative to non-phylogenetic measures: for example, the widely used unweighted and weighted UniFrac measures yielded 74.4 and 73.7 % accuracy, respectively. The Canberra distance, recommended by Kuczynski et al., yielded an accuracy score of 76.5 %, an improvement on UniFrac but still worse than using OTU abundance with an RBF kernel. The OTU tables used to calculate distances were pre-processed with four different methods. CSS normalization was able to separate different samples well, especially for Euclidean distance. OTU table rarefaction produced the lowest score and largest deviation. Although many types of microbial samples cluster well based on beta-diversity measures such as Bray-Curtis or UniFrac, this is clearly not the case with the two types of plaque. A possible reason for the discrepancy between RBF and our custom kernels is that the gamma parameter of the radial basis function is optimized in the SVM grid search, whereas none of the 14 beta-diversity measures has a free parameter that can similarly be optimized.
Classification with clade-abundance features
We then highlighted the optimal features that were selected by each method in the phylogenetic tree. The filter methods, information gain (Fig. 5b) and chi-square (Fig. 5c), chose similar clades including a large clade within Bacteroidetes and smaller groupings within Firmicutes and Fusobacteria. The chi-square approach chose the largest number of features, including Spirochaetes and Clostridia clades that were not chosen by the information gain criterion. By contrast, the RF feature permutation approach, which included the fewest features in its optimal set, selected a greater diversity of features (Fig. 5d). This set of features included unique clades of Firmicutes and Actinobacteria that were not identified by the information gain or chi-square approaches. For all three feature selection methods, near-optimal classification accuracy was obtained for many different numbers of selected features, suggesting that some of the highlighted clades in Fig. 5 may not be important for classification purposes. Nonetheless, the higher variety of features selected by the RF feature permutation approach shows the value of testing combinations of features during the selection process.
Classification based on predicted functional profiles
The PICRUSt software allows the prediction of functional gene complements in microbial samples that have been characterized with marker genes such as 16S. We used these predictions as the basis for classification: if the functional capacity of microbes in a system is more important than their specific taxonomic affiliations, then a function-based approach to classification may yield higher accuracy. PICRUSt uses phylogenetic information to make its predictions, and thus functional information will be highly correlated with the OTU and clade data. However, since phylogenetically distant lineages can share common functional features, the predictions made by PICRUSt may identify functional similarities between OTUs that belong to different high-level taxonomic groups such as classes and phyla. Thus, the predictions made by PICRUSt are not completely redundant with the OTU and clade features considered in this work.
To measure the reliability of the functional predictions, we calculated the NSTI values for each sample (mean NSTI = 0.04 ± 0.01 sd). This is similar to the values reported for HMP samples (mean NSTI = 0.03 ± 0.02 sd), which were generally well predicted by PICRUSt, as compared with 0.23 ± 0.07 for a less well-predicted hypersaline community. A total of 6191 KEGG orthologs, which incorporate functional predictions in addition to homology information, were used as input features to an SVM with an RBF kernel as performed above (Additional file 8). The cross-validated accuracy of the model trained with all features was 76.1 %, very similar to the value obtained with the corresponding OTU abundance model. These observations are consistent with those of Xu et al., who found that taxonomy alone was sufficient to model microbial community structure. Functional features are still useful for predictive purposes, but their failure to improve classification accuracy may be attributable to several factors. It may be that the crucial functions are not well annotated by KEGG, because of misannotations or a failure to assign to any meaningful functional category. The granularity of KEGG functional attributions and the presence of irrelevant features may also impede the discovery of important predictive attributes.
Three of the top ten groups of features included only functional features (KOs), while the other seven included at least one taxonomic and one functional feature (Additional file 9). Two groups, including the group ranked first in feature selection, were restricted to Streptococcus, with several clades in this group restricted to the opportunistic pathogen Streptococcus anginosus. Two additional groups included other clades of Streptococcus, underscoring the importance of different members of this genus in the oral cavity. Although Streptococcus is typically a more significant component of supragingival plaque, consistent with its facultative anaerobic lifestyle, three of the Streptococcus-containing groups were overrepresented in subgingival plaque, while the fourth was 50 % more abundant in supragingival plaque. This finding suggests that the most common types of Streptococcus may not be the best discriminators between the two types of plaque. Other selected groups of features were broader in their taxonomic distributions, although some of these groups included genera such as Prevotella, Fusobacterium, and Dialister. The second-ranked group included eight genera (including Streptococcus) as well as a number of higher clades, which suggests a set of co-occurring and possibly interacting bacteria that are characteristic of subgingival plaque.
The best feature group included a single functional class, sagA, which encodes the basic structural unit of Streptolysin S (SLS). Bacteria such as Streptococcus pyogenes use SLS to lyse host cells and acquire iron [59, 60]. Given its strong correlation with the primary feature in group 1 (Spearman r = 0.97, p < 10^−250), this function appears to be strongly associated with subgingival plaque. Functions that were either the primary feature for their group or had the best correlation with the primary feature of their group included a beta-lactam resistance protein, overrepresented in subgingival plaque; streptokinase, which can aid the spread of Streptococcus infection through cleavage of fibrils; proteins involved in resistance to tellurium and vancomycin; and a type IV secretion system component. Although many of the implicated functions relate to host-microbial interactions, we found no clear, strong connections to aerobic or anaerobic lifestyles.
Comparison with other supervised learning methods
RFs were also implemented with sets of 10 to 200 features ranked highly by the three feature selection methods. The running time of SourceTracker was much higher, and we considered only the entire set of features when testing its performance. For the RF model, the lowest and highest accuracies were respectively achieved by functional abundance and hybridization features, consistent with the SVM results (Additional file 10). OTU and clade representations had similar performance when features were ranked by information gain (Additional file 10A) and chi-square (Additional file 10B), but clade abundance had improved accuracy when RF feature permutation (Additional file 10C) was used to rank features. SourceTracker is able to estimate the proportion of possible sources for each sample; the source with the largest associated probability was used as the final prediction. SourceTracker models trained on OTU features outperformed all other representations, suggesting that redundant features impeded the accuracy of the classifier.
Both SourceTracker and RF had a similar performance on the plaque datasets, with a classification accuracy between 75 and 78 % when OTU abundance features were used. However, the three methods correctly classified overlapping but non-identical subsets of all plaque samples. Additional file 11 compares the classification of all samples for each pairwise combination of methods. The plots show all three methods had consistent predictions for most of the samples, but a small subset of samples were consistently correctly predicted by one approach and incorrectly predicted by the other. This pattern suggests that “ensemble” methods which combine the predictions of multiple classifiers may be able to outperform individual classifiers.
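The ensemble idea raised here can be illustrated with a simple majority vote over the predictions of several classifiers. The sketch below is hypothetical: no ensemble was built in this work, the function `majority_vote` is introduced only for illustration, and the label vectors are invented to show how classifiers that err on different samples can correct one another.

```python
import numpy as np

def majority_vote(predictions):
    """Combine integer class labels from several classifiers.
    predictions: array of shape (n_classifiers, n_samples).
    Ties are resolved in favour of the lowest class label."""
    predictions = np.asarray(predictions)
    n_classes = int(predictions.max()) + 1
    # votes[c, i] = number of classifiers that assigned sample i to class c
    votes = np.stack([(predictions == c).sum(axis=0) for c in range(n_classes)])
    return votes.argmax(axis=0)

# Invented labels (0 = subgingival, 1 = supragingival); each hypothetical
# classifier misclassifies a different single sample.
truth = np.array([0, 0, 1, 1, 0, 1])
svm_pred = np.array([0, 0, 1, 0, 0, 1])
rf_pred = np.array([0, 1, 1, 1, 0, 1])
sourcetracker_pred = np.array([0, 0, 1, 1, 1, 1])

ensemble_pred = majority_vote([svm_pred, rf_pred, sourcetracker_pred])
ensemble_accuracy = float(np.mean(ensemble_pred == truth))
```

The vote helps only when the errors of the individual classifiers are not shared; samples misclassified by two or more methods remain misclassified.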
A primary objective of machine learning is to train models that can distinguish classes of entities, in this case, microbial samples encoded as OTU tables, with a high accuracy. In our examination of oral cavity samples, the best test set accuracy scores we have obtained are on the order of 80–81 %, which demonstrates useful learning but is of little value for diagnostic applications and is probably not suitable as a reference model for comparison with diseased states, for example, unless it is satisfactory to use both plaque types as a single reference group. Previous authors have tested many different machine-learning algorithms on reference data sets [22, 24–26, 58]; our exclusive focus here on SVMs allowed us to consider different encodings of the input data and biologically inspired kernels in detail. Of the modifications we tried, clade-based representations gave the largest increase in performance. Although the combinations of OTUs that constituted clades could in principle be discovered by the classifier, it is clear that explicit clade representations yielded some advantage in both feature selection and classification. Selected clades contained genera known to be important in the human oral cavity, in particular Streptococcus. Our predictive approach to function did not improve the accuracy of our classifiers, in spite of the potential for PICRUSt to identify functional as well as phylogenetic connections between OTUs and clades. It may be that shotgun metagenome sequencing, which generates accurate information about even those genes that are frequently transferred, may yield a higher predictive accuracy.
Why did we not obtain higher accuracy scores? Previous work suggests that a different choice of classifier or data encoding may yield higher classification accuracy; clearly, further work is needed to explore this question, and there is a multitude of different approaches that can be applied to the data. Changing the definition and inference of OTUs may improve performance as well: in particular, changing the OTU threshold from 97 to 99 % would highlight finer-scale differences in abundance, for example, differences that may manifest only at or below the species level. In this work, we used closed-reference OTU picking because it maps sampled sequences to reference groups that are defined prior to the analysis. However, closed-reference picking discards any sequences that do not map to existing OTUs at the required level of sequence similarity, a phenomenon that is especially acute at higher thresholds such as 99 %. An approach that combines closed-reference and de novo OTU generation would likely be ideal but requires that new OTUs be comparable between samples and across studies. An alternative would be to use methods that modify the OTU definition or dispense with it altogether, such as SWARM or approaches that compare observed sequences across multiple samples.
In the case of oral samples, and gingival samples in particular, complete separation (i.e., 100 % classification accuracy) may not be achievable, for several reasons. Chief among these is the physical proximity of the supragingival and subgingival plaque. Although the two sites are different in terms of nutrient and oxygen availability, among other factors, some amount of mixing is inevitable due to mass effects, even if microbes from one site are not viable in the other. Sample misidentification may also contribute to diminished classification; indeed, this was one identified use of SourceTracker. However, we expect the impact of misidentified samples to be very low, for two reasons: first, the HMP followed very strict protocols regarding the collection and handling of samples; second, the overlapping of sample types we see in Fig. 2 suggests a gradient of diversity from one sample type to the others, rather than a few scattered outliers that might be indicative of misclassified samples. It is also unlikely that there is a single type of “healthy” subgingival and supragingival microbial community, which would impede the ability of a classifier to learn a single, general model of classification.
We thank Debora Matthews and Mary McNally for the helpful feedback on earlier versions of the manuscript. We also thank Dennis Wong and Michael Hall for assistance with data set preparation and experimental design. This work was supported with funding from the Faculty of Computer Science, Dalhousie University, and a grant to RGB from the Natural Sciences and Engineering Research Council of Canada. Computational analysis was supported by a grant from the Canada Foundation for Innovation. RGB acknowledges the support of the Canada Research Chairs program.
Open Access. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Costello EK, Lauber CL, Hamady M, Fierer N, Gordon JI, Knight R. Bacterial community variation in human body habitats across space and time. Science. 2009;326:1694–7.
- Zhou Y, Gao H, Mihindukulasuriya KA, La Rosa PS, Wylie KM, Vishnivetskaya T, et al. Biogeography of the ecosystems of the healthy human body. Genome Biol. 2013;14:R1.
- Schloss PD. Microbiology: an integrated view of the skin microbiome. Nature. 2014;514:44–5.
- Cho I, Blaser MJ. The human microbiome: at the interface of health and disease. Nat Rev Genet. 2012;13:260–70.
- Parks DH, Beiko RG. Measures of phylogenetic differentiation provide robust and complementary insights into microbial communities. ISME J. 2013;7:173–83.
- Huse SM, Ye Y, Zhou Y, Fodor AA. A core human microbiome as viewed through 16S rRNA sequence clusters. PLoS One. 2012;7:1–12.
- Galimanas V, Hall MW, Singh N, Lynch MDJ, Goldberg M, Tenenbaum H, et al. Bacterial community composition of chronic periodontitis and novel oral sampling sites for detecting disease indicators. Microbiome. 2014;2:32.
- Turnbaugh PJ, Hamady M, Yatsunenko T, Cantarel BL, Duncan A, Ley RE, et al. A core gut microbiome in obese and lean twins. Nature. 2009;457:480–4.
- Schmidt BL, Kuczynski J, Bhattacharya A, Huey B, Corby PM, Queiroz ELS, et al. Changes in abundance of oral microbiota associated with oral cancer. PLoS One. 2014;9:e98741.
- Wade WG. The oral microbiome in health and disease. Pharmacol Res. 2013;69:137–43.
- Grice EA, Kong HH, Conlan S, Deming CB, Davis J, Young AC, et al. Topographical and temporal diversity of the human skin. Science. 2009;324:1190–2.
- Segata N, Haake SK, Mannon P, Lemon KP, Waldron L, Gevers D, et al. Composition of the adult digestive tract bacterial microbiome based on seven mouth surfaces, tonsils, throat and stool samples. Genome Biol. 2012;13:R42.
- Ximénez-Fyvie LA, Haffajee AD, Socransky SS. Comparison of the microbiota of supra- and subgingival plaque in health and periodontitis. J Clin Periodontol. 2000;27:648–57.
- Bik EM, Long CD, Armitage GC, Loomer P, Emerson J, Mongodin EF, et al. Bacterial diversity in the oral cavity of 10 healthy individuals. ISME J. 2010;4:962–74.
- Costello EK, Stagaman K, Dethlefsen L, Bohannan BJM, Relman DA. The application of ecological theory. Science. 2012;336(6086):1255–62.
- Ding T, Schloss PD. Dynamics and associations of microbial community types across the human body. Nature. 2014;509:357–60.
- Simón-Soro A, Tomás L, Cabrera-Rubio R, Catalan MD, Nyvad B, Mira A. Microbial geography of the oral cavity. J Dent Res. 2013;92:616–21.
- Meadow JF, Bateman AC, Herkert KM, O’Connor TK, Green JL. Significant changes in the skin microbiome mediated by the sport of roller derby. PeerJ. 2013;1:e53.
- Kort R, Caspers M, Van De GA, Van EW, Keijser B, Roeselers G. Shaping the oral microbiota through intimate kissing. Microbiome. 2014;2:1–8.
- Faust K, Sathirapongsasuti JF, Izard J, Segata N, Gevers D, Raes J, et al. Microbial co-occurrence relationships in the human microbiome. PLoS Comput Biol. 2012;8:e1002606.
- Claridge JE, Attorri S, Musher DM, Hebert J, Dunbar S. Streptococcus intermedius, Streptococcus constellatus, and Streptococcus anginosus (“Streptococcus milleri group”) are of different clinical importance and are not equally associated with abscess. Clin Infect Dis. 2001;32:1511–5.
- Knights D, Costello EK, Knight R. Supervised classification of human microbiota. FEMS Microbiol Rev. 2011;35:343–59.
- Knights D, Kuczynski J, Charlson ES, Zaneveld J, Mozer MC, Collman RG, et al. Bayesian community-wide culture-independent microbial source tracking. Nat Methods. 2011;8:761–3.
- Statnikov A, Henaff M, Narendra V, Konganti K, Li Z, Yang L, et al. A comprehensive evaluation of multicategory classification methods for microbiomic data. Microbiome. 2013;1:11.
- Wang Y, Zhou Y, Li Y, Ling Z, Zhu Y, Guo X, et al. An improved dimensionality reduction method for meta-transcriptome indexing based diseases classification. BMC Syst Biol. 2012;6(3):S12.
- Liu Z, Hsiao W, Cantarel BL, Drábek EF, Fraser-Liggett C. Sparse distance-based learning for simultaneous multiclass classification and feature selection of metagenomic data. Bioinformatics. 2011;27:3242–9.
- Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007;23:2507–17.
- Lozupone C, Knight R. UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol. 2005;71(12):8228–35.
- Lozupone C, Lladser ME, Knights D, Stombaugh J, Knight R. UniFrac: an effective distance metric for microbial community comparison. ISME J. 2011;5:169–72.
- Chang Q, Luan Y, Sun F. Variance adjusted weighted UniFrac: a powerful beta diversity measure for comparing communities based on phylogeny. BMC Bioinformatics. 2011;12:118.
- Langille MGI, Zaneveld J, Caporaso JG, McDonald D, Knights D, Reyes JA, et al. Predictive functional profiling of microbial communities using 16S rRNA marker gene sequences. Nat Biotechnol. 2013;31:814–21.
- Andam CP, Gogarten JP. Biased gene transfer and its implications for the concept of lineage. Biol Direct. 2011;6:47.
- The NIH HMP Working Group. The NIH human microbiome project. Genome Res. 2009;19:2317–23.
- Human microbiome project [ftp://public-ftp.hmpdacc.org]. Accessed February 4, 2014.
- Gonzalez A, Stombaugh J, Lauber CL, Fierer N, Knight R. SitePainter: a tool for exploring biogeographical patterns. Bioinformatics. 2012;28:436–8.
- Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7:335–6.
- Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–1.
- DeSantis TZ, Hugenholtz P, Larsen N, Rojas M, Brodie EL, Keller K, et al. Greengenes, a chimera-checked 16S rRNA gene database and workbench compatible with ARB. Appl Environ Microbiol. 2006;72:5069–72.
- Caporaso JG, Bittinger K, Bushman FD, Desantis TZ, Andersen GL, Knight R. PyNAST: a flexible tool for aligning sequences to a template alignment. Bioinformatics. 2010;26:266–7.
- Price MN, Dehal PS, Arkin AP. FastTree: computing large minimum evolution trees with profiles instead of a distance matrix. Mol Biol Evol. 2009;26:1641–50.
- Huerta-Cepas J, Dopazo J, Gabaldón T. ETE: a Python environment for tree exploration. BMC Bioinformatics. 2010;11:24.
- Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic Acids Res. 2012;40:109–14.
- Yang Y, Pedersen JO. A comparative study on feature selection in text categorization. Mach Learn Work Then Conf. 1997;9:412–20.
- Zheng Z, Wu X, Srihari R, Srihani R. Feature selection for text categorization on imbalanced data. ACM SIGKDD Explor Newsl. 2004;6:80–9.
- Altmann A, Toloşi L, Sander O, Lengauer T. Permutation importance: a corrected feature importance measure. Bioinformatics. 2010;26:1340–7.
- Cortes C, Vapnik V. Support-vector networks. Mach Learn. 1995;20:273–97.
- Chang C-C, Lin C-J. LIBSVM. ACM Trans Intell Syst Technol. 2011;2:1–27.
- Davis L, Hawkins J, Maetschke SR, Boden M. Comparing SVM sequence kernels: a subcellular localization theme. 2006 Work Intell Syst Bioinforma (WISB 2006). 2006;73(Platt):39–47.
- Chen J, Li H. Topics in applied statistics. Springer Proceedings in Mathematics & Statistics. 2013;55:191–201.
- Paulson JN, Stine OC, Bravo HC, Pop M. Differential abundance analysis for microbial marker-gene surveys. Nat Methods. 2013;10:1200–2.
- Breiman L. Random forests. Mach Learn. 2001;45:5–32.
- Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, et al. Scikit-learn: machine learning in Python. J Mach Learn Res. 2011;12:2825–30.
- McInnes P, Cutting M. Manual of procedures for human microbiome project: Core microbiome sampling, protocol A, HMP protocol no. 07–001, version 11. 2010. Current version: http://hmpdacc.org/doc/HMP_MOP_Version12_0_072910.pdf.
- Daniluk T, Tokajuk G. Aerobic and anaerobic bacteria in subgingival and supragingival plaques of adult patients with periodontal disease. Adv Med Sci. 2006;51(1):81–5.
- Zijnge V, Van Leeuwen MBM, Degener JE, Abbas F, Thurnheer T, Gmür R, et al. Oral biofilm architecture on natural teeth. PLoS One. 2010;5:1–9.
- Aas JA, Paster BJ, Stokes LN, Olsen I, Dewhirst FE. Defining the normal bacterial flora of the oral cavity. J Clin Microbiol. 2005;43:5721–32.
- Kuczynski J, Liu Z, Lozupone C, McDonald D, Fierer N, Knight R. Microbial community resemblance methods differ in their ability to detect biologically relevant patterns. Nat Methods. 2010;7:813–9.
- Xu Z, Malmer D, Langille MGI, Way SF, Knight R. Which is more important for classifying microbial communities: who’s there or what they can do? ISME J. 2014;8:1–3.
- Salim KY, De Azavedo JC, Bast DJ, Cvitkovitch DG. Role for sagA and siaA in quorum sensing and iron regulation in Streptococcus pyogenes. Infect Immun. 2007;75:5011–7.
- Bates CS, Montañez GE, Woods CR, Vincent RM, Eichenbaum Z. Identification and characterization of a Streptococcus pyogenes operon involved in binding of hemoproteins and acquisition of iron. Infect Immun. 2003;71:1042–55.
- Schymeinsky J, Mócsai A, Walzog B. Neutrophil activation via beta2 integrins (CD11/CD18): molecular mechanisms and clinical implications. Thromb Haemost. 2007;98:262–73.
- Mahé F, Rognes T, Quince C, de Vargas C, Dunthorn M. Swarm: robust and fast clustering method for amplicon-based studies. PeerJ. 2014;25:e593.
- Tikhonov M, Leach RW, Wingreen NS. Interpreting 16S metagenomic data without clustering to achieve sub-OTU resolution. ISME J. 2015;9:68–80.