Fig. 1 | Microbiome

Fig. 1

From: Construction of habitat-specific training sets to achieve species-level assignment in 16S rRNA gene datasets

Relationships between the datasets, databases, and training sets used to construct training sets for a specific habitat: the human aerodigestive tract. a Datasets gathered from public repositories or obtained by sequencing new samples are used to explore the 16S rRNA gene diversity of the habitat of interest. These include both full-length 16S rRNA gene sequences and region-specific short-read sequences used for method validation or benchmarking. b A curated, habitat-specific, full-length 16S rRNA gene reference database is assembled and expanded iteratively by selecting from those datasets representative sequences for both named and as-yet unnamed or uncultivated species (i.e., HMTs in eHOMD) and placing them in a phylogenetic tree (see Figure 1 in [20]). c Training sets are derived from the taxonomic hierarchy of the habitat-specific database and enhanced by the following steps: compiling multiple 16S rRNA gene sequences to represent the natural sequence variability of each taxon, trimming the training set to match the sequenced region(s), and placing species that share closely related sequences into a supraspecies taxonomic level. Datasets in gray are the specific examples used to construct the eHOMD-derived training sets described here. Solid arrows indicate where the described sequences come from, and dotted arrows indicate where datasets were used for validation or benchmarking.
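The three enhancement steps in panel c can be illustrated with a minimal Python sketch. It is not the authors' pipeline. It assumes exact-match flanking anchors in place of real (often degenerate) primers, and it treats "closely related" as "identical after trimming", which is a deliberate simplification. The function names, anchors, and species labels are all hypothetical.

```python
from collections import defaultdict


def trim_to_region(seq, fwd_anchor, rev_anchor):
    """Return the stretch between two exact-match anchors, or None if either is absent."""
    start = seq.find(fwd_anchor)
    if start == -1:
        return None
    start += len(fwd_anchor)
    end = seq.find(rev_anchor, start)
    if end == -1:
        return None
    return seq[start:end]


def build_training_set(reference, fwd_anchor, rev_anchor):
    """reference maps a species name to a list of full-length sequence variants."""
    # Steps 1 and 2: keep every variant of every species, trimmed to the region.
    species_by_region = defaultdict(set)
    for species, seqs in reference.items():
        for seq in seqs:
            region = trim_to_region(seq, fwd_anchor, rev_anchor)
            if region is not None:
                species_by_region[region].add(species)

    # Step 3: merge species that share a trimmed sequence (assumption: identity
    # stands in for "closely related"). Union-find makes the merge transitive.
    parent = {species: species for species in reference}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for group in species_by_region.values():
        first, *rest = sorted(group)
        for other in rest:
            parent[find(other)] = find(first)

    members = defaultdict(set)
    for species in reference:
        members[find(species)].add(species)

    # Label each trimmed sequence with its species or supraspecies name.
    training = defaultdict(set)
    for region, group in species_by_region.items():
        root = find(next(iter(group)))
        label = "/".join(sorted(members[root]))
        training[label].add(region)
    return {label: sorted(regions) for label, regions in training.items()}


# Toy input: species_A has two variants, one of which is identical to
# species_B within the trimmed region, so the two are merged.
toy_reference = {
    "species_A": ["GGAAACCCTTTGG", "GGAAACCGTTTGG"],
    "species_B": ["TTAAACCGTTTCC"],
    "species_C": ["AAAGGGTTT"],
}
toy_training = build_training_set(toy_reference, "AAA", "TTT")
```

In this toy case species_A and species_B collapse into one supraspecies label that carries both trimmed variants, while species_C keeps its own label. A realistic version would replace the identity check with a sequence-similarity threshold and handle primer degeneracy.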
