Fig. 2 | Microbiome

Fig. 2

From: Construction of habitat-specific training sets to achieve species-level assignment in 16S rRNA gene datasets

Schematic representation of the steps to generate sequential habitat-specific training sets. a The FL_eHOMDrefs_TS training set contains all full-length eHOMDrefs (thick lines) from eHOMDv15.1 together with their respective taxonomic assignments. When only one read represents each taxon (M = 1), a given distinguishing k-mer (green fragment) can only be either present (1) or absent (0). b A higher number of sequences per taxon (M) allows for better resolution in the assignment, because the presence of a given distinguishing k-mer (w_i) across each cluster of reads (green fragments) is represented as a proportion: the number of reads containing it (m) out of the total number of reads in that taxon (M). Therefore, to better represent the known sequence diversity of the 16S rRNA gene(s) for each taxon, the training set FL_Compilation_TS includes clusters of sequences (thin lines) recovered from the NCBI nonredundant nucleotide (nr/nt) database that matched each eHOMDref (thick line) with 99% identity and ≥ 98% coverage (see Methods). c The training set V1V3_Raw_TS is a V1–V3 trimmed version of the FL_Compilation_TS training set. The schematic illustrates how trimming to this region leads to identical reads (purple lines) having two different taxonomic designations. Here, G is the genus, and species are labeled A or B. d To construct the V1V3_Curated_TS training set, identical V1–V3 sequences in the V1V3_Raw_TS training set were collapsed into one. If identical sequences came from more than one taxon (purple), the species-level names of all taxa involved were concatenated (AB). e The V1V3_Supraspecies_TS training set includes the same sequences as the V1V3_Raw_TS training set; however, the headers in the fasta file include the supraspecies taxon (AB) as an extra level between the genus (G) and species taxonomic levels (A, B, or AB), as illustrated here.
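The two ideas in panels b and d can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names (`kmer_proportion`, `collapse_identical`), the `(genus, species, sequence)` record layout, the toy sequences, and the assumption that identical sequences always share a genus are all hypothetical choices made for illustration.

```python
from collections import defaultdict


def kmer_proportion(reads, kmer):
    """Return m / M: the fraction of a taxon's reads that contain `kmer`.

    With a single read per taxon (M = 1) the result can only be 0.0 or 1.0.
    """
    if not reads:
        raise ValueError("a taxon needs at least one read")
    m = sum(1 for read in reads if kmer in read)  # reads containing the k-mer
    return m / len(reads)                         # M = total reads in the taxon


def collapse_identical(records, sep=""):
    """Collapse identical sequences and concatenate their species names.

    `records` is an iterable of (genus, species, sequence) tuples.
    Assumption: identical sequences belong to the same genus, as in the figure.
    Returns {sequence: (genus, concatenated_species_label)}.
    """
    species_by_seq = defaultdict(set)
    genus_by_seq = {}
    for genus, species, seq in records:
        species_by_seq[seq].add(species)
        genus_by_seq.setdefault(seq, genus)
    return {
        seq: (genus_by_seq[seq], sep.join(sorted(names)))
        for seq, names in species_by_seq.items()
    }


# Toy trimmed reads: species A and B share one identical sequence.
raw_records = [
    ("G", "A", "ACGT"),
    ("G", "B", "ACGT"),
    ("G", "A", "ACGA"),
    ("G", "A", "ACGA"),
]

curated = collapse_identical(raw_records)
# curated == {"ACGT": ("G", "AB"), "ACGA": ("G", "A")}

proportion = kmer_proportion(["ACGT", "ACGA", "TTTT", "ACGG"], "ACG")
# proportion == 0.75 (m = 3 of M = 4 reads contain the k-mer)
```

Species names are sorted before joining so that the concatenated label does not depend on the order in which the records are read; the `sep` argument is left empty here only to reproduce the "AB" label used in the schematic.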
