Comparison of the three amplicon sequencing processing pipelines
Performance validation of new sequencing and processing strategies is usually done using relatively low complexity synthetic mock communities. However, the goal of this study was to explore variation at the subspecies level. Since it would not be possible to construct a mock community using laboratory strains that accurately mimics the degree of strain variation found in natural communities, clinical samples were used for comparison of pipelines. Longitudinal dental plaque samples were collected from unrelated healthy adult subjects. This study design provided a model system for measuring community stability and inter-individual variation.
Our novel approach included two improvements over standard 16S amplicon pipelines, a more variable target sequence, the ISR, and incorporation of a high-resolution probabilistic error modeling pipeline, DADA2. Oral microbial communities have been well characterized at the species level and well-curated databases are available [46, 51]. Mapping 16S reads against databases has become the standard for community analysis. Without an existing comprehensive ISR database, standard reference-based processing was not possible for the ISR reads. Instead, we processed the ISR sequences with the DADA2 pipeline and compared the results to standard 16S rRNA gene V1-V3 region amplicon sequencing processed by reference database matching and to the same 16S sequences processed using DADA2.
Our goal was to resolve complex microbial communities into the highest possible number of statistically verified true biological variants. As expected, the highest number of sequence variants were inferred from the ISR sequences, with a 5.2-fold increase over species-level taxons. Processing the 16S reads using the DADA2 pipeline also resulted in improvement as compared to the reference-based pipeline, showing a 4.8-fold increase. The relatively small difference between these two approaches was the result of the V8-V9-ISR library being less comprehensive than the V1-V3 library, as discussed below. This suggests that future efforts focused on improving the comprehensiveness of ISR amplification protocols could improve resolving power even more.
Overall, the ISR pipeline discriminated among hosts with the greatest confidence and accuracy and provided the highest resolution for amplicon sequencing-based characterization of bacterial communities. Individuals could be perfectly resolved, with no overlap seen among samples from different subjects in NMDS ordinations and hierarchical clustering dendograms. The 16S high-resolution pipeline did not perform as well as the ISR pipeline, but it did perform better than the 16S reference-based pipeline.
Personalized oral microbiota
This year-long study of five individuals provides the strongest evidence to date that oral microbial communities are highly personalized, as demonstrated by clustering and machine learning-based approaches. The ISR pipeline showed 100% accuracy in predicting the correct source of an oral clinical sample. It is interesting that the subjects were all in nearly daily contact with each other, although none lived together or were related, suggesting that casual contact is not an important determinant of microbial transmission. We had also collected plaque samples from additional subjects, though not for all the time points. Including these samples in the distance-based analysis did not alter the distinctiveness of each subject’s microbiome (data not shown).
Personalization of the oral microbiota has been observed in at least two previous, similar studies, although neither was able to perfectly resolve individuals. In a re-analysis of a 16S rRNA gene sequencing study of longitudinal supragingival plaque dataset, using an “oligotyping” approach based on individual nucleotide positions , the resolution was comparable to that seen in the current study for 16S-DADA2 but less than the ISR-DADA2. In both the Utter study and another study using 16S sequencing and OTU clustering , the accuracy observed for reference-based mapping of 16S was much lower, similar to that observed for 16S reference-based approach in the present study.
In addition to examining differences among subjects as described above, stability over time for each subject was evaluated. Using the ISR approach, for each subject, oral microbial community composition at baseline was compared to that observed at each succeeding time point. Although individuals’ microbial communities exhibited slight drift in membership (Fig. 5) and levels (data not shown) over time, that difference was always smaller than the differences between any two subjects, even after 1 year. In contrast to results using the ISR, with a conventional 16S reference mapping-based pipeline (Fig. 5), intra-subject differences occasionally exceeded inter-subject differences, showing that fine-scale techniques give a more accurate window on community stability.
Our observations of the change in microbial profiles over time, although based on a small sample size (n = 5), are in agreement with the latest findings from the Human Microbiome Project, where a baseline level of intra-individual variation was observed by analyzing strain level communities from metagenomic sequencing data . Overall, our findings with regard to the stability and individuality of the plaque microbiome are similar to those seen in other sites, including tongue and saliva microbiome [54, 55], skin  and gut microbiome [15, 16], and recently in fecal microbiome . This provides further support for the utility of subspecies level microbial community characterization as a method for ‘microbial fingerprinting’ as has been previously suggested .
Exploring subspecies-level population structure and dynamics
The ISR locus provided sufficient diversity to allow exploration of individual population structures for many of the species found in the samples. Species-level taxonomy was assigned to the ISR ASVs using the ISR database described above.
Development of this database will be an ongoing project as new genomes are sequenced. To avoid inclusion of potential amplification artifacts, ASVs were only considered if they were detected in at least two samples, even though this may have resulted in an underestimation of population diversity.
Most of the common oral species exhibited considerable ISR-type heterogeneity, demonstrating a high level of subspecies variation in the oral microbiota of the five subjects that were sampled. Diversity within the broader human population may be much higher. H. parainfluenzae demonstrated the most variation, with 41 ISR types. This is consistent with a recent metagenomic strain profiling showing that this bacterium has high strain diversity compared to other human-associated microbes . Rarefaction analysis showed that sequencing depth was sufficient to estimate population structures for all 15 species we analyzed. Three species were selected for more detailed profiling based on highest diversity and overall abundance, Haemophilus parainfluenzae, Granulicatella adiacens, and Streptococcus mitis (Fig. 6). The degree to which ISR types were shared among subjects, or prevalence, ranged from one S. mitis group variant that was shared by all subjects to a large number of variants from each species that were present in only one subject (Fig. 6b, c). These individually unique variants are major contributors to the ability to discriminate among subjects based on the beta diversity analysis (data not shown). Surprisingly, abundance and prevalence were not well correlated; some of the less abundant ISR types were found in multiple subjects (Fig. 6a, b).
Population dynamics for three species are shown for five subjects in Fig. 7. Many ISR types were retained over 1 year, but considerable fluctuation in levels and frequent loss and gain of ISR types were observed (Fig. 7).
Relationship between the ISR and rest of the genome
While our results show that the ISR technique provides a powerful tool for distance-based beta diversity analysis and tracking variants within bacteria species, the functional significance of ISR-type variability has not yet been established. The relationship between the ISR and the rest of the genome can vary from species to species. One study involving several species of phototrophic bacteria belonging to the order Rhizobiales showed that 70% DNA hybridization was associated with 92% ISR sequence similarity . How closely this correlates with conventional phenotype-based strain designations is not presently fully understood. Our future work in this area will focus on establishing genomic and functional correlations for ISR types.
Certain bacterial species also possess multiple ISR types, which may vary in length and the inclusion of tRNA genes. Although bacterial genomes usually contain multiple copies of the ribosomal operon, for many of the published genomes we analyzed, these multiple copies of the ISR were identical. However, some taxa such as Veillonella spp., Aggregatibacter actinomycetemcomitans, and Porphyromonas gingivalis have multiple non-identical ISR sequences. This information is currently available for a limited number of species and strains. Potential methods for detecting non-identical ISRs might include identifying integer multiples or significant concordance of ISR types within samples.
Comparison between the 16S V1-V3 and 16S V8-V9-ISR libraries
A portion of the 3′-end of the 16S rRNA gene was included in our original ISR amplicon, with the intention of allowing direct mapping of ISR variants to species-level taxa using the V8-V9 region, but the following inherent constraints limited its utility. In Illumina paired-end sequencing, the second sequencing reads are lower quality compared to the first reads. Since the ISR read quality was critical, these were sequenced first, followed by the 16S V8-V9 reads. This resulted in the loss of over 25% of the V8-V9 reads during quality and length filtering. Additionally, species assignments using the V8-V9 region are not as accurate as those based on the well-validated, standard V1-V3 sequences . So we instead developed the ISR database to map the ISR variants to species level taxa and relied on separate amplification of the V1-V3 region for species-level community characterization.
Although it was not ideal for species mapping, the inclusion of the V8-V9 region in the ISR fragment did allow comparison of microbial diversity of the 16S-ISR library with the standard, comprehensive 16S-only library. Mapping the V8-V9 sequences generated a subset of these species-level taxons (222 of the 359) identified from the V1-V3 region in our dataset. Comparison of taxa represented in the two libraries showed that many of the species missing in the V8-V9 library had longer (> 500 bp) ISRs, resulting in very long amplicons (> 800 bp) that are not amplified or sequenced as well as the shorter amplicons from species with smaller ISR regions. Species missing due to the length bias included some potentially clinically important members of the genera Prevotella, Veillonella, Neisseria, Tannerella, and Porphyromonas. However, with the development of the ISR database, it is no longer necessary to include the V8-V9 region for species mapping. A pilot experiment targeting only the informative ISR fragment, using the rD1f primer GGCTGGATCACCTCCTT  in place of the 1237f primer, produced libraries of shorter amplicons that included these missing genera (data not shown). Rothia, Actinomyces, and Fusobacterium were also missing in the 16S V8-V9-ISR library, compared to the 16S V1-V3 library. Examination of genomic sequences showed that, although the V8-V9-ISR amplicons were generated using previously validated universal primers (1237f and EricM primers, Additional file 2: Table S1), the primer contained mismatches for these genera. Conveniently, adoption of the shorter fragment schema described above also solves this problem, as the primers used to target only the ISR (rD1f) generated amplicon libraries which included Rothia and Actinomyces. Fusobacterium was missing due to a mismatch in the distal primer EricM and can be included using a Fusobacterium-specific primer (EricM_fuso: ACCTAGGCATCCTTT). These adjustments parallel the early work done to develop 16S rRNA gene approaches for taxonomic assignment, including both modifications to the universal primer sequences and building of comprehensive databases. Ultimately, refinements should allow the expansion of the phylogenetic range of the ISR approach for broader application in future studies.
Application of the ISR technique
High-resolution community characterization, beyond species level, is currently achievable only by whole metagenome shotgun sequencing. However, the ultra-deep sequencing required to profile strain level communities can be prohibitively expensive and computationally challenging when applied to large-scale surveys of the human microbiome. Our ISR pipeline provides a high-resolution yet cost-effective approach for molecular epidemiology studies by allowing subspecies level tracking of microbiota. This method can be easily integrated into existing amplicon sequencing protocols by simply changing PCR primers, where it could provide the ability to resolve communities into a large number of variants.
This approach could be valuable for studies that address fundamental questions important for oral health. The human oral microbiome is a complex system whose composition is driven by the host and environmental factors, with community alterations at the level of species being implicated in several common diseases such as dental caries and periodontitis. It is unclear if these perturbations in species abundance are simply fluctuations in levels within the existing community or if strains adapted to the changing environment are acquired or lost. Transmission and acquisition of communities is not thoroughly understood either. Insights into these questions have far-reaching implications for therapeutic approaches. Further, the ability to accurately predict the source for a dental plaque sample analyzed using the ISR method could be useful for microbial fingerprinting and forensic tracking applications.