Integrating compositional and functional content to describe vaginal microbiomes in health and disease

Background A Lactobacillus-dominated vaginal microbiome provides the first line of defense against adverse genital tract health outcomes. However, there is limited understanding of the mechanisms by which the vaginal microbiome modulates protection, as prior work mostly described its composition through morphologic assessment and marker gene sequencing methods that do not capture functional information. To address this gap, we developed metagenomic community state types (mgCSTs) which use metagenomic sequences to describe and define vaginal microbiomes based on both composition and functional potential. Results MgCSTs are categories of microbiomes classified using taxonomy and the functional potential encoded in their metagenomes. MgCSTs reflect unique combinations of metagenomic subspecies (mgSs), which are assemblages of bacterial strains of the same species, within a microbiome. We demonstrate that mgCSTs are associated with demographics such as age and race, as well as vaginal pH and Gram stain assessment of vaginal smears. Importantly, these associations varied between mgCSTs predominated by the same bacterial species. A subset of mgCSTs, including three of the six predominated by Gardnerella vaginalis mgSs, as well as mgSs of L. iners, were associated with a greater likelihood of bacterial vaginosis diagnosed by Amsel clinical criteria. This L. iners mgSs, among other functional features, encoded enhanced genetic capabilities for epithelial cell attachment that could facilitate cytotoxin-mediated cell lysis. Finally, we report a mgSs and mgCST classifier for which source code is provided and may be adapted for use by the microbiome research community. Conclusions MgCSTs are a novel and easily implemented approach to reduce the dimension of complex metagenomic datasets while maintaining their functional uniqueness. MgCSTs enable the investigation of multiple strains of the same species and the functional diversity in that species. Future investigations of functional diversity may be key to unraveling the pathways by which the vaginal microbiome modulates the protection of the genital tract. Importantly, our findings support the hypothesis that functional differences between vaginal microbiomes, including those that may look compositionally similar, are critical considerations in vaginal health. Ultimately, mgCSTs may lead to novel hypotheses concerning the role of the vaginal microbiome in promoting health and disease, and identify targets for novel prognostic, diagnostic, and therapeutic strategies to improve women’s genital health. Video Abstract Supplementary Information The online version contains supplementary material available at 10.1186/s40168-023-01692-x.


Background
The vaginal microbiome plays a vital role in gynecological and reproductive health.Lactobacillus predominated vaginal microbiota constitute the first line of defense against infection.Protective mechanisms include lactic acid production by Lactobacillus spp., which acidifies the vaginal microenvironment and elicits anti-inflammatory effects [1][2][3][4].This environment wards off non-indigenous organisms, including causative agents of sexually transmitted infections (STIs) like HIV, and bacteria associated with bacterial vaginosis (BV) [5][6][7].However, vaginal Lactobacillus spp.are functionally diverse.For example, L. crispatus, and L. gasseri are capable of producing both the D-and L-isomers of lactic acid, L. jensenii produces only the D-isomer, and L. iners only the L-isomer [4,8].These key features have implications for susceptibilities to pathogens [9,10].
The vaginal microbiota has been previously shown to cluster into community state types (CSTs) that reflect differences in bacterial species composition and abundance [1,11].Lactobacillus spp.predominate four of the five CSTs (CST I: L. crispatus; CST II: L. gasseri; CST III: L. iners, CST V: L. jensenii).In contrast, CST IV communities are characterized by a paucity of lactobacilli and the presence of a diverse array of anaerobes such as G. vaginalis and "Ca.Lachnocurva vaginae".CST IV is found, albeit not exclusively, during episodes of BV, a condition associated with an increased risk of sexually transmitted infections, including HIV, as well as preterm birth and other gynecological and obstetric adverse outcomes [12][13][14][15][16][17][18][19][20].BV is clinically defined by observing 3 of 4 Amsel's criteria (Amsel-BV; vaginal pH > 4.5, abnormal discharge, and on wet mount, presence of clue cells and fishy odor with 10% KOH) [21].Patients presenting with symptoms and satisfying Amsel's criteria (symptomatic Amsel-BV) are treated with antibiotics, however, efficacy is poor, and recurrence is common [21][22][23][24].In research settings, BV is often defined by scoring Gram-stained vaginal smears (Nugent-BV) [25] or molecular typing of bacterial composition by sequencing marker genes (molecular-BV) [26].There is no definition of BV that relies on both the composition and functional potential of the microbiome.
Species-level composition of the vaginal microbiota may not suffice to accurately capture associations between the vaginal microbiome and outcomes of interest because functional differences exist between strains of the same species.For example, in the skin microbiome, strains of Staphylococcus aureus or Streptococcus pyogenes elicit different acute immune responses [27].Similarly, genomic and functional analyses of Lactobacillus rhamnosus strains demonstrate distinct adaptations to specific niches (for example, the gut versus the oral cavity) [28].While functional differences likely exist between strains of the same species in the vaginal microbiota, metagenomic studies show that combinations of multiple strains co-exist within a single vaginal microbiome [29,30].These strain assemblages have been defined as metagenomic subspecies or mgSs [29], and are important to consider as they potentially impact the functional diversity and resilience of a species in a microbiome.Determining the mechanistic consequences and health outcomes associated with metagenomic subspecies may improve the precision of risk estimates and interventions.
To integrate the taxonomic composition and functional potential of vaginal microbiomes, we developed metagenomic community state types (mgCSTs).MgCSTs are composed of unique combinations of mgSs.We developed and validated a two-step classifier that assigns metagenomic subspecies and mgCSTs and is designed to work in concert with the vaginal non-redundant gene database, VIRGO [29].This classifier may facilitate reproducibility and comparisons across studies.

Table 1 Characteristics of study participants
For each data category (age, race, etc.), the total number of women and samples are noted.Percentages represent the proportions of women or samples within each category.Some women contributed multiple samples Vaginal Non-redundant Gene Database (VIRGO, virgo.igs.umaryland.edu)[29], the University of Maryland Baltimore Human Microbiome Project (UMB-HMP, PRJNA208535, PRJNA575586, PRJNA797778), the National Institutes of Health Human Microbiome Project (NIH-HMP, phs000228), Li et al. [31]    Age was also reported for 1623 samples from 897 individuals and ranged from 15 to 45 years old.After adjusting for between-cohort heterogeneity, certain races and age categories were associated with mgCSTs (Fig. 2).The vaginal microbiomes of Black women were more likely to be classified as Gardnerella mgCST 22 (p = 0.0006) and least likely to be in L. crispatus mgCST 1 (p = 0.005) as compared with microbiomes for other races (Table S2).

Functional potential of mgCSTs and metagenomic subspecies L. crispatus mgCSTs differ by species diversity, stability, and the potential to produce D-lactic acid
L. crispatus is known to produce both L-and D-lactic acid, which acidifies the vaginal environment and confers protective properties [4,10,33,34].Metagenomic analyses revealed differences among L. crispatus mgSs.First, VIRGO identified two L-and two D-lactate dehydrogenase genes in L. crispatus.All genes were present in L. crispatus mgSs except for mgSs 2. MgSs 2 was missing a D-lactate dehydrogenase gene (V1806611) that has a 96.1% identity to a functionally validated ortholog, P30901.2 (Fig. 5a) [35].The other D-lactate dehydrogenase, V1891370, is found in all L. crispatus mgSs but only 82.4% identical to P30901.2 because it contains a 55 aa insertion after V101 (position in P30901.2) and a point mutation at position 218 (D218Y) located within a NAD binding site domain, the functional consequences of which are unknown.Because V1806611 is most similar to P30901.2, its absence may influence the production of D-lactic acid.Second, in mgSs 2, 4, and 6 60-70% of samples were significantly more likely to be from the high vaginal pH category compared to 41% of samples with any L. crispatus mgSs (Fig. 5b).Third, mgSs 2 had significantly fewer estimated numbers of L. crispatus strains compared to other L. crispatus mgSs (Fig. 5c).Fourth, mgSs 2 and 4 were on average more compositionally diverse than other mgSs (Fig. 5d).Lastly, mgCST 1 (dominated by mgSs 1) was significantly more longitudinally stable than mgCST 2 or 3 as defined using Yue-Clayton's θ (Fig. 5e).Overall, these differences reveal important genetic and functional differences among vaginal microbiomes primarily comprised of L. crispatus which could be important in understanding the role of L. crispatus in the vaginal microbiome.

Automated classification of mgCSTs using random forest models
Random forest models were built for each of the 135 mgSs identified and used to perform mgSs assignments (see "Methods" section).There was good concordance between mgSs assigned by hierarchical clustering of Jensen-Shannon distances and random forest-based assignments with κ > 0.8 for most species (Figure S3a).
Ten-fold cross-validation of the classifier revealed the misclassification error for mgSs assignment ranged from 0 to 30% (Figure S2c mgSs).The error estimates for most major vaginal taxa were near or less than 10%, with L. gasseri having the lowest (2.2%).L. iners consistently provided high misclassification error estimates (20%) regardless of attempts to fine-tune the model and was likely the result of high genetic homogeneity between and/or heterogeneity within L. iners mgSs.Following the assignment of mgSs, mgCSTs were assigned using the nearest centroid classification method, as previously used for vaginal taxonomy-based community state-type assignments [11].We observed good concordance between Fig. 7 The proportions of Gardnerella genomospecies in a sample differ by Gardnerella mgSs (a).For samples with Amsel-BV evaluations, Amsel-BV status is indicated below each sample (column).b Gene clusters of Gardnerella contain genes attributed to a variety of Gardnerella genomospecies (as indicated by colored bars, black bars indicate unknown genomospecies).c The total number of clinical evaluations per Gardnerella gene cluster is indicated.The dark gray bars indicate the proportion of negative Amsel-BV diagnoses, and the colored bars indicate the positive Amsel-BV diagnoses.Within-gene group proportions were statistically compared to the study-wide proportions of Amsel-BV for any samples containing a Gardnerella gene cluster.(*p < 0.05, **p < 0.01, ***p < 0.001) mgCSTs assigned by hierarchical clustering of Jensen-Shannon distances and nearest centroid-based assignments (κ = 0.78, Figure S3b).Ten-fold cross validation of centroid classification revealed the mean classification error was 9.6%, with some mgCSTs classified more accurately than others (Figure S4).
Three external, publicly available metagenomic datasets illustrate the generalizability of the mgCSTs assignment (Fig. 8).Most samples produced similarity scores > 0.5 indicating high similarity to the reference centroid of the assigned mgCST (Fig. 8a).Samples in each dataset were distributed across mgCSTs (Fig. 8b, primary y-axis).In all datasets, the lowest similarity scores were observed in mgCST 27 (Fig. 8b, secondary y-axis).
The source code for the mgCST classifier is an R script and is available at https:// github.com/ ravel-lab/ mgCSTclass ifier and uses direct outputs from VIRGO.

Discussion
Recent findings that motivated the development of mgCST classification are that multiple strains of the same species are commonly observed in the vaginal microbiome [29], and that samples can be clustered into metagenomic subspecies defined by unique strain combinations represented by species-specific gene sets, and thus unique sets of functions.These critical observations led us to conceptualize a vaginal microbiome classification based on their mgSs compositions and abundance and thus defined by both species' composition and functions, i.e., metagenomic community state types.MgCSTs describe vaginal microbiomes through a new lens, one that includes both compositional and functional dimensions.
L. iners-predominated vaginal microbiota have been associated with an increased risk of experiencing bacterial vaginosis (BV) [41,42].Longitudinal observational prospective studies support this conclusion and present several critical findings: (1) L. iners is often detected at low to medium abundances during episodes of BV [43], (2) L. iners commonly dominate the vaginal microbiota after metronidazole treatment for BV [44], and, (3) L. iners predominated vaginal microbiota are more prevalent prior to incidence of BV [41,43].However, L. iners is not always associated with an increased risk of negative outcomes and its role in the vaginal microbiome has been widely debated [36,[45][46][47][48]. MgCST classification provides insight into the contradiction to prevailing dogma regarding L. iners and increased risk of BV.L. iners mgSs 4 was associated with Amsel-BV, while L. iners mgSs 3 (predominates mgCST 12) was significantly associated with negative BV diagnoses.This is the first evidence of genetically distinct combinations of L. iners strains (mgSs) in healthy versus BV-like states.This critical finding points to the possibility of beneficial L. inersdominated microbiomes that had not been evidenced previously.A L. iners dominated microbiota is frequently observed among Black and Hispanic women (30 and 35%, respectively) [1,11] and, among these groups, the Fig. 8 Three external metagenomic datasets were processed with VIRGO [29] and assigned mgCSTs using the mgCST classifier.a Most samples were statistically similar to the reference centroid of the assigned mgCST.b Samples in each dataset were distributed across mgCSTs.In all cases, the lowest similarity scores were observed in mgCST 27 incidence of BV is higher than White or Asian women [49,50] and L. iners is associated with the prevalence and incidence of BV [41,51].In the vaginal microbiomes of Asian women, L. iners predominated communities are also frequently detected with multiple studies citing a prevalence near 40% (35%, n = 22 in this study, ca.40%, n = 38 in France et al. [11], 42.7%, n = 24 in Ravel et al. [1] and 45%, n = 30 of women in Xu et al. [52]), and there is a lower prevalence and incidence of BV but a positive association between L. iners and BV has yet to be identified [49,50,52,53].MgSs and mgCST classification may provide insight that would need to be further demonstrated into the contradictory associations of L. iners and increased risk of BV in different racial groups.For example, of all L. iners mgCSTs (10-14), Asian and White women in this study were most likely to be in mgCST 12 (predominated by L. iners mgSs 3), while those from Black women were distributed similarly across L. iners mgCSTs even among samples with low Nugent scores.We hypothesize that L. iners mgCST 12 might represent a more stable ecological network of strains of L. iners.Along the same lines, multiple studies among Asian individuals do not find "Ca.Lachnocurva vaginae" as a prevalent member of the vaginal microbiota during BV [52,53].Together, these observations may have implications for the disproportionate burden of BV among racial groups.For example, selective pressures by the host environment may result in niche specialization by vaginal bacteria.Sources of selective pressure could relate to host-provided nutrient availability (e.g., mucus glycan composition), the host's innate and adaptive immune system, the circulation of other species' mgSs in a population, or any such combination.However, because race is a social construct influenced by many other factors including stress, socioeconomic status, and education level [54,55], it is difficult to interpret "racial" differences in the vaginal microbiome and specific hypotheses should be tested to better understand the findings reported here.
Our analyses also identified a specific set of L. iners genes associated with positive Amsel-BV diagnoses.Macklaim et al. 2018 reported marked differences in L. iners gene expression between two control patients versus two diagnosed with BV, including increased CRISPRassociated proteins gene expression in BV samples [56].However, our mgSs analysis of L. iners indicates that it is not simply alterations in gene expression of a common gene pool that differentiates BV from non-BV microbiomes, but L. iners mgSs that also differ.Microbiomes from women with Amsel-BV diagnoses were enriched for host immune response evasion and hostcolonization functions by L. iners.For example, serine/ threonine-protein kinases (STPKs) contribute to resistance from phagocytosis by macrophage, invasion of host cells including epithelia and keratinocytes, antibiotic resistance, disruption of the NF-κB signaling pathway, and mucin binding [57].Bacteria attached to host cells (clue cells) is a hallmark of high Nugent scores (a bacterial morphology-based definition of bacterial vaginosis) and a criterion in Amsel-BV diagnoses [21,25].L. iners can appear as Gram-variable cocci (like G. vaginalis) or rods [46,58], and our data suggest that certain strains of L. iners (specifically those containing gene cluster 6) may adhere to epithelial cells, contributing to the appearance of clue cells.In addition, epithelial cell adherence could make certain L. iners strains more difficult to displace in the vaginal environment and contribute to the common observation of L. iners following antibiotic treatment [59].
Several distinct mgCSTs are associated strongly with Amsel-BV.Critically, these data support the need for an improved definition of BV and the importance of a personalized approach to treatment."Ca.Lachnocurva vaginae" predominated mgCSTs were strongly associated with asymptomatic Amsel-BV and contained more high Nugent scores than other mgCSTs.Conversely, intermediate Nugent scores were most prevalent in Gardnerellapredominated mgCSTs, and only three of these six mgCSTs were associated with Amsel-BV, which suggests that not all Gardnerella-dominated microbiomes are related to Amsel-BV.Gardnerella contains vast genomic diversity, supporting a split into different genomospecies [40,60,61].Because different genomospecies can co-exist our data show that Gardnerella predominated mgSs represent unique combinations of genomospecies and strains of these genomospecies.MgCSTs 20-22 contain high Gardnerella genomospecies diversity and were associated with positive Amsel-BV diagnoses in studies using qPCR or transcriptomic data to define Gardnerella species [31,32,60].Our data corroborate these reports.We hypothesize that in mgCSTs with higher numbers of Gardnerella genomospecies that there are more gene variants coding for virulence factors like cholesteroldependent pore-forming cytotoxin vaginolysin and neuraminidase sialidase present, thus expanding functional redundancy of these enzymes and potentially contributing to the association with positive Amsel-BV diagnoses [62][63][64].However, mgCST 24 which is comprised of Gardnerella mgSs 4 (and specifically genomospecies G. vaginalis and G. swidsinkii), has relatively lower Gardnerella genomospecies diversity and was also highly associated with Amsel-BV and symptomatic Amsel-BV.Together these data suggest that enumeration and classification of Gardnerella genomospecies may prove to be an important diagnostic of different "types" of Amsel-BV which could inform treatment options.For example, it is possible that harboring more Gardnerella genomospecies may predict BV recurrence following metronidazole treatment, suggesting the need for a different approach to treatment.Alternatively, some Gardnerella genomospecies may be important and novel targets of therapy.
Antibiotic treatment is indicated for BV diagnosis by Amsel clinical criteria when 3 of 4 criteria are observed, which is estimated to occur in fewer than half of women with BV [24,65,66].In research settings, both symptomatic and asymptomatic Amsel-BV can be evaluated.Indeed, in the observational research studies included in this analysis where Amsel criteria were evaluated along with whether participants reported symptoms or not, symptomatic Amsel-BV accounted for only 12% of Amsel-BV cases and 30% of these were in mgCST 24 (dominated primarily by G. swidsinkii and G. vaginalis).We hypothesize that the high prevalence of BV recurrence post-treatment may be due to the heterogeneity in the genetic makeup of the microbiota associated with BV as revealed by mgCSTs.MgCSTs reduce this heterogeneity resulting in more precise estimates of risk.Furthermore, these findings highlight the potential importance of developing specialized treatments that target "types" of BV.
The mgCST framework can also be used to identify vaginal microbiomes that are associated with positive health outcomes.For example, mgCSTs are predominated by different L. crispatus mgSs varied in their association with low Nugent scores, the number of L. crispatus strains present, and the longitudinal stability of communities.The vaginal microbiome can be dynamic [67][68][69].Shifts from Lactobacillus to non-Lactobacillus predominated microbiota can increase the risk of infection following exposure to a pathogen.Our study identified L. crispatus mgCSTs with variable stability, suggesting that not all L. crispatus predominated microbiomes are functionally similar and may be differently permissive to infection.Those found to be associated with higher stability may reduce the window of opportunity for pathogens to invade.Microbiome stability may be related to both the diversity of other non-Lactobacillus members of the microbiome and/or the number of L. crispatus strains present.In any case, our study shows that there is a range of protective abilities even among L. crispatus predominated communities.This information could be critical in selecting and assembling strains of L. crispatus to design novel live biotherapeutics products aimed to restore an optimal vaginal microenvironment.
Data are still emerging about what factors contribute to vaginal strain assemblages and what rules define their biology and ecology.However, such assemblages can now be detected and further characterized using the concepts of mgSs and mgCSTs presented here.The use of metagenomic sequencing and mgSs and mgCSTs will contribute to a much-needed functional understanding of the role of the vaginal microbiome in reproductive health outcomes.Our findings support the hypothesis that genetic and functional differences between vaginal microbiomes, including those that may look compositionally similar, are critical considerations in vaginal health [7].To aid in further exploration, we also provide a validated classifier for both mgSs and mgCSTs at https:// github.com/ ravellab/ mgCST-class ifier/ blob/ main/ README.md.The classifiers are dependent upon VIRGO and therefore rely on VIRGO-based taxonomic and gene function annotations.Additionally, as noted in the validation of external studies, some mgCSTs may be cohort-specific and thus may not be generalizable for all studies.

Conclusion
MgCSTs reveal differences between vaginal microbiomes both compositionally and functionally, and thus more finely describe the vaginal microbiome.Associations between mgCSTs and bacterial vaginosis highlight the multi-faceted aspects of the condition and call for new and expanded definitions.Furthermore, mgCSTs have the potential to predict clinical outcomes such as recurrent BV and develop specialized treatments that target different types of BV.Further, we provide tools for the classification of mgSs and mgCST that have the potential for use and harmonization of analytical strategies in future studies.

Study cohorts
Raw metagenomic data from 1890 vaginal samples were used in this study (Supplementary file S6).This included publicly available metagenomes including those used in the construction of the vaginal non-redundant gene database, VIRGO (virgo.igs.umaryland.edu,n = 342) [29], the University of Maryland Baltimore Human Microbiome Project (UMB-HMP, n = 677, PRJNA208535, PRJNA575586, PRJNA797778), the National Institutes of Health Human Microbiome Project (NIH-HMP, n = 174, phs000228 [70]), metagenomes from Li et al. [71] (n = 44, PRJEB24147), the Longitudinal Study of Vaginal Flora and Incident STI (LSVF, n = 653, dbGaP project phs002367).All samples in LSVF (n = 653) and some in UMB-HMP (n = 20) had clinical diagnosis information about Amsel-BV.Amsel-BV was diagnosed based on the presence of 3 out of 4 Amsel's criteria [21] and in this study, Amsel-BV was classified as symptomatic Amsel-BV when a participant self-reported complaints of vaginal discharge, irritation, itching, burning, foul odor, and other [43,55].At the time of these studies, gender identity information was not collected.We know all women responded to recruiting materials which included "women" or "woman".In addition, individuals are referred to as women in previous publications, thus we refer here to individuals as "woman" or "women" to maintain consistency.Cohorts used in this study are described in Supplementary file S9.

Sequence processing and bioinformatics
Host reads were removed from all metagenomic sequencing data using BMTagger and the GRCh38 reference genome, and reads were quality filtered using trimmomatic (v0.38, sliding window size 4 bp, Q15, minimum read length:75 bp) [72].Metagenomic sequence reads were mapped to VIRGO using bowtie (v1; parameters: -p 16 -l 25 --fullref --chunkmbs 512 --best --strata -m 20 --suppress 2,4,5,6,7,8), producing a taxonomic and gene annotation for each read.Samples with fewer than 100,000 mapped reads were removed from the analysis (n = 59).The number of reads mapped to a gene was multiplied by the read length (150 bp) and divided by the gene length to produce a coverage value for each gene.Conserved domain and motif searches were performed with CD-SEARCH and the Conserved Domain Database (CDD), using an e-value threshold of 10 −4 .The taxonomic composition table generated using VIRGO was run through the vaginal CST classifier VALENCIA [11].

Metagenomic subspecies
For each species, a presence/absence matrix was constructed from a metagenome which included all genes with at least 0.5× average coverage after normalizing for gene length.Metagenomic subspecies were generated for species present (> 75% estimated median number of genes encoded in reference genomes from the Genome Taxonomy Database [73], see Table S3) in >20 samples using binary gene counts and hierarchical clustering with Ward linkage of sample Jaccard distances calculated using the vegdist function from the vegan package (v2.5-5)[74] in R (v. 3.5.2).MgSs were defined using the dynamic hybrid tree cut method (v.1.62-1)and minClusterSize = 2 [75].Heatmaps of gene presence/ absence were constructed for each species using the gplots package heatmap.2function [76] (Supplementary file S5).MgSs were tested for associations with low species coverage using logistic regression in which the mgSs was the binary outcome and the log 10 -transformed coverage of the species was the predictor.Tests were done at the participant level; if a participant had more than one sample and both samples were the same mgSs, only one sample was used, but if the mgSs differed, the samples were included in each.P-values were adjusted for multiple comparisons using Bonferroni correction.Significant dependence was observed in multiple mgSs of Atopobium vaginae, Gardnerella, and Lactobacillus iners (Table S4).For these species, the classifiers were built using samples with ≥ 5.5e5 reads.The cluster stability of each mgSs was evaluated using the clusterboot function of the R package fpc (v 2.2-10) [77,78] and 100 bootstraps.

Metagenomic CSTs
Using gene abundance information (normalized by gene length and sequencing depth), we estimated the proportion of vaginal species in each sample.For species that were sub-divided into mgSs, the mgSs proportion in a sample was equal to the proportion of the species in that sample.When a species was present in a sample but with too few genes present to constitute a mgSs (< 75% estimated median number of genes encoded in reference genomes), it was labeled as "mgSs 0".Samples in the resulting compositional table were hierarchically clustered using Jensen-Shannon distances.Clusters were defined using the dynamic hybrid tree cut method (v.1.62-1)[75].A heatmap for metagenomic CSTs was produced using the gplots package heatmap.2function (Fig. 1) [76].For each mgCST, the mgSs most frequently observed (prevalence) and the mgSs with the greatest mean abundance were noted.Cluster stability was evaluated with the clusterboot function from the fpc package (v.2.2-10) using ward linkage of Jensen-Shannon distances and 100 bootstraps [77].Cluster stability ≥ 0.75 is considered high stability.

Statistical analysis of the association between mgCST and age, race, Nugent score, vaginal pH, and BV
For those samples with race, age category (15-20, 21-25, 56-30, 31-35, 36-40, 41-45 years old), Nugent score category (0-3, 4-6, 7-10), vaginal pH category (pH < 4.5 or pH ≥ 4.5), or Amsel-BV diagnoses information (Table 1), the Cochran-Mantel-Haenszel Chi-Squared Test (CMH test, "mantelhaen.test"from the samplesizeCMH R package, v 0.0.0,github.com/pegeler/samplesizeCMH)was used to determine associations with mgCSTs while accounting for source study (the confounding variable).The CMH test evaluates associations between two binary variables (i.e., "mgCST X or not" and "high Nugent score or not").Tests were done at the participant level; if a participant had more than one sample and both samples were the same mgCST, only one sample was used, but if the mgCSTs differed, the samples were included in each.

Statistical analysis of the association between mgSs, gene clusters, and BV
For both L. iners and Gardnerella, associations between mgSs and Amsel-BV were evaluated at the participant level using chi-square analyses which compared the proportion of Amsel-BV positive to negative participants within mgSs to those in all participants containing any L. iners.For both L. iners and Gardnerella, associations between gene clusters and Amsel-BV were evaluated at the participant level using chi-square analyses which compared the proportion of Amsel-BV positive to negative participants within a gene cluster to those in all participants containing any L. iners or Gardnerella gene cluster, respectively.Gene cluster presence in a sample was defined as the presence of ≥ 30% of genes in a gene cluster.

Longitudinal stability and Shannon diversity of L. crispatus mgSs
For participants in the HMP cohort that contributed multiple samples with at least one sample assigned to an L. crispatus mgSs, the Yue-Clayton θ was measured to define microbiota stability for each participant [79].Here, 16S rRNA gene amplicon sequencing-based CSTs from all samples from a participant [68] were used to produce a reference centroid, and then each sample was compared to that reference (Yue-Clayton's θ).The mean θ for each participant represented the overall microbiota compositional stability.Values closer to 1 indicate high compositional stability.The number of strains in each sample was compared between mgSs using the Wilcoxon signed rank test.Shannon's H diversity index was calculated for each sample using the vegan package diversity function.Shannon Diversity was compared between mgSs using the Wilcoxon signed rank test.

Estimating the number of L. crispatus strains
The number of L. crispatus strains in a mgSs was estimated using a pangenome accumulation curve which was generated by mapping the gene contents of publicly available isolate genome sequences (Table S3) to VIRGO (blastn, threshold: 90% identity, 70% coverage).Bootstrap (n = 100) combinations of N (N = 1 to 61) isolates were selected and the number of unique L. crispatus Vaginal Orthologous Groups (VOGs; provided in the VIRGO output [29]) encoded in their genomes was determined.An exponential curve relating the number of isolates to the number of VOGs detected was then fit to the resulting data and produced the equation: Y = 2057N 0.14 where Y is the number of L. crispatus VOGs detected, and N is the estimated number of strains.This equation was then used to estimate the number of L. crispatus strain's detected in each metagenome based on the observed number of L. crispatus VOGs in each metagenome.The number of strains in each sample was compared between mgSs using the Wilcoxon signed rank test.

Construction of the random forests for mgSs classification
We constructed random forests for the classification of mgSs using the R package randomForestSRC v2.12.1R [80].For mgSs, a random forest was built for each species (n = 28) where the training data contained presence/ absence values of genes.Gene presence was defined as above for mgSs.We implemented random forest classification analysis with all predictors included in a single model.For each mgSs random forest, predictors were all genes in a species.Ten-fold cross-validation (90% of data as training, 10% as testing) was performed wherein each training set was used to build and tune a random forest model using tune "tune.rfsrc".A random forest model using optimal parameters was then used to predict mgSs classifications for the test set and out-of-bag error estimates (misclassification error) were reported.The overall misclassification error is the average misclassification error from each fold and the "correct" assignment is based on the original hierarchical clustering assignment.The final models included all data and the optimal tuning parameters determined for that species.For mgSs assignment, the mgSs which provide the highest probability (based on the proportion of votes in the tree) is used for assignment.The user is provided with both the assignment, as well as the probability of that assignment as a measure of confidence.

Construction of the nearest centroid classifier for mgCSTs
Using mgCSTs as defined above, reference centroids were produced using the mean relative abundances of each mgSs in a mgCST.For classification, the similarity of a sample to the reference centroids is determined using Yue-Clayton's θ [79].Compared to Jensen-Shannon, the Yue-Clayton θ measure depends more on the high relative abundance metagenomic subspecies than those at lower relative abundances.Samples are assigned to the mgCST to which they bear the highest similarity and the degree of similarity to that mgCST can be taken as a measure of confidence in the assignment.Ten-fold crossvalidation was applied wherein each training set was used to build "reference" centroids and each test set was used for assignment.The misclassification error was determined by subtracting the number of correct assignments (based on the original hierarchical clustering assignment) divided by the total number of assignments from 1.The overall misclassification error is the average of misclassification error from each fold.

Running the mgCST classifier
The required inputs are direct outputs from VIRGO [29] and include the taxonomic abundance table ("summary.Abundance.txt")and gene abundance table ("summary.NR.abundance.txt").It is imperative that taxonomic and gene column headings match those output by VIRGO.The expected output is a count table with samples as rows, taxa as columns, and counts normalized by gene length as values.Additional columns indicate the sample mgCST classification and the Yue-Clayton similarity score for all 27 mgCSTs.A heatmap is also produced showing taxon relative abundances in samples, where samples are labeled with assigned mgCSTs substantial differences may indicate either an incongruence in taxonomic or gene names or the need for an additional mgCST.The classifier is contained in an R script, which is available at https:// github.com/ ravel-lab/ mgCST-class ifier.MgSs and mgCST classifications were robust at sampling depths greater than 100,000 reads per sample (Supplementary file S10).

Validation of the mgCST classifier in external datasets
Three external, publicly available vaginal metagenome datasets were used to validate the generalizability of mgCST assignments beyond the training dataset.ENA PRJEB34536 [81], NIH PRJNA576566 [82], and NIH PRJNA779415 [83].Briefly, host reads were removed using BMTagger [84] and the GRch38_p12 human reference genome.Ribosomal RNA reads were removed using sortmerna [85] and reads were quality filtered using fastp [86] with a minimum length of 50 bp, and a mean quality of 20 in a sliding window of 4 bp.The remaining reads were processed through VIRGO using default settings (https:// virgo.igs.umary land.edu) [29] and summary tables were used to assign mgSs and mgCSTs with the mgCST Classifier.The generalizability of the mgSs and mgCST reference datasets is illustrated using the probability of assignment by the mgSs random forest classifier and the Yue-Clayton similarity scores of assigned mgCSTs.
All bioinformatic and statistical analyses are available in R Markdown notebooks (Supplementary file S7 and file S8).Research reported in this publication was supported in part by the National Institute for Allergy and Infectious Diseases of the National Institutes of Health under award numbers F32-AI136400 (JH), K01-AI163413 (JH), U19AI084044 (JR), UH2AI083264 (JR), R01-AI116799 (RB), and the National Institute for Nursing Research of the National Institutes of Health under award number R01NR015495 (JR).The funders had no role in study design, data collection, and interpretation, or the decision to submit the work for publication.

Fig. 4
Fig. 4 Clinically diagnosed Amsel bacterial vaginosis (a) and symptomatic Amsel bacterial vaginosis (b) associated with mgCSTs.a Within each mgCST, the total number of clinical evaluations per mgCST is indicated.The black bars indicate the proportion of negative Amsel-BV diagnoses, and the colored bars indicate the positive Amsel-BV diagnoses.Within-mgCST proportions were statistically compared to the study-wide proportions of Amsel-BV diagnoses.b Of positive Amsel-BV diagnoses in a, the proportions of asymptomatic (light gray) and symptomatic (dark gray) Amsel-BV diagnoses are shown within each mgCST.Within-mgCST proportions were statistically compared to the study-wide proportions of asymptomatic to symptomatic Amsel-BV diagnoses.(*p < 0.05, **p < 0.01, ***p < 0.001)

Fig. 5 a
Fig. 5 a D-lactate dehydrogenase orthologs in VIRGO compared to functionally validated reference, P30901.2.The presence of D-lactate dehydrogenase orthologs differs by L. crispatus mgSs with mgSs 2 missing V1806611 which is 96% identical to the functionally validated ortholog P30901.2.b The proportion of samples with low vs. high vaginal pH differed by the L. crispatus mgSs present (reference was based on the proportions among samples containing any L. crispatus mgSs: 40.7% low and 59.4% high).c The estimated number of L. crispatus strains differed by the L. crispatus mgSs present.d Shannon diversity of the vaginal microbiome differed by the L. crispatus mgSs present.e Microbiome stability differed by L. crispatus mgCST

Fig. 6 a
Fig. 6 a The total number of clinical evaluations per L. iners mgSs are indicated.The black bars indicate the proportion of negative Amsel-BV diagnoses, and the colored bars indicate the positive Amsel-BV diagnoses.b Gene presence map represents gene content of L. iners mgSs (columns) and L. iners gene clusters (rows).c The total number of clinical evaluations per L. iners gene cluster is indicated.The black bars indicate the proportion of negative Amsel-BV diagnoses, and the colored bars indicate the positive Amsel-BV diagnoses.Within-gene group proportions were statistically compared to the study-wide proportions of Amsel-BV for any samples containing a L. iners gene cluster.(*p < 0.05, **p < 0.01, ***p < 0.001) (PRJEB24147), the Longitudinal Study of Vaginal Flora and Incident STI (LSVF, dbGaP project phs002367)