### The finite Dirichlet multinomial mixture model

First, we briefly review the finite DMM model that describes the heterogeneity of cross-sample variability among microbiome species [5]. This model allows a dataset to be generated by a mixture of K metacommunities instead of a single metacommunity. The key concepts behind the DMM model are as follows:

Given a microbiome dataset consisting of *N* community samples and *S* taxonomic units (or species), the observed count of the \(i^{th}\) community for \(j^{th}\) taxonomic unit is denoted as \(X_{ij}\) (*i* = 1, ..., *N*; *j* = 1, ..., *S*). The total number of counts (i.e., sequence reads) from the \(i^{th}\) community sample is \(J_{i} = \sum \nolimits _{j=1}^S X_{ij}\). The DMM model [5] considers a vector \(\overrightarrow{X_{i}} = (X_{i1}, \ldots , X_{iS})\), drawn from a multinomial distribution with community vector \(\overrightarrow{p_{i}} = (p_{i1}, \ldots , p_{iS})\) as follows:

$$\begin{aligned} p \left( \overrightarrow{X_{i}} | J_{i}, \overrightarrow{p_{i}} \right) \sim \mathrm {Multi} \left( J_{i}, \overrightarrow{p_{i}} \right) \end{aligned}$$

(1)

where \(p_{ij}\) is the probability that a single read in the \(i^{th}\) community belongs to the \(j^{th}\) taxonomic unit. The DMM model defines a mixture of K Dirichlets for the multinomial parameter probability vectors \(\overrightarrow{p_{i}}\). \(\overrightarrow{\alpha _{k}} = (\alpha _{k1}, \ldots , \alpha _{kS})\) are the parameters of the Dirichlet distribution representing the \(k^{th}\) metacommunity (or cluster), and \(\boldsymbol{\pi } = (\pi _{1}, \ldots , \pi _{K})\) represents the mixing coefficients with \(\sum \nolimits _{k=1}^{K} \pi _k = 1\) and \(\pi _k \ge 0\) for \(k \in (1, \ldots , K)\). The finite DMM model considers the case where the number of metacommunities, K, is fixed. Each sample is assumed to be drawn from its own community vector \(\overrightarrow{p_{i}}\), which is in turn derived from one of the K metacommunities. The DMM model introduces the allocation variable \(\overrightarrow{Z_{i}} = (Z_{i1}, Z_{i2}, \ldots , Z_{iK})\), where \(Z_{ik} \in \{0,1\}\) and \(\sum \nolimits _{k=1}^K Z_{ik} = 1\). If \(\overrightarrow{X_{i}}\) belongs to the \(k^{th}\) metacommunity (i.e., the \(k^{th}\) cluster), then \(Z_{ik}\) is one; otherwise, it is zero. The distribution of **Z** follows the categorical distribution \(p \left( \overrightarrow{Z_{i}} | \boldsymbol{\pi } \right) = \prod _{k=1}^K \pi _k^{Z_{ik}}\). Therefore, Eq. (1) can be rewritten by marginalizing the multinomial parameters as follows [5]:

$$\begin{aligned} p \left( \boldsymbol{X} | \boldsymbol{Z}, \boldsymbol{\alpha } \right) = \prod\limits_{i=1}^N \prod\limits_{k=1}^K \left( \frac{B \left( \overrightarrow{\alpha _k} + \overrightarrow{X_{i}} \right) }{B \left( \overrightarrow{\alpha _k} \right) } J_{i}! \prod\limits_{j=1}^S \frac{1}{X_{ij}!} \right) ^{Z_{ik}} \end{aligned}$$

(2)

where the function B is the multinomial beta function \(B \left( \overrightarrow{\alpha _k} \right) = \frac{\prod _{j=1}^S \Gamma (\alpha _{kj})}{\Gamma \left( \sum \nolimits _{j=1}^S \alpha _{kj} \right) }\) and \(B \left( \overrightarrow{\alpha _k} + \overrightarrow{X_{i}} \right) = \frac{\prod _{j=1}^S \Gamma (\alpha _{kj} + X_{ij})}{\Gamma \left( \sum \nolimits _{j=1}^S \left( \alpha _{kj} + X_{ij} \right) \right) }\).
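As a concrete check of Eq. (2), the per-sample Dirichlet-multinomial log-likelihood can be evaluated stably on the log scale with `scipy.special.gammaln`. This is a minimal sketch; the function name and interface are illustrative, not part of the published software.

```python
import numpy as np
from scipy.special import gammaln

def dm_log_likelihood(x, alpha):
    """Log-likelihood of one count vector x under a Dirichlet-multinomial
    with parameter vector alpha, i.e. the bracketed term in Eq. (2):
    log B(alpha + x) - log B(alpha) + log J! - sum_j log x_j!,
    computed via gammaln for numerical stability."""
    x = np.asarray(x, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    J = x.sum()
    log_B_alpha = gammaln(alpha).sum() - gammaln(alpha.sum())
    log_B_alpha_x = gammaln(alpha + x).sum() - gammaln((alpha + x).sum())
    log_multinom = gammaln(J + 1.0) - gammaln(x + 1.0).sum()
    return log_B_alpha_x - log_B_alpha + log_multinom
```

A quick sanity check: for S = 2 and \(\alpha = (1, 1)\) the Dirichlet-multinomial reduces to a uniform distribution over the J + 1 possible splits of the reads.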

### The infinite Dirichlet multinomial mixture model with variable selection

The goal is to consider the number of metacommunities (*K*) as a random variable. To achieve this, it is assumed that the prior distribution of the mixing coefficients \(\boldsymbol{\pi }\) follows a Dirichlet process prior [22]. The stick-breaking representation [23, 24], which is a straightforward constructive definition of the Dirichlet process, is adopted to construct the infinite DMM model proposed in this study. This is defined as follows:

$$\begin{aligned}&p \left( \gamma _{k} \right) \sim \mathrm {Beta} \left( 1, \nu \right) \\&\pi _{k} = \gamma _{k} \prod\limits_{k'=1}^{k-1}(1 - \gamma _{k'}) \end{aligned}$$

where \(\pi _{k}\) is the mixing proportion of an infinite number of successively broken sticks; the independent random variables \(\gamma _{k}\) \(\left( k \in \left[ 1, \ldots , K \right] \right)\) represent proportions that are sequentially broken from the remaining length, \(\prod _{k'=1}^{k-1}(1 - \gamma _{k'})\), of a unit-length stick; and \(\nu\) is the total mass parameter of the Dirichlet process. It is assumed that each community sample \(\overrightarrow{X_{i}}\) is generated from the DMM model with a countably infinite number of clusters (or metacommunities). Therefore, Eq. (2) can be rewritten as

$$\begin{aligned} p \left( \boldsymbol{X} | \boldsymbol{Z}, \boldsymbol{\alpha } \right) = \prod\limits_{i=1}^N \prod\limits_{k=1}^{\infty } \left( \frac{B \left( \overrightarrow{\alpha _k} + \overrightarrow{X_{i}} \right) }{B \left( \overrightarrow{\alpha _k} \right) } J_{i}! \prod\limits_{j=1}^S \frac{1}{X_{ij}!} \right) ^{Z_{ik}} \end{aligned}$$

(3)
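To make the stick-breaking construction above concrete, the following sketch draws a truncated set of mixing weights. The function name, the seed, and the particular choices of `nu` and truncation level `K` are illustrative, not values prescribed by the model.

```python
import numpy as np

def stick_breaking_weights(nu, K, rng):
    """Truncated stick-breaking draw: gamma_k ~ Beta(1, nu) and
    pi_k = gamma_k * prod_{k' < k} (1 - gamma_k')."""
    gamma = rng.beta(1.0, nu, size=K)
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - gamma)[:-1]))
    return gamma * remaining

rng = np.random.default_rng(0)
pi = stick_breaking_weights(nu=1.0, K=10, rng=rng)
# The K weights are non-negative and sum to at most 1; the leftover
# mass corresponds to the truncated tail of the infinite mixture.
```

Smaller values of \(\nu\) concentrate the mass on the first few sticks, which is why the number of occupied clusters can adapt to the data.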

All taxonomic units in the DMM model are assumed to be equally important for clustering microbial community data. However, this is not realistic in microbiome studies, because numerous microbiome species (reflected in taxonomic units) and functions may be irrelevant to the cluster structure and can significantly degrade the performance of clustering algorithms [30]. To overcome this problem, we propose that the count of a given taxonomic unit, \(X_{ij}\), be generated from a mixture of two Dirichlet-multinomial distributions: the first is assumed to generate a core set of the most significant microbial taxonomic units and differs across metacommunities (i.e., clusters), whereas the second is assumed to generate the unimportant taxonomic units and is common to all metacommunities (i.e., all clusters). Thus, we can write the likelihood of the observed microbiome dataset \(\boldsymbol{X}\) under the infinite DMM model with microbiome taxonomic unit selection as follows:

$$\begin{aligned} p \left( \boldsymbol{X} | \boldsymbol{Z}, \boldsymbol{\phi }, \boldsymbol{\alpha }, \boldsymbol{\beta } \right) = \prod\limits_{i=1}^N \prod\limits_{k=1}^{\infty } \left[ \begin{array}{c} \left( \frac{B \left( \overrightarrow{\alpha _k} + \overrightarrow{X_{i}} \right) }{B \left( \overrightarrow{\alpha _k} \right) } J_{i}! \prod\limits_{j=1}^S \frac{1}{X_{ij}!} \right) ^{\phi _{ij}}\\ \left( \frac{B \left( \boldsymbol{\beta } + \overrightarrow{X_{i}} \right) }{B \left( \boldsymbol{\beta } \right) } J_{i}! \prod\limits_{j=1}^S \frac{1}{X_{ij}!} \right) ^{1 - \phi _{ij}} \end{array}\right] ^{Z_{ik}} \end{aligned}$$

(4)

where \(\phi _{ij}\) is an indicator variable: \(\phi _{ij} = 1\) indicates that the \(j^{th}\) taxonomic unit of the \(i^{th}\) community is important for clustering and follows a Dirichlet multinomial distribution with \(\boldsymbol{\alpha }\), and \(\phi _{ij} = 0\) indicates that the \(j^{th}\) taxonomic unit of the \(i^{th}\) community is unimportant for clustering and follows a Dirichlet multinomial distribution with \(\boldsymbol{\beta }\). \(\phi _{ij}\) characterizes the importance of each taxonomic unit in a sample. Even when samples are assigned to the same cluster, each sample can have a different group of important taxonomic units selected in the clustering process. \(B \left( \overrightarrow{\alpha _k} \right)\) and \(B \left( \overrightarrow{\alpha _k} + \overrightarrow{X_{i}} \right)\) are the multinomial beta functions for a core set of taxonomic units that significantly represent the cluster. For unimportant species, the multinomial beta functions are \(B \left( \boldsymbol{\beta } \right) = \frac{\prod _{j=1}^S \Gamma (\beta _{j})}{\Gamma \left( \sum \nolimits _{j=1}^S \beta _{j} \right) }\) and \(B \left( \boldsymbol{\beta } + \overrightarrow{X_{i}} \right) = \frac{\prod _{j=1}^S \Gamma (\beta _{j} + X_{ij})}{\Gamma \left( \sum \nolimits _{j=1}^S \left( \beta _{j} + X_{ij} \right) \right) }\). The prior distribution of the indicator variable of microbiome selection \(\boldsymbol{\phi }\) is defined as follows:

$$\begin{aligned} p \left( \boldsymbol{\phi } |\boldsymbol{\epsilon } \right) = \prod\limits_{i=1}^N \prod\limits_{j=1}^S \epsilon _{j_{1}}^{\phi _{ij}} \epsilon _{j_{2}}^{1 - \phi _{ij}} \end{aligned}$$

where each \(\phi _{ij}\) follows a Bernoulli distribution such that \(p \left( \phi _{ij} = 1\right) = \epsilon _{j_{1}}\) and \(p \left( \phi _{ij} = 0\right) = \epsilon _{j_{2}}\) with \(\epsilon _{j_{1}} + \epsilon _{j_{2}} = 1\) [11]. Furthermore, we place Beta distributions over \(\boldsymbol{\epsilon }\) [31]:

$$\begin{aligned} p \left( \boldsymbol{\epsilon } |\boldsymbol{\xi } \right) = \prod\limits_{j=1}^S \frac{\Gamma (\xi _{1} + \xi _{2})}{\Gamma (\xi _{1}) \Gamma (\xi _{2})} \epsilon _{j_{1}}^{\xi _{1}-1} \epsilon _{j_{2}}^{\xi _{2}-1} \end{aligned}$$

where the hyperparameters \((\xi _{1}, \xi _{2})\) are constrained to be positive to ensure that the distribution can be normalized. The prior distributions of \(\boldsymbol{\alpha }\) and \(\boldsymbol{\beta }\) follow Dirichlet distributions with hyperparameters \(\boldsymbol{\zeta }\) and \(\boldsymbol{\eta }\):

$$\begin{aligned} p \left( \boldsymbol{\alpha } |\boldsymbol{\zeta } \right) = \prod\limits_{k=1}^{\infty } \frac{\Gamma \left( \sum \nolimits _{j=1}^S \zeta _{kj} \right) }{ \prod _{j=1}^S \Gamma (\zeta _{kj})} \prod\limits_{j=1}^S \alpha _{kj}^{\zeta _{kj}-1} \end{aligned}$$

$$\begin{aligned} p \left( \boldsymbol{\beta } |\boldsymbol{\eta } \right) = \frac{\Gamma \left( \sum \nolimits _{j=1}^S \eta _{j} \right) }{ \prod _{j=1}^S \Gamma (\eta _{j})} \prod\limits_{j=1}^S \beta _{j}^{\eta _{j}-1} \end{aligned}$$

In our computational experiments, we tried both Gamma and Dirichlet distributions as priors for \(\boldsymbol{\alpha }\) and \(\boldsymbol{\beta }\). However, the scale parameter of the Gamma distribution failed to produce good updated values, whereas the parameters of the Dirichlet distributions were updated stably at each iteration; we therefore chose Dirichlet priors.

### Stochastic variational variable selection approach

In this section, we propose an SVI method [12,13,14] for fitting the infinite DMM model with feature selection. The basic idea of variational learning in the Bayesian approach is to approximate the posterior distribution with a computationally tractable function called the variational distribution. The variational parameters, which specify the variational distribution, are estimated by minimizing the Kullback-Leibler (KL) divergence from the variational distribution to the posterior distribution. As a result, the posterior distribution is estimated by numerical optimization without invoking simulation approaches such as MCMC algorithms.

Given the observed count dataset \(\boldsymbol{X}\), the infinite DMM model has a set of parameters \((\Xi )\) consisting of the stick-breaking proportions \((\boldsymbol{\gamma })\), the allocation variables \((\boldsymbol{Z})\), the indicator variables of taxonomic unit selection \((\boldsymbol{\phi })\), and the Dirichlet parameters \((\boldsymbol{\alpha , \beta })\). In the initial step of the variational approach, we propose a member of a tractable family of probability distributions \(q \left( \Xi | \Theta \right)\), called the variational distribution, which approximates the true intractable posterior distribution \(p \left( \Xi | \boldsymbol{X} \right)\). This distribution is parameterized by free parameters, called variational parameters \(\Theta\).

Subsequently, variational inference estimates these parameters to find a distribution close to the true intractable posterior distribution of interest. The distance between the distributions \(p \left( \Xi | \boldsymbol{X} \right)\) and \(q \left( \Xi | \Theta \right)\) is evaluated using KL divergence, defined as follows:

$$\begin{aligned}&\mathrm {KL} \left[ q \left( \Xi | \Theta \right) | p \left( \Xi | \boldsymbol{X} \right) \right] \nonumber \\&\quad = \mathrm {E_{q}} \left[ \mathrm {log} \left( q \left( \Xi | \Theta \right) \right) \right] - \mathrm {E_{q}} \left[ \mathrm {log} \left( p \left( \Xi | \boldsymbol{X} \right) \right) \right] \nonumber \\&\quad = \mathrm {E_{q}} \left[ \mathrm {log} \left( q \left( \Xi | \Theta \right) \right) \right] - \mathrm {E_{q}} \left[ \mathrm {log} \left( p \left( \Xi , \boldsymbol{X} \right) \right) \right] + \mathrm {log} \left( p \left( \boldsymbol{X} \right) \right) \end{aligned}$$

(5)

The log marginal probability \(\mathrm {log} \left( p \left( \boldsymbol{X} \right) \right)\) in Eq. (5), which causes computational difficulty in the Bayesian approach, can be treated as a constant term in the numerical optimization for estimating the variational parameters as follows:

$$\begin{aligned} \Theta ^* = \mathop {\mathrm {argmin}}\limits _{\Theta } \mathrm {KL} \left[ q \left( \Xi | \Theta \right) | p \left( \Xi | \boldsymbol{X} \right) \right] \end{aligned}$$

In addition, the term \(\mathrm {log} \left( p \left( \boldsymbol{X} \right) \right)\), which is known as the evidence of \(\boldsymbol{X}\), can be decomposed as \(\mathrm {log} \left( p \left( \boldsymbol{X} \right) \right) = \mathcal {L} \left[ q \left( \Xi | \Theta \right) \right] + \mathrm {KL} \left[ q \left( \Xi | \Theta \right) | p \left( \Xi | \boldsymbol{X} \right) \right]\). The variational inference maximizes the computationally feasible target function defined as:

$$\begin{aligned} \mathcal {L} \left[ q \left( \Xi | \Theta \right) \right] = \mathrm {E_{q}} \left[ \mathrm {log} \left( p \left( \Xi , \boldsymbol{X} \right) \right) \right] - \mathrm {E_{q}} \left[ \mathrm {log} \left( q \left( \Xi | \Theta \right) \right) \right] \end{aligned}$$

(6)

where Eq. (6) is the Evidence Lower Bound (ELBO) [12]. \(\mathcal {L} \left[ q \left( \Xi | \Theta \right) \right]\) can be considered a lower bound for \(\mathrm {log} \left( p \left( \boldsymbol{X} \right) \right)\). Maximizing the ELBO is equivalent to minimizing the KL divergence; at the maximum, the variational distribution \(q \left( \Xi | \Theta \right)\) best approximates the true posterior distribution \(p \left( \Xi | \boldsymbol{X} \right)\). However, direct application of the variational approach is infeasible. Therefore, a mean-field approach is adopted to factorize the posterior distribution into disjoint tractable distributions. Under the factorization assumption of mean-field variational approximations [13, 14], each variable in the variational distribution \(q \left( \Xi | \Theta \right)\) is independent. Furthermore, we use a truncated stick-breaking representation to approximate the posterior Dirichlet process. The truncation level \(\mathrm {K}\) is not part of the prior model specification; the variational approach can optimize the value of \(\mathrm {K}\) because it becomes a variational parameter [13, 32, 33]. The family of variational distributions in the infinite DMM model with the selection of representative taxonomic units can be expressed as follows:

$$\begin{aligned}&q \left( \boldsymbol{Z, \phi , \gamma , \epsilon , \alpha , \beta } | \Theta \right) \nonumber \\&=\prod\limits_{i=1}^{N} \prod\limits_{k=1}^{{K}} q \left( Z_{ik} \right) \times \prod\limits_{k=1}^{{K}} q \left( \gamma _{k} \right) \times \prod\limits_{i=1}^{N} \prod\limits_{j=1}^{S}q \left( \phi _{ij} \right) \times q \left( \boldsymbol{\epsilon } \right) \times q \left( \boldsymbol{\alpha } \right) \times q \left( \boldsymbol{\beta } \right) \end{aligned}$$

(7)

where

$$\begin{aligned} q \left( \boldsymbol{Z} \right)= & {} \prod\limits_{i=1}^{N} \prod\limits_{k=1}^{{K}} r_{ik}^{Z_{ik}} \nonumber \\ q \left( \boldsymbol{\phi } \right)= & {} \prod\limits_{i=1}^{N} \prod\limits_{j=1}^{S} f_{ij}^{\phi _{ij}} \left( 1 - f_{ij} \right) ^{1-\phi _{ij}} \nonumber \\ q \left( \boldsymbol{\gamma } \right)\sim & {} \prod\limits_{k=1}^{{K}} \mathrm {Beta} \left( \gamma _{k} | \vartheta _{k}, \vartheta _{k}^{'} \right) \nonumber \\ q \left( \boldsymbol{\epsilon } \right)\sim & {} \mathrm {Dirichlet} \left( \boldsymbol{\epsilon } | \boldsymbol{\xi ^{*}} \right) \nonumber \\ q \left( \boldsymbol{\alpha } \right)\sim & {} \mathrm {Dirichlet} \left( \boldsymbol{\alpha } | \boldsymbol{\lambda ^{*}} \right) \nonumber \\ q \left( \boldsymbol{\beta } \right)\sim & {} \mathrm {Dirichlet} \left( \boldsymbol{\beta } | \boldsymbol{\iota ^{*}} \right) \end{aligned}$$

(8)

The set of free variational parameters \(\Theta\) includes \(\boldsymbol{r}, \boldsymbol{\vartheta }, \boldsymbol{\vartheta {'}}, \boldsymbol{f}, \boldsymbol{\xi ^{*}}, \boldsymbol{\lambda ^{*}}, \boldsymbol{\iota ^{*}}\). We use the variational distributions from exponential families to guarantee tractable computations of expectations.

The key idea of SVI is to divide the variational variables into two subgroups: the local variables \(\left[ \Xi _l \in \left( \boldsymbol{Z, \phi } \right) \right]\), which are per-datapoint latent variables, and the global variables \(\left[ \Xi _g \in \left( \boldsymbol{\gamma , \epsilon , \alpha , \beta } \right) \right]\), which potentially govern all the data. The \(i^{th}\) local variable \(Z_{ik}\) of the mixture component, which represents the allocation of sample *i*, is governed by the local variational parameter \(r_{ik}\). In addition, the local variational parameter \(f_{ij}\) is proposed to capture the local variable \(\phi _{ij}\), which represents the selection status of the \(j^{th}\) taxonomic unit in the \(i^{th}\) community. The coordinate ascent algorithm is used to solve the optimization problems for these variational variables [13, 14]. The main idea of this approach is to optimize each factor of the mean-field variational distribution while keeping the others fixed. For example, we obtain the optimal solution for the local variable \(Z_{ik}\) by applying the variational distributions in Eqs. (7) and (8) to the ELBO in Eq. (6). We omit terms that do not depend on the variational parameter of \(Z_{ik}\). The logarithm of the optimal value of \(q \left( Z_{ik} \right)\) is proportional to the expected logarithm of the joint distribution as follows:

$$\begin{aligned}&\mathrm {log} q^{*} \left( Z_{ik} \right) \nonumber \\&\propto \sum\limits_{j=1}^S \mathrm {E}_{q} \left[ \phi _{ij} \right] \left( \mathrm {E}_{q} \left[ \mathrm {log} \left( \frac{ \Gamma \left( \sum_{j=1}^S \alpha _{kj} \right) }{\Gamma \left( \sum _{j=1}^S X_{ij} + \sum _{j=1}^S \alpha _{kj} \right) } \right) \right] \right) \nonumber \\&+ \sum\limits_{j=1}^S \mathrm {E}_{q} \left[ \phi _{ij} \right] \left( \mathrm {E}_{q} \left[ \mathrm {log} \left( \frac{\Gamma \left( X_{ij} + \alpha _{kj}\right) }{\Gamma (\alpha _{kj})} \right) \right] \right) \nonumber \\&+ \sum\limits_{j=1}^S \mathrm {E}_{q} \left[ \phi _{ij} \right] \left( \mathrm {log} (J_{i}!) + \mathrm {log} \left( \frac{1}{X_{ij}!} \right) \right) \nonumber \\&+ \mathrm {E}_{q} \left[ \mathrm {log} (\gamma _{k}) \right] + \sum\limits_{k'=1}^{k-1} \mathrm {E}_{q} \left[ \mathrm {log} (1-\gamma _{k'}) \right] \end{aligned}$$

(9)

As \(\gamma _k\) follows a beta distribution, we can obtain analytically tractable solutions for \(\mathrm {E}_{q} \left[ \mathrm {log} (\gamma _{k}) \right]\) and \(\mathrm {E}_{q} \left[ \mathrm {log} (1-\gamma _{k'}) \right]\). However, the first and second terms of Eq. (9) do not have the same form as the logarithm of the Dirichlet prior distribution, so analytically tractable solutions cannot be obtained directly. The intractable expectations could be computed using the Metropolis-Hastings algorithm or numerical integration. Nevertheless, such simulation approaches significantly increase the computational burden given the huge dimensionality of microbial metagenomics datasets [5]. Therefore, we adopt a Taylor expansion to obtain nearly optimal analytically tractable solutions for the first and second terms of Eq. (9), so that these computational burdens are avoided [20, 21, 34]. A nearly optimal analytically tractable value of \(q \left( \phi _{ij} \right)\) can be obtained using the same approach. The mathematical details of the Taylor expansion and variational objective functions are provided in the Supplementary Material.
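The tractable Beta expectations above, together with the normalization of the unnormalized \(\mathrm {log}\, q^{*} \left( Z_{ik} \right)\) values into responsibilities \(r_{ik}\), can be sketched as follows. The function names are illustrative.

```python
import numpy as np
from scipy.special import digamma

def beta_log_expectations(theta, theta_prime):
    """Closed-form expectations for gamma_k ~ Beta(theta, theta_prime):
    E[log gamma_k]     = psi(theta)       - psi(theta + theta_prime)
    E[log(1-gamma_k)]  = psi(theta_prime) - psi(theta + theta_prime)
    where psi is the digamma function."""
    total = digamma(theta + theta_prime)
    return digamma(theta) - total, digamma(theta_prime) - total

def responsibilities(log_r):
    """Normalize rows of unnormalized log q*(Z_ik) (rows = samples,
    columns = clusters) into r_ik with the log-sum-exp trick."""
    log_r = log_r - log_r.max(axis=1, keepdims=True)
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)
```

For the uniform case Beta(1, 1), both expectations equal \(-1\), which is a convenient correctness check.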

The global variational parameters \(\left[ \Theta _{g} \in \left( \vartheta _{k}, \vartheta _{k}^{'}, \xi ^{*}, \lambda _{kj}^{*}, \iota _{j}^{*} \right) \right]\) are proposed to govern the global variables \(\Xi _g\). The SVI approach uses stochastic gradient ascent to estimate the global variational parameters [14]. This is mainly because, as microbiome datasets grow, each iteration of the coordinate ascent algorithm becomes more computationally expensive: its computational structure requires a pass over the entire dataset at every iteration. The SVI, in contrast, is based on a stochastic approximation that iteratively subsamples the data to update the values of the local and global variational parameters. A further advantage of this computational strategy is that it helps the algorithm avoid shallow local optima of complex objective functions. Furthermore, natural gradients are an important part of the SVI approach that increase the scalability of variational inference and allow for the analysis of vast amounts of data [35,36,37]. Natural gradients adjust the direction of conventional gradients to account for the geometric structure of the probability parameters, using the Riemannian metric defined by the Fisher information matrix. Therefore, natural gradients are not only cheaper to compute but also converge faster than conventional gradients.

Principally, we seek a noisy but unbiased and cheap-to-compute natural gradient to reach the optimum of the objective function of the infinite DMM model. First, at each iteration we uniformly sample one community sample \(\overrightarrow{X_{i}}\), together with its allocation variable \(\overrightarrow{Z_{i}}\) and indicator variable \(\overrightarrow{\phi _{i}}\), and form a dataset \(\left[ \overrightarrow{X_{n}}^{(N)}, \overrightarrow{Z_{n}}^{(N)}, \overrightarrow{\phi _{n}}^{(N)}\right]\) consisting of *N* replicates of the sampled point. Next, noisy estimates of the natural gradient are computed with respect to each global variational parameter \(\Theta _g\) given the N replicates of the sampled data point. Using these gradients, the values of \(\Theta _g\) are updated at iteration *m* given the local variational parameters \(\left[ \Theta _{l} \in \left( r_{ik}, f_{ij} \right) \right]\) as follows:

$$\begin{aligned} \widehat{\nabla _{\Theta _{g}}} \mathcal {L}= & {} \mathrm {prior} + N \left( \mathrm {E}_{ \Theta _{l}} \left[ t \left( \overrightarrow{X_{n}}, \overrightarrow{Z_{n}}, \overrightarrow{\phi _{n}} \right) , 1\right] \right) - \Theta _{g} \nonumber \\ \Theta _{g}^{(m)}= & {} \Theta _{g}^{(m-1)} + \rho _m \widehat{\nabla _{\Theta _{g}}} \mathcal {L} \end{aligned}$$

where t(.) denotes the sufficient statistics in the exponential family and \(\rho _m\) denotes the step size at iteration *m*. Owing to the subsampling strategies, the SVI significantly accelerates the computational processes by avoiding expensive sums in the ELBO when the dimensionality of the microbial metagenomics is large. The mathematical explanations of the SVI are described in the Supplementary Material.
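The update above can be sketched as a single stochastic natural-gradient step. The Robbins-Monro step-size schedule \(\rho _m = (m + \tau )^{-\kappa }\) and the particular values of `tau` and `kappa` are illustrative assumptions (a standard choice in SVI, with \(\kappa \in (0.5, 1]\) for convergence), not settings stated by the source.

```python
import numpy as np

def svi_step(theta_g, prior, suff_stat_expect, N, m, tau=1.0, kappa=0.6):
    """One stochastic natural-gradient step for a global variational
    parameter, mirroring the update above:
      grad    = prior + N * E_local[t(x_n, z_n, phi_n)] - theta_g
      theta_g = theta_g + rho_m * grad
    with an illustrative step size rho_m = (m + tau)^(-kappa)."""
    rho_m = (m + tau) ** (-kappa)
    grad = prior + N * suff_stat_expect - theta_g
    return theta_g + rho_m * grad
```

Note that when \(\rho _m = 1\) the step lands exactly on the full (noisy) coordinate update `prior + N * E[t(.)]`; smaller step sizes average successive noisy updates.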

### Criteria to evaluate the performance of the approaches

We use the Adjusted Rand Index (ARI) [38] to measure the similarity between the true (or known) clusters and the clusters inferred by various algorithms. Given a dataset \(\boldsymbol{X}\) with n total samples, \(\boldsymbol{Z} = \left[ Z_1, \dots , Z_k \right]\) denotes the true cluster membership of \(\boldsymbol{X}\) in k clusters, and \(\boldsymbol{Z'} = \left[ Z'_1, \dots , Z'_{k'} \right]\) denotes an inferred cluster membership of \(\boldsymbol{X}\) in \(k'\) clusters. The Rand Index (RI) is calculated as follows:

$$\begin{aligned} R \left( \boldsymbol{Z, Z'} \right) = \frac{a+b}{n \left( n-1 \right) /2} \end{aligned}$$

where *a* denotes the number of times a pair of samples is assigned to the same cluster in both \(\boldsymbol{Z}\) and \(\boldsymbol{Z'}\), and *b* denotes the number of times a pair of samples is assigned to different clusters in both \(\boldsymbol{Z}\) and \(\boldsymbol{Z'}\). The RI values are in the range [0,1], where 1 represents perfect agreement between the true and inferred clusterings. The ARI normalizes the difference between the RI and its expected value under random labeling as follows:

$$\begin{aligned} \mathrm {ARI} = \frac{RI - \mathrm {E} (RI)}{\max (RI) - \mathrm {E} (RI)} \end{aligned}$$

where \(\mathrm {E} (RI)\) is the expected value of the RI.
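The pair-counting definition of the RI can be written out directly as a brute-force sketch (in practice, a library routine such as scikit-learn's `adjusted_rand_score` computes the ARI itself):

```python
import numpy as np
from itertools import combinations

def rand_index(z, z_prime):
    """Rand Index over all n(n-1)/2 sample pairs: (a + b) / [n(n-1)/2],
    where a counts pairs placed together in both clusterings and
    b counts pairs placed apart in both."""
    z, z_prime = np.asarray(z), np.asarray(z_prime)
    n = len(z)
    a = b = 0
    for i, j in combinations(range(n), 2):
        same_true = z[i] == z[j]
        same_pred = z_prime[i] == z_prime[j]
        a += same_true and same_pred
        b += (not same_true) and (not same_pred)
    return (a + b) / (n * (n - 1) / 2)
```

Because the RI compares co-membership of pairs rather than raw labels, a clustering that merely permutes the label names still scores 1.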

### Database description

#### Study inclusion and data acquisition

Dataset A represents the environmental microbiome data of our field experiments, which includes 196 drought-irrigation samples, 197 control-condition samples, and 888 microbiome species (or taxonomic units). The experimental details of dataset A are described in the Supplementary Material.

We also employ case-control 16S amplicon sequencing data from three published human microbiome datasets spanning three disease states: *Clostridium difficile* infection (CDI) [26], inflammatory bowel disease (IBD) [28], and obesity (OB) [27]. These datasets are available in the MicrobiomeHD database [39]. Dataset B represents the CDI dataset, which includes 183 diarrheal stool samples from 94 individuals with CDI and 89 diarrheal controls, 155 non-diarrheal control stool samples, and 3347 microbiome species (or taxonomic units). Dataset C represents the IBD dataset, which includes 146 IBD case samples, 16 non-IBD control samples, and 10,119 microbiome species (or taxonomic units). Dataset D represents the OB dataset, which is the largest and most challenging: 1081 fecal samples from 977 individuals and 55,964 microbiome species (or taxonomic units).

Finally, we use the variable region 3-5 (V35) of the 16S rRNA gene sequence dataset from the *HMP16SData* package in R to study the considerable variation in the composition of the healthy human microbiome [29]. Dataset E represents the stool community-type data from the *HMP16SData* package, which includes 319 samples and 11,747 microbiome species (or taxonomic units). Moreover, we use other R packages for graphical visualization of the microbiome datasets, such as the unweighted UniFrac distance and non-metric multidimensional scaling (NMDS) functions in the *phyloseq* package [40].

### Open-source software

The software is implemented in Python and uses standard libraries, such as NumPy and SciPy, for mathematical computations. It takes microbiome count data as a CSV file and outputs the inferred clusters and a core set of selected taxonomic units. The main options of the software tool are the maximum number of clusters, which bounds the estimated number of clusters, and the number of taxonomic units that users want to select. SVVS uses iterative optimization algorithms to estimate the parameters; thus, a convergence criterion implements the stopping rule: the SVVS algorithm stops when the change in the ELBO is less than 1e−3 (Supplementary Material). This convergence criterion is fixed across all datasets in this study. The number of iterations should be modified for datasets notably smaller or larger in scale than those considered here; this is a tunable option in the software. The software is available at https://github.com/tungtokyo1108/SVVS.
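The stopping rule can be sketched generically as follows. Here `update_step` is a hypothetical callable standing in for one SVVS optimization sweep; it is not the actual interface of the released package.

```python
def run_until_converged(update_step, init_state, tol=1e-3, max_iter=500):
    """Iterate variational updates until the change in the ELBO falls
    below tol (1e-3, matching the criterion in the text), or until a
    maximum iteration count is reached. `update_step` is a hypothetical
    callable returning (new_state, elbo)."""
    state, prev_elbo = init_state, -float("inf")
    for _ in range(max_iter):
        state, elbo = update_step(state)
        if abs(elbo - prev_elbo) < tol:
            break
        prev_elbo = elbo
    return state, elbo
```

The `max_iter` guard reflects the tunable iteration count mentioned above for datasets much smaller or larger than those in this study.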

In all our experiments, we initialized the truncation levels of the number of clusters to 10. We set the initial values of hyperparameter \(\nu\) of the stick-breaking representation to 0.1, the initial values of hyperparameters \(\boldsymbol{\zeta }\) and \(\boldsymbol{\eta }\) of the Dirichlet priors to 1, and those of hyperparameters \((\xi _{1}, \xi _{2})\) to 0.1 [41].

To select species based on the model, we calculate the average of \(\phi _{ij}\) over samples *i* after estimating the values of \(\phi _{ij}\), and rank microbiome species from the highest to the lowest average. Our package exports a table containing these ranked values, from which a user can select a core set of microbial species with the highest values. For example, Tables S1 and S2 show the average values of \(\phi _{ij}\) over samples *i*, arranged in descending order (largest first), for datasets A and B.
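The averaging and ranking step can be sketched as follows; the function name and the `species_names` label list are illustrative, not the exported table format of the package.

```python
import numpy as np

def rank_species_by_phi(phi, species_names):
    """Average the indicator estimates phi_ij over samples i (rows),
    then rank species from the highest to the lowest average, mirroring
    the exported table described above."""
    phi = np.asarray(phi, dtype=float)
    avg_phi = phi.mean(axis=0)            # average over samples i
    order = np.argsort(avg_phi)[::-1]     # descending order
    return [(species_names[j], float(avg_phi[j])) for j in order]
```

A user would then keep the species at the top of the returned list as the core set.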