 Methodology
 Open Access
 Published:
Constraining PERMANOVA and LDM to withinset comparisons by projection improves the efficiency of analyses of matched sets of microbiome data
Microbiome volume 9, Article number: 133 (2021)
Abstract
Background
Matchedset data arise frequently in microbiome studies. For example, we may collect pre and posttreatment samples from a set of individuals, or use important confounding variables to match data from case participants to one or more control participants. Thus, there is a need for statistical methods for data comprised of matched sets, to test hypotheses against traits of interest (e.g., clinical outcomes or environmental factors) at the community level and/or the operational taxonomic unit (OTU) level. Optimally, these methods should accommodate complex data such as those with unequal sample sizes across sets, confounders varying within sets, and continuous traits of interest.
Methods
PERMANOVA is a commonly used distancebased method for testing hypotheses at the community level. We have also developed the linear decomposition model (LDM) that unifies the communitylevel and OTUlevel tests into one framework. Here we present a new strategy that can be used with both PERMANOVA and the LDM for analyzing matchedset data. We propose to include an indicator variable for each set as covariates, so as to constrain comparisons between samples within a set, and also permute traits within each set, which can account for exchangeable sample correlations. The flexible nature of PERMANOVA and the LDM allows discrete or continuous traits or interactions to be tested, withinset confounders to be adjusted, and unbalanced data to be fully exploited.
Results
Our simulations indicate that our proposed strategy outperformed alternative strategies, including the commonly used one that utilizes restricted permutation only, in a wide range of scenarios. Using simulation, we also explored optimal designs for matchedset studies. The flexibility of PERMANOVA and the LDM for a variety of matchedset microbiome data is illustrated by the analysis of data from two real studies.
Conclusions
Including set indicator variables and permuting within sets when analyzing matchedset data with PERMANOVA or the LDM is a strategy that performs well and is capable of handling the complex data structures that frequently occur in microbiome studies.
Background
Many studies of the microbiome have matchedpair or matchedset designs, in which data naturally cluster into sets but the samples within each set vary in the traits of interest (e.g., clinical outcomes, environmental factors). Matching allows us to study the association between the microbiome and the traits of interest by comparing samples within sets, ignoring the variability in microbiomes between sets. For example, we may collect paired samples pre and posttreatment from a set of subjects to assess the treatment effects on the microbiome. We may also collect matched casecontrol subjects who were matched on important confounding factors to facilitate casecontrol comparison. Matching is advantageous when the signaltonoise ratio is larger within than between sets. In matched studies, complexities may occur when the data are unbalanced (e.g., having unequal ratio of casetocontrol samples per set), there exist additional confounders that vary within each set, or some traits of interest are continuous.
Only two methods have been developed specifically for analyzing matchedset microbiome data; both are limited to paired data without any withinpair covariates. Shi and Li [1] proposed a pairedmultinomial distribution that is only applicable when the sample size is larger than the number of taxa. Zhao et al. [2] developed a generalized paired Hotelling’s test that relaxed the restriction of Shi and Li’s method, but can only provide tests at the community level. Matchedset data may be also be considered as a special case of longitudinal data with an exchangeable correlation; as a result, some methods for analyzing longitudinal data can be used to analyze matchedset microbiome data. These methods are applied separately to each operational taxonomic unit (OTU; here we use “OTU” generically to refer to any feature such as amplicon sequence variants or taxonomic/functional grouping of microbial sequences). An appealing choice is the linear mixedeffects model (LMM), which has typically been applied to arcsinroottransformed relative abundance data to improve normality [3–5]. A zeroinflated beta regression model with random effects (ZIBR) has also been developed specifically for modeling (untransformed) relative abundance data [6]. Both methods are based on fully parametric models and so may not fit every OTU well. Further, some strategies have been proposed to extend existing tests of the microbiome to analyzing matchedset data. DESeq2 [7], originally a method for RNASeq data, has frequently been used for oneOTUatatime analyses of microbiome data. The manual for the DESeq2 software package recommends that indicators of set membership should be included as terms in the design formula, but DESeq2 does not account for withinset correlations. PERMANOVA [8] is a commonly used distancebased method for testing hypotheses at the community level. The documentation for the two implementations of PERMANOVA, adonis2 (R package vegan) and permanovaFL (R package ldm [9]) that differ in their permutation schemes, suggests that restricted permutation within each set should be performed when analyzing matchedset data. However, the performance of any of these strategies has not yet been evaluated, especially in studies with unbalanced data or withinset confounders.
We previously introduced the linear decomposition model (LDM) [9] primarily for analyzing independent data. The LDM provides tests at both the community level and the individual OTU level. These tests are conducted in a unified manner such that the findings of a communitylevel test can be resolved with the findings at the individual OTU level. Both PERMANOVA and the LDM are regression and permutationbased, making them readily extendable to analyzing matchedset data while accounting for the aforementioned data complexities. Although we considered withincluster permutation in the LDM paper [9], that was in a context in which variables of interest could be below the cluster level. We did not explicitly consider the matchedset data we describe here from either a theoretical or numerical point of view.
In this article, we develop a new strategy for using PERMANOVA and the LDM to analyze a wide range of matchedset microbiome data, for testing both communitylevel hypotheses and individual OTUs whenever applicable. In the “Methods” section, we describe our strategy and establish a connection with the existing strategy of restricted permutation. In the “Results” section, we present the simulation studies and the application to two real microbiome studies with matchedset designs. We conclude with a “Discussion” section.
Methods
We will refer to each observation as a “sample” and refer to the experimental unit that contributes one or more observations as a “set.” We allow each set to be comprised of an arbitrary number of samples. We also allow multiple discrete and/or continuous traits to be tested and additional samplelevel (i.e., withinset) confounding covariates to be adjusted for. In a common scenario with a binary trait (e.g., a casecontrol status or a treatment or exposure variable), each set consists of one case sample and m (m≥1) control samples, usually referred to as 1:m matched data. We assume that, after all covariates (including the traits of interest) have been accounted for, the members of each set are exchangeable.
To present our strategy for analyzing matchedset data, we introduce a common notation to describe both PERMANOVA and the LDM. Both PERMANOVA and the LDM are linear models for which the effects of covariates (metadata) are summarized in a design matrix X. The rows of X correspond to samples while the columns of X correspond to the covariates. We may partition X by columns into K groups (which we call “submodels”) such that X=(X_{1},X_{2},…,X_{K}), where each X_{k} denotes a variable or set of variables we wish to test (jointly). For example, X_{k} may consist of indicator variables for levels of a single categorical variable, or a group of potential confounders that we wish to adjust for simultaneously. Both PERMANOVA and the LDM make the columns of X_{k}orthonormal to the columns of \(\phantom {\dot {i}\!}X_{k'}\) for k^{′}<k using projection (i.e., the GramSchmidt process). Thus, we require an ordering of the submodels, which leads to unambiguous interpretations of pvalues, that is, the test of each submodel is adjusted for the proceeding submodels.
For both PERMANOVA and the LDM, test statistics for the kth submodel can be expressed in terms of the quantity \(X_{k}^{\mathrm {T}}Y\). For PERMANOVA, Y is related to the (squared and Gowercentered) distance matrix Δ by Δ=YSY^{T}, where S is a diagonal matrix with diagonal elements equal to 1 or −1 corresponding to positive and negative eigenvalues of Δ, respectively. For the LDM, Y is the (columncentered) OTU table that has rows for samples and columns for OTUs; the OTU table typically contains the frequency (i.e., relative abundance) data or arcsinroottransformed frequency data. Since Y in either PERMANOVA or the LDM is columncentered and treated as the response of a linear model, we also assume the design matrix X is columncentered.
With no loss of generality, we can write the element of Y in the ith row and jth column as
where s(i) is the set that the ith sample belongs to. Thus, \(\overline {Y}_{s(i),j}\) is the setlevel average of Y_{i,j} and δ_{i,j} is the deviation of the ith sample from the setlevel average. The rationale of a matchedset design is that we wish to treat \(\overline {Y}_{s(i),j}\) characterizing a set as a nuisance parameter and focus the testing efforts on δ_{i,j}s. With this in mind, we note that \(X_{k}^{\mathrm {T}}Y\) is a function of only the δ_{i,j}s (i.e., not a function of the \(\overline {Y}_{s(i),j}\)s) whenever the column values of X_{k} sum to zero for each set of samples belonging to the same set. It is clear that this occurs whenever the columns of X_{k} are orthogonal to the set of indicator variables corresponding to the set IDs. Therefore, our proposed strategy for fitting matchedset data is to introduce an indicator variable for each set to be included in submodel X_{1} along with any samplelevel confounding covariates that are not matched on. Note that any setlevel confounders are automatically controlled for in this strategy, as they can be written as linear combinations of the indicator variables generated by the set IDs. Indeed, it is typical of matchedset analyses that the effect of variables that have been matched on (i.e., that are constant in each set) cannot be determined (see, e.g., [10]).
To see how this works in practice, consider a simple example with two sets, the first having two samples and the second having three samples. For clarity, we work with X_{k}s before orthonormalization and show X_{1} (which has the indicator variables for the two sets) and X_{2} (which has a casecontrol status) before column centering:
After column centering (i.e., subtracting column means), X_{2}=(3/5,−2/5,3/5,−2/5,−2/5)^{T}. Note that the values in X_{2} do not sum to zero within each set. If we constructed a test using this contrast, the setspecific means \(\overline {Y}_{1,j}\) and \(\overline {Y}_{2,j}\) would not be eliminated. However, if we make X_{2} (before column centering) orthogonal to the columns of X_{1} (which automatically achieves column centering), we find X_{2}=(1/2,−1/2,2/3,−1/3,−1/3)^{T}, where we see that the values in X_{2} sum to zero within each set.
We identify a condition under which the nuisance parameters disappear even without projecting off the set ID. We say that a variable in matchedset data is balanced if the sum of the variable within each set is proportional to the number of its samples (with the same constant of proportionality). For example, a casecontrol status is balanced if all sets have as many case as control samples, or if some sets have two case and four control samples and the remaining sets have one case and two control samples. For a balanced variable, column centering alone is sufficient to make the values of that variable sum to zero within each set, even without projecting off the set ID. Note that adjusting for samplelevel covariates can result in imbalance in a variable, even if it was initially balanced; in this case, projection on the set ID is required to restore balance. A simple example with two sets, each contributing two samples along with a samplelevel covariate, shows this. Before column centering (and orthonormalization), suppose the covariate is X_{1}=(9,8,6,9)^{T} and the casecontrol status is X_{2}=(1,0,1,0)^{T}. After column centering, we have
In the absence of the covariate, X_{2} after column centering do sum to zero within each set; however, after adjusting for the covariate, we have X_{2}=(2/3,−1/2,1/6,−1/3)^{T}, which does not eliminate the setspecific means. If we also adjust for the set ID by augmenting X_{1} with the columncentered indicator
we obtain X_{2}=(3/5,−3/5,1/5,−1/5)^{T}, in which values do sum to zero within sets. Finally, note that in this example we have considered a binary casecontrol trait; it should be clear that, for a continuous trait, the withinset sum is unlikely to be the same for each set, and hence, the projection on the set ID is required to eliminate the nuisance parameters.
As we have assumed the samples in each set are exchangeable, we propose to perform restricted permutation among samples from the same set. As permuting residuals of Y in the Freedman and Lane scheme [11] is typically equivalent to permuting X_{k}s [9], the restricted permutation refers to permuting the (orthonormalized) traits of interest within samples from the same set. The same permutation scheme can be used for testing the interactions between traits of interest or between traits and setlevel covariates; the latter allows us to detect whether the associations between the microbiome and the traits of interest are homogeneous across study groups (which can be defined by the setlevel covariates). As noted previously, when all variables are balanced, the columns of X (excluding the set ID indicator vectors) will automatically be orthogonal to the set ID indicator vectors. Since both permanovaFL and the LDM permute the rows of X, it is also clear that this orthogonality holds for every permutation as long as permutations are conducted within sets. As a result, the pvalues for permanovaFL or the LDM will be identical with and without adjustment of the set ID in this situation, as long as the restricted permutation is performed.
Results
Simulation studies
To generate our simulation data, we used the same motivating dataset as Hu and Satten [9], i.e., data on 856 OTUs of the upperrespiratorytract microbiome first described by Charlson et al. [12]. In most simulations, we considered a binary trait such as casecontrol status, but we also considered matched sets with a continuous trait. We defined a “causal” OTU to have frequency that depended on the trait of interest. We considered on two complementary causal mechanisms: the first mechanism (S1) assumed that half (428) of the OTUs (after excluding the three most abundant OTUs) were causal; the second mechanism (S2) assumed the ten most abundant OTUs were causal. In each scenario, we randomly partitioned the causal OTUs into two equalsize subsets, \(S^{\text {trait}}_{}\) and \(S^{\text {trait}}_{+}\), to contain OTUs with decreased and increased frequencies, respectively, in cases relative to controls. We further partitioned \(S^{\text {trait}}_{+}\) into \(S^{\text {trait}}_{+a}\) and \(S^{\text {trait}}_{+m}\) comprised of OTUs whose frequencies are increased in additive and multiplicative manners, respectively. For the simple situation, with no covariates but the trait of interest, we simulated data for the ith set using the following steps.

1.
We assigned trait values \(X_{ij}^{\text {trait}}\) to the jth sample of the ith set. For matched pair samples, \(X_{ij}^{\text {trait}}=0\) was assigned to control samples and \(X_{ij}^{\text {trait}}=1\) to case samples; for continuous traits, \(X_{ij}^{\text {trait}}\) was sampled from the U[0,1] distribution.

2.
We generated the mean OTU frequencies \(\overline {\pi }_{i}\) for set i from the Dirichlet distribution \(Dir(\overline {\pi }, \theta _{1})\), where the mean parameter \(\overline {\pi }\) and overdispersion parameter θ_{1} took the values of the estimated mean and overdispersion (0.02) in the DirichletMultinomial (DM) model fitted to the upperrespiratorytract data. Note that \(\overline {\pi }\) and θ_{1} characterize the population mean of OTU frequencies and betweenset heterogeneity.

3.
Given \(\overline {\pi }_{i}\), we generated the baseline OTU frequencies \(\pi ^{(0)}_{ij}\) for sample j of set i from the Dirichlet distribution \(Dir(\overline {\pi }_{i}, \theta _{2})\), where θ_{2} was set to 0.007, which was the median of the estimated overdispersion in the DM model that was fitted to data for each set with three samples in the MsFLASH study (see the “Analysis of the MsFLASH data” section). Note that θ_{2} characterizes heterogeneity among samples from the same set, and \(\pi ^{(0)}_{ij}\) represents the (true) OTU frequencies we would see when trait \(X_{ij}^{\text {trait}}=0\).

4.
We then generated the (true) OTU frequencies that account for a nonzero effect of trait \(X_{ij}^{\text {trait}}\), denoted \(\pi ^{\text {trait}}_{ij}\), by reducing the frequency of each OTU in \(S^{\text {trait}}_{}\) by a factor of β, then distributing half of the total reduced frequency evenly to OTUs in \(S^{\text {trait}}_{+a}\) and the other half to OTUs in \(S^{\text {trait}}_{+m}\) in proportion to their baseline frequencies in \(\pi _{ij}^{(0)}\). We then formed the (true) OTU frequency for the jth sample from the ith set using \(\pi _{ij}=(1X_{ij}^{\text {trait}})\pi ^{(0)}_{ij}+X_{ij}^{\text {trait}}\pi ^{\text {trait}}_{ij}\). Note that β characterizes the effect size of the trait, i.e., the amount by which OTU frequencies vary at the causal OTUs when the trait \(X_{ij}^{\text {trait}}=1\).

5.
We generated read count data for each sample using the multinomial distribution MN(π_{ij},N_{ij}), where the total read count N_{ij} was generated from the Poisson distribution with mean 10,000 (and set to 500 if the Poisson sampling resulted in a value less than 500).
To induce the effects of additional covariates, we made further modifications to \(\overline {\pi }_{i}\) and/or π_{ij} that were similar to the modifications made to \(\pi ^{(0)}_{ij}\) to construct \(\pi ^{\text {trait}}_{ij}\). For simulations where we wished to include a main effect of a setlevel covariate \(X^{\text {set}}_{i}\), we first sampled values of \(X^{\text {set}}_{i}\) from a Bernoulli distribution with parameter 0.5. We then uniformly sampled 428 OTUs to be associated with the covariate and randomly partitioned them into two equalsize subsets \(S^{\text {set}}_{}\) and \(S^{\text {set}}_{+}\). We then constructed \(\overline {\pi }^{\text {set}}_{i}\) by modifying \(\overline {\pi }_{i}\), reducing the frequency of each OTU in \(S^{\text {set}}_{}\) by a factor of β^{set}=0.2 and distributing the total reduced frequency to OTUs in \(S^{\text {set}}_{+}\) in proportion to their original frequencies in \(\overline {\pi }_{i}\). We then replaced \(\overline {\pi }_{i}\) by \(\left (1X^{\text {set}}_{i}\right)\overline {\pi }_{i} +X^{\text {set}}_{i}\overline {\pi }^{\text {set}}_{i}\), to be used in Step 3.
To account for a samplelevel confounder \(X^{\text {sam}}_{ij}\), we first sampled \(X^{\text {sam}}_{ij}\) from a Bernoulli distribution with parameter \(\left (0.20.1X_{ij}^{\text {trait}}\right)\). We then uniformly sampled 428 OTUs to be associated with the covariate and randomly partitioned them into two equalsize subsets \(S^{\text {sam}}_{}\) and \(S^{\text {sam}}_{+}\). We then constructed \(\pi ^{\text {sam}}_{ij}\) by modifying π_{ij} in the same way that \(\overline {\pi }^{\text {set}}_{i}\) was modified from \(\overline {\pi }_{i}\), but with a factor of β^{sam}=0.5. We then replaced π_{ij} by \((1X^{\text {sam}}_{ij})\pi _{ij} + X^{\text {sam}}_{ij}\pi ^{\text {sam}}_{ij}\). The resulting values were used in Step 5.
Finally, to account for an interaction between a setlevel covariate and the trait, we sampled a third set of OTUs (a random sample of 428 OTUs under S1 and the top 1–5 and 11–15 most abundant OTUs under S2) to be associated with the interaction, and randomly partitioned them into two equalsize subsets \(S^{\text {int}}_{}\) and \(S^{\text {int}}_{+}\). Then, when both \(X^{\text {set}}_{i}=1\) and \(X_{ij}^{\text {trait}}=1\), we further modified π_{ij} by reducing the frequency of OTUs in \(S^{\text {int}}_{}\) by a factor β^{int} and then distributing this extra mass to \(S^{\text {int}}_{+}\) in proportion to the OTU frequencies in π_{ij}. The resulting values of π_{ij} were then used in Step 5. Note that whenever we included an interaction term like this, the main effect of \(X^{\text {set}}_{i}\) (β^{set}=0.2) and \(X^{\text {trait}}_{ij}\) (β=0.5) was also included as described previously.
We evaluated the performance of different strategies and methods in seven scenarios of matchedset data: (1) matchedpair data, (2) unbalanced data, (3) matchedpair data with a samplelevel confounder, (4) matchedpair data with a setlevel covariate, (5) unbalanced data with a setlevel covariate, (6) matchedpair data with a continuous trait, and (7) matchedpair data with an interaction effect. To facilitate comparison across scenarios, the same sets of causal OTUs (S−trait,S+trait), and covariateassociated OTUs \((S^{\text {set}}_{},S^{\text {set}}_{+}), (S^{\text {sam}}_{},S^{\text {sam}}_{+})\) and (S−int,S+int) (if called for) were used for all scenarios. For each scenario except (2), (5), and (6), we generated data for 50 1:1 matched pairs (with a binary trait); for scenarios (2) and (5) with unbalanced data, we generated data for 25 1:1 matched pairs and 25 1:2 matched sets (with a binary trait); for scenario (6), we generated data for 50 matched pairs with a continuous trait.
We also explored various 1:m matched study designs to assess the performance under varying conditions. First, we compared the design that collected 50 1:1 matched pairs with the design that collected 50:50 independent casecontrol samples (first simulating pairs and then selecting only one sample from each pair), over varying values for the withinset heterogeneity θ_{2}. Second, we compared different 1:m matchedset designs with a fixed total of 90 samples. Specifically, we considered m=1, 2, 4, and 5 and collected 45 1:1 pairs, 30 1:2 sets, 18 1:4 sets, and 15 1:5 sets, respectively, to form each dataset. We also considered m=3 and collected 22 of 1:3 sets and 1 pair (to meet the total sample size 90) for the 1:3 design. Lastly, we compared different 1:m (m=1,2,3,4,5) designs when fixing the total number of sets to 50.
We applied PERMANOVA (implemented in both permanovaFL and adonis2) and the LDM with the proposed strategy (adjusting for the set ID and samplelevel covariates if present, not adjusting for setlevel covariates, and performing restricted permutation within sets). PERMANOVA tests were calculated using the BrayCurtis distance unless otherwise noted. We report LDM results for the omnibus test that combines the test results from raw frequency (relative abundance) data and arcsinroottransformed frequency data [9]. For testing individual OTUs, we compared the LDM with the proposed strategy to the following alternative methods: LDM without adjusting for the set ID, LDM without performing restricted permutation, DESeq2 (adjusting for set ID), LMM (applied to arcsinroottransformed relative abundance data), ZIBR when it is applicable (i.e., for data with equal number of samples in each set), and the Wilcoxon signedrank test when it is applicable (i.e., for matchedpair data). We evaluated the type I error and power for the communitylevel (global) test of any microbiome effect at nominal significance level 0.05, and we assessed empirical sensitivity (proportion of truly associated OTUs that were detected) and empirical FDR for the OTU tests at a nominal FDR of 10%. Results for type I error were based on 10,000 replicates; all other results were based on 1000 replicates. OTUs having fewer than 5 nonzero entries were removed before analysis.
Simulation results
Results on type I error for the seven scenarios we considered are summarized in Table 1. The results of power, sensitivity, and FDR for the seven scenarios were displayed in Figs. 1, 2, 3, 4, 5, 6, and 7, respectively. In all scenarios, our proposed strategy, when applied to either permanovaFL or the LDM, yielded correct type I error and the highest power compared to alternative strategies; adonis2 with the proposed strategy produced slightly conservative type I error and slightly lower power compared to permanovaFL. The LDM using the proposed strategy always controlled the FDR and achieved the highest sensitivity compared to the LDM using alternative strategies or DESeq2 when it controlled the FDR or Wilcoxon if it is applicable. The ZIBR method always yielded highly inflated FDRs. With a binary trait, the LMM always resulted in conservative FDR and diminished sensitivity compared to the LDM with the proposed strategy; with a continuous trait, conversely, it led to inflated FDRs.
For (1) the matchedpair data, permanovaFL and the LDM not adjusting for the set ID produced identical results to their counterparts using the proposed strategy as expected. Note that pvalues from adonis2 were not identical with and without adjustment for set ID, but the type I error and power (Fig. 1) of the two strategies were very similar. Here and for all datasets with a binary casecontrol trait, the strategy of performing unrestricted permutation led to conservative type I error and FDR and diminished power and sensitivity (Figs. 1, 2, 3, 4, 5, and 7) when applied to permanovaFL and the LDM, but inflated type I error when applied to adonis2.
For (2) the unbalanced data, the LDM not adjusting for the set ID yielded correct type I error but diminished power and sensitivity relative to its counterpart using the proposed strategy (Fig. 2). The same pattern can be seen in the results of permanovaFL and adonis2.
For (3) the matchedpair data with a samplelevel confounder, permanovaFL, adonis2, and the LDM not adjusting for the confounder had inflated type I error (0.063∼0.080), indicating that we have indeed induced some confounding effect in the data. In the presence of such a confounding effect, permanovaFL and the LDM not adjusting for the set ID (even after adjusting for the confounder) had inflated type I error (0.071∼0.087). In this case, not adjusting for the set ID did not just affect the power, but also affected the validity. These methods with inflated type I error were not included in Fig. 3. In contrast, adonis2 not adjusting for the set ID had more conservative type I error than that adjusting for the set ID (both adjusting for the confounder), so the former had reduced power compared to the latter.
Our proposed strategy was robust to the presence of setlevel covariates. For (4) the matchedpair data with a setlevel covariate, whether or not adjusting for the covariate or the set ID (i.e., the first three strategies in scenario (4) of Table 1) all yielded identical results when applied to permanovaFL or the LDM, as we have analytically shown. Thus, Fig. 4 only displays their results for the proposed strategy. When applied to adonis2, the three strategies led to slightly different type I error and power. For (5) the unbalanced data with a setlevel covariate, the LDM not adjusting for the set ID generated correct type I error but diminished power and sensitivity compared to its counterpart that adjusted for the set ID (both not adjusting for the covariate) (Fig. 5). Adjusting for the covariate but not the set ID failed to recover any power or sensitivity, which underscored the importance of adjusting for the set ID. The same pattern can be seen in the results of permanovaFL and adonis2.
In the presence of a continuous trait, even in (6) the simplest matchedpair data without any covariates, permanovaFL, adonis2, and the LDM not adjusting for the set ID all yielded inflated type I error. The strategy of performing unrestricted permutation led to highly inflated type I error, which is the opposite of its performance in testing a binary trait.
For testing (7) the interaction in matchedpair data, the strategy of performing unrestricted permutation yielded extremely conservative type I error. Figure 7 confirms the lack of power with unrestricted permutation.
The power and sensitivity of various 1:m matched study designs were contrasted in Figs. 8, 9, and 10. Figure 8 shows that the matchedpair design always gained substantial efficiency over an analysis of data from an equivalent number of independent cases and controls over a wide range of θ_{2} values. Figure 9 shows that, with a fixed number of total samples, maximizing the number of distinct sets (i.e., using 1:1 pairs) rather than increasing the number of controls per set optimized efficiency. In Fig. 10, we show that adding more control samples to each set, while keeping the number of sets fixed, has a relatively small effect on power and sensitivity; the addition of each successive control sample yielded diminishing returns. Taken together, Figs. 8, 9, and 10 suggest that when data have a matched structure, a matched analysis outperforms an unmatched analysis and, in general, increasing the number of controls in a 1:m matched study beyond 1:2 may only result in fairly small improvements in power and sensitivity.
Analysis of the MsFLASH data
The data for our first example were extracted from the study “Menopause Strategies: Finding Lasting Answers for Symptoms and Health” (MsFLASH) [13, 14]. This doubleblinded, randomized trial enrolled women into one of three arms: oral estradiol (arm 1), oral venlafaxine (arm 2) (two commonly used drugs to alleviate menopausal hot flashes), or placebo (arm 3). To examine the effect of these drugs on the vaginal microbiome, 113 vaginal swab samples were collected at baseline (before treatment) and at weeks 4 and 8 posttreatment. 16S rRNA gene sequencing was performed, and the results were summarized into 171 OTUs. Specifically, 9 sets (women) in the estradiol arm, 10 in the venlafaxine arm, and 18 in the placebo arm have data from swab samples at all three visits; one woman in the estradiol arm only provided samples at baseline and week 4. Due to the small sample size, we also considered an enlarged “treatment” group that combined the estradiol and venlafaxine arms. The ordination plot (Fig. 11) showed that the samples from the same woman tended to cluster together.
In each arm, we tested whether the composition of the vaginal microbiome changed between baseline and week 4, baseline and week 8, and weeks 4 and 8; each of these tests was based on 1:1 paired data. We also tested the microbiome differences pre and posttreatment by comparing baseline and posttreatment (both week 4 and week 8) samples without differentiating between time since treatment using a 1:2 matchedset design; the estradiol arm was an exception, as one set had only two samples, resulting in unbalanced data. We applied the LDM (using the omnibus test) and permanovaFL (using the BrayCurtis distance) with the proposed strategy. As a comparison, we also applied DESeq2 (adjusting for the set ID) and the Wilcoxon signedrank test to 1:1 matched data.
We limited our analysis for each arm to OTUs that were present at least 5 times in each of the four subsets of samples, which resulted in, for example, 31 OTUs in the venlafaxine arm. All results were summarized in Table 2. Only the comparisons within the venlafaxine arm yielded some significant pvalues (<0.05). In particular, the LDM generated pvalue 0.033 for comparing the baseline and week 4 samples, followed by a smaller pvalue 0.0042 for the baseline and week 8 samples, and then the smallest pvalue 0.0003 for the baseline and the combined week 4 and week 8 samples. These pvalues suggested an effect of venlafaxine on the vaginal microbiome, which was strengthened at week 8 relative to week 4. However, the differences between week 4 and week 8 were not found to be significant (the LDM pvalue = 0.76). The results of permanovaFL corroborated these conclusions. In the comparison of the baseline vs. weeks 4 and 8, the LDM detected four OTUs (Campylobacter, Gardnerella vaginalis, Porphyromonas, and Aerococcus christensenii) to be differentially abundant at the nominal FDR 20% (we chose a relatively high nominal FDR because of the small number of sets), whereas DESeq2 detected none and the Wilcoxon test was not applicable.
Motivated by the likely trend of strengthened effect of venlafaxine over time, we reanalyzed the data at weeks 0, 4, and 8 in the venlafaxine arm, treating “week” as a quantitative variable. However, this analysis yielded less significant global pvalues (0.043 by the LDM and 0.096 by permanovaFL), suggesting that the change in OTU frequencies as a function of time since treatment initiation is probably nonlinear. We also tested whether the effect of venlafaxine is the same for the five white and four black women (excluding one women in the “other” race category), i.e., we tested the interaction between week (coded as 0 vs. 4 & 8) and race. The global pvalues are 0.44 by the LDM and 0.39 by permanovaFL, suggesting no racial difference in the effect of vanlafaxine. These nonsignificant pvalues may also be due to the generally low power for testing interactions.
Analysis of the Alzheimer’s disease data
The data for our second example were generated from a pairmatched study comparing the gut microbiome of 25 patients with Alzheimer’s disease (AD) and their age and sexmatched controls [15]. A covariate of particular interest was the APOE ε4 genotype, which was coded as carriers (one or two ε4 alleles) vs. noncarriers (zero ε4 alleles). APOE ε4 genotype is a potential confounder of the association between the gut microbiome and AD, as it is distributed differently in the AD patients than in the controls (AD, 72% carriers; control, 20% carriers; pvalue <0.001) in the study sample, and has been found to influence the gut microbiome [16]. Since matching on APOE ε4 genotype was not used in the study design, it should be adjusted for in the association test. The microbiome data were summarized into 972 OTUs, of which 723 were present at least 5 times in the study sample and included in our analysis. We applied the same methods as those for the MsFLASH data, except we do not report results for the Wilcoxon signedrank test, which is not applicable in the presence of a withinpair covariate.
The results are summarized in Table 3. Without adjustment of the APOE ε4 genotype, the LDM yielded pvalue 0.0001 for testing the communitylevel association and detected 66 OTUs (at nominal FDR 10%) that were differentially abundant between AD patients and controls. After adjustment for APOE ε4 genotype, the LDM yielded pvalue 0.0159 and detected no OTUs. The results of permanovaFL corroborated this conclusion. These results suggest that much of the association seen without adjusting for APOE ε4 genotype is due to confounding.
Discussion
We have developed a novel strategy that extends PERMANOVA (implemented in both adonis2 and permanovaFL) and the LDM for analyzing matchedset microbiome data that can account for complex design features such as unbalanced data, samplelevel confounding covariates, and continuous traits of interest. This strategy corresponds to a specific application of PERMANOVA and the LDM, without modifying any of their methodologies. Our simulations show that the proposed strategy was the most efficient among all strategies we considered, when applied to either PERMANOVA or the LDM. The LDM was also superior to existing methods, such as DESeq2 and the Wilcoxon signedrank test, for testing individual OTUs with matchedset data. In addition, our simulation studies suggested that the 1:1 matchedpair study is the most efficient design as it maintains a good balance between sequencing cost and statistical power.
Our results in analysis of the MsFLASH data did not agree with those reported by Zhao et al. [2], who found significant effects in the “treated” group only (rather than the venlafaxine arm). Their method was based on logratiotransformed frequency data and used a pseudo count value of 0.01 for zero count data, which essentially resulted in a different hypothesis being tested than that used in our methods. Similarly, we found that much of the association reported by [15] between AD disease status and the microbiome may be due to confounding by APOE ε4 genotype. This finding emphasizes the need to develop and use microbiome methods, such as those we have reported here, that can account for complex design features, like matching with withinset confounding covariates, that are often found in epidemiological studies involving the microbiome.
Hu and Satten [9] have shown that for independent casecontrol samples, the power of the LDM was sensitive to the OTU data scale, i.e., if untransformed frequency scale or arcsinroottransformed data were used. We found (Figure S1) that these patterns persisted in the analysis of matchedset data. As a result, we reiterate the recommendation in [9] and use the omnibus test for the LDM, which corresponds to the minimum of the pvalues obtained on the frequency and arcsinroottransformed scales.
The strategy we have proposed here is applicable to any matchedset microbiome data as long as model residuals can be assumed to have an exchangeable correlation structure. In some settings, longitudinal microbiome data that have timevarying traits (i.e., time, antibiotic intake, or infection) can be reasonably assumed to have an exchangeable correlation structure. The simple withincluster permutation approach used here is not valid for other correlation structures such as the autoregressive model. We are currently developing methods for analysis of clustered or longitudinal microbiome data having an arbitrary residual correlation structure.
Our simulation studies showed that matchedset sampling, when available, can result in a substantial increase in power to detect global associations and sensitivity to detect individual OTUs when our approach is used. This is presumably because the overdispersion parameter for the matched data is smaller than it is for independent data sampled from the same population. In the independent data sample, the overdispersion parameter describing each observation is effectively the sum of the between and withinset heterogeneity parameters (θ_{1} and θ_{2} in our simulations). In the matched data, the betweenset heterogeneity (represented by θ_{1} in our simulations) is effectively conditioned out. Thus, we expect the advantage of a matched analysis over an unmatched analysis to increase as the betweenset heterogeneity increases. Presumably when the withinset heterogeneity is large compared to the betweenset heterogeneity, a matched analysis would have a smaller advantage.
Conclusions
We proposed a new strategy, i.e., including set indicator variables as covariates and permuting within sets, that can be used with both PERMANOVA and the LDM for analyzing matchedset microbiome data. These methods not only have superior performance than existing methods but can also handle many complex design features in matchedset studies such as unequal set sizes, withinset confounding covariates, and continuous traits of interest. Given the availability of proper analytical tools, future microbiome studies should preferably adopt the matchedset design to enjoy its good power as the large microbiome heterogeneity as well as most confounding factors between sets (e.g., individuals) are conditioned out.
Availability of data and materials
The R package LDM is available on GitHub at https://github.com/yijuanhu/LDM in formats appropriate for Macintosh, Linux, or Windows systems.
Declarations
References
 1
Shi P, Li H. A model for pairedmultinomial data and its application to analysis of data on a taxonomic tree. Biometrics. 2017; 73(4):1266–78.
 2
Zhao N, Zhan X, Guthrie KA, Mitchell CM, Larson J. Generalized Hotelling’s test for paired compositional data with application to human microbiome studies. Genet Epidemiol. 2018; 42(5):459–69.
 3
La Rosa PS, Warner BB, Zhou Y, Weinstock GM, Sodergren E, HallMoore CM, Stevens HJ, Bennett WE, Shaikh N, Linneman LA, Hoffmann JA, Hamvas A, Deych E, Shands BA, Shannon WD, Tarr P. Patterned progression of bacterial populations in the premature infant gut. Proc Natl Acad Sci. 2014; 111(34):12522–125274151715.
 4
Bokulich NA, Dillon MR, Zhang Y, Rideout JR, Bolyen E, Li H, Albert PS, Caporaso JG. q2longitudinal: longitudinal and pairedsample analyses of microbiome data. MSystems. 2018; 3(6):e0021918.
 5
Vatanen T, Franzosa EA, Schwager R, Tripathi S, Arthur TD, Vehik K, Lernmark Å., Hagopian WA, Rewers MJ, She JX, et al.The human gut microbiome in earlyonset type 1 diabetes from the TEDDY study. Nature. 2018; 562(7728):589–94.
 6
Chen EZ, Li H. A twopart mixedeffects model for analyzing longitudinal microbiome compositional data. Bioinformatics. 2016; 32(17):2611–26175860434.
 7
Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNAseq data with DESeq2. Genome Biol. 2014; 15(12):550.
 8
McArdle BH, Anderson MJ. Fitting multivariate models to community data: a comment on distancebased redundancy analysis. Ecology. 2001; 82(1):290–7.
 9
Hu YJ, Satten GA. Testing hypotheses about the microbiome using the linear decomposition model (LDM). Bioinformatics. 2020; 36(14):4106–4115. https://doi.org/10.1093/bioinformatics/btaa260.
 10
Breslow NE, Day NE, Davis W, et al.Statistical methods in cancer research: volume 1the analysis of casecontrol studies, vol. 32. France: IARC; 1980.
 11
Freedman D, Lane D. A nonstochastic interpretation of reported significance levels. J Bus Econ Stat. 1983; 1(4):292–8.
 12
Charlson ES, Chen J, CustersAllen R, Bittinger K, Li H, Sinha R, Hwang J, Bushman FD, Collman RG. Disordered microbial communities in the upper respiratory tract of cigarette smokers. PloS ONE. 2010; 5(12):15216.
 13
Mitchell CM, Srinivasan S, Zhan X, Wu MC, Reed SD, Guthrie KA, LaCroix AZ, Fiedler T, Munch M, Liu C, et al.Vaginal microbiota and genitourinary menopausal symptoms: a crosssectional analysis. Menopause. 2017; 24(10):1160–6.
 14
Joffe H, Guthrie KA, LaCroix AZ, Reed SD, Ensrud KE, Manson JE, Newton KM, Freeman EW, Anderson GL, Larson JC, et al.Lowdose estradiol and the serotoninnorepinephrine reuptake inhibitor venlafaxine for vasomotor symptoms: a randomized clinical trial. JAMA Intern Med. 2014; 174(7):1058–66.
 15
Vogt NM, Kerby RL, DillMcFarland KA, Harding SJ, Merluzzi AP, Johnson SC, Carlsson CM, Asthana S, Zetterberg H, Blennow K, et al.Gut microbiome alterations in Alzheimer’s disease. Sci Rep. 2017; 7(1):13537.
 16
Tran TT, Corsini S, Kellingray L, Hegarty C, Le Gall G, Narbad A, Müller M, Tejera N, O’toole PW, Minihane AM, et al.Apoe genotype influences the gut microbiome structure and function in humans and mice: relevance for Alzheimer’s disease pathophysiology. FASEB J. 2019; 33(7):8221–31.
Acknowledgements
The authors would like to acknowledge the Wisconsin ADRC’s biostatistical support provided by the Data Management and Biostatistics Core.
Funding
This research was supported by the National Institutes of Health awards R01GM116065 (Hu). The Alzheimer’s disease data were provided by the Wisconsin Alzheimer’s Disease Research Center, which is supported by the National Institutes of Health awards P50AG033514. The MsFLASH data were provided by the Fred Hutchinson Cancer Research Center (MsFLASH Network), which is supported by the National Institute on Aging grant 5R01AG048209.
Author information
Affiliations
Contributions
YJH conceived the study, developed the method, performed simulation studies, analyzed the data, and wrote the manuscript. GAS conceived the study, developed the method, and wrote the manuscript. ZZ performed simulation studies and analyzed the data. CM provided the biological insights and interpretation for the results from analyzing the MsFLASH data. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
This study only involved secondary analyses of existing, deidentified datasets; as such, it does not require separate IRB consent.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
Simulation results for the matchedpair data of scenario (1). The LDM on frequency scale (freq), arcsinroottransformed frequency scale (arcsin), and the omnibus test (omni) are shown. permanovaFL based on the BrayCurtis (BC), weighted UniFrac (WU), and Hellinger (H) distances are shown. All methods adopted the proposed strategy.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zhu, Z., Satten, G.A., Mitchell, C. et al. Constraining PERMANOVA and LDM to withinset comparisons by projection improves the efficiency of analyses of matched sets of microbiome data. Microbiome 9, 133 (2021). https://doi.org/10.1186/s40168021010349
Received:
Accepted:
Published:
Keywords
 Microbiome association test
 Differentially abundant
 Clustered data
 Paired data
 Unbalanced data