\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$W_{d}^{*}$\end{document}Wd∗-test: robust distance-based multivariate analysis of variance

Background Community-wide analyses provide an essential means for evaluation of the effect of interventions or design variables on the composition of the microbiome. Applications of these analyses are omnipresent in microbiome literature, yet some of their statistical properties have not been tested for robustness towards common features of microbiome data. Recently, it has been reported that PERMANOVA can yield wrong results in the presence of heteroscedasticity and unbalanced sample sizes. Findings We develop a method for multivariate analysis of variance, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$W_{d}^{*}$\end{document}Wd∗, based on Welch MANOVA that is robust to heteroscedasticity in the data. We do so by extending a previously reported method that does the same for two-level independent factor variables. Our approach can accommodate multi-level factors, stratification, and multiple post hoc testing scenarios. An R language implementation of the method is available at https://github.com/alekseyenko/WdStar. Conclusion Our method resolves potential for confounding of location and dispersion effects in multivariate analyses by explicitly accounting for the differences in multivariate dispersion in the data tested. The methods based on \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$W_{d}^{*}$\end{document}Wd∗ have general applicability in microbiome and other ‘omics data analyses. Electronic supplementary material The online version of this article (10.1186/s40168-019-0659-9) contains supplementary material, which is available to authorized users.


Introduction
Beta diversity analyses or community-wide ecological analyses are important tools for understanding the differentiation of the entire microbiome between experimental conditions, environments, and treatments. For these analyses, specialized distance metrics are used to capture the multivariate relationships between each pair of samples in the dataset. Analysis of variance-like techniques, such as PERMANOVA [1], maythen be used to determine if an overall difference exists between conditions. The distances use all of the measured taxa information simultaneously without the need to explicitly estimate individual covariances. The utility of these methods is hard to underestimate as virtually every recent major microbiome report has used some form of a community-wide association analysis. On many occasions, the comparison reveals major differences between the groups. However, one is not guaranteed to find one. For example, in Redel et al. [2], the authors have found that there are significant differences in cutaneous microbiota in diabetic vs. non-diabetic subject feet, but not on their hands (see fig. 5). This lack of difference is an important indicator about the potential pathobiological processes that lead to diabetic foot ulcers. Therefore, getting the correct result in such comparisons is important. The Redel et al. analysis can ultimately be achieved by pairwise comparisons only (diabetic vs. non diabetic); however, many study designs have more than two groups that need to be considered simultaneously. Dietary intervention studies among others often include several experimental groups. For example, Cox et al. [3] analysis of the impact of diet on the murine gut microbiome included animal groups receiving low fat, high fat, and high fat with fiber supplement diets. Although it is possible to treat such design using multi-way comparisons of dietary fat and dietary fiber, a simultaneous analysis of all three groups can be more intuitive. Hence, there is a need for methods that can compare more than two experimental groups at the same time. PERMANOVA among other methods allows for such analyses.
From the statistical stand point, community-wide analyses test the hypothesis that the data from two or more conditions share the location parameter (centroid or multivariate mean). Caution, however, needs to be taken to ensure that potential violations of assumptions do not lead to adverse statistical behavior of PERMANOVA. Two such assumptions that are commonly violated are the multivariate uniformity of variability (homoscedasticity) and sample size balance. We have previously shown that simultaneous violation of both assumptions leads to PER-MANOVA analysis with indiscriminate rejection and type I error inflation or to significant loss of power up to inability to make any rejections at all [4]. Unfortunately, heteroscedasticity across conditions is a very common feature of microbiome data. Thus, new robust methods are needed to ensure correct data analysis.
We have previously described a T 2 w test, which presents a robust solution for comparing two groups of microbiome samples [4]. The two-group scenario is common, but not universally satisfying as many study designs often include many different sample types, e.g., from affected and unaffected sites of a study subject and from a matched healthy control [5] and interventions as in the Cox et al. [3] study mentioned above. Here we describe a further extension of T 2 w to allow for arbitrary number of groups with possibly different within group variability to be compared using an omnibus test for equality of means. Our method presents an advance to the stateof-the-art by introducing a way to compare data from multiple conditions where heteroscedasticity is a nuisance and only the differences between location of the data are important.

Univariate Welch MANOVA
Univariate solutions for a heteroscedastic test to compare k-means deal with finding asymptotic distributions for w j (x j −μ) 2 , as defined later in Eqs. (2) and (3). Welch's solution [6] is perhaps the most known and well adopted in statistical literature. Next we briefly describe it, as we will build on extending this statistic to multivariate data.
Suppose we observe data from k populations x j = with potentially unequal number of observations, n j for j = 1, . . . , k, in each. Letx j and s 2 j denote the means and variances for each sample. The Welch ANOVA statistic is: where The Welch test uses

Calculation of multivariate Welch W-statistic on distances
To derive a Welch W * statistic suitable for analysis of microbiome data, W * d , we follow the same approach as we did in our derivation of T 2 w . We first demonstrate that in the univariate case, W * d can be expressed in terms of sums of pairwise square differences. Next we observe that these sums represent the squares of the univariate Euclidean distances, which allows for a direct extension of the W * d statistic computation for multivariate Euclidean distances and in fact any arbitrary distance or dissimilarity metric. The derivation of the statistic in terms of dissimilarities makes it suitable for analysis of microbiome data via a permutation test.
We have previously shown [4] that the sample variances can be written as: where x (p) j and x (q) j denote p-th and q-th observations in the j-th level, d (j) pq is the distance between them. Hence: Now consider: Equation (16) means that j w j (x j −μ) 2 can be expressed as weighted sum of squares of pairwise intergroup mean differences, which makes for a convenient expression to compute. Finally, we have previously shown that squares of mean differences can be expressed in terms of squares of pairwise sample differences [4], i.e: where . The squares of the pairwise differences under the summations in Eq. (17) can be thought of as the squares of the pairwise Euclidean distances in one dimension. This allows us to generalize the univariate Euclidean Welch ANOVA to MANOVA with arbitrary distances, where the distances can be suitably defined for the data at hand, including all of common distances used with microbiome data.
Note that in contrast to the PERMANOVA statistic, the distance-based T 2 w and W * d explicitly account for potentially unbalanced number of observations and differences in multivariate spread in the two samples. Finally, observe that W * d reduces to T 2 w when k = 2, as W * reduces to Welch t-statistic.
As with T 2 w , the exact distribution of the multivariate distance-based W * d statistic is dependent on many factors, such as the dimensionality of underlying data, distributions of the random variables comprising the data, the exact distance metric used, and the number of groups compared k. To make a practical general test, we use permutation testing to establish the significance. To do so, we compute W * d (i) on m permutations of the original data, for i = 1, . . . , m, and estimate the significance as the fraction of times the permuted statistic is greater than or equal to W d , i.e., p = 1 Here, 1(.) designates the indicator function. Larger p values are more easily estimated with permutations as the number of more extreme permuted statistics will be quite large. For smaller, p values often, the precise p value is not necessary, but only an indication if it is smaller than a particular threshold (e.g., 0.01). As a rule of thumb, to conclude that a p value is less than a threshold α, we recommend performing at least 5/α permutations.
Confounder modeling and repeated measures are often key elements of microbiome study design. These can be accounted for in permutation testing procedures using restricted permutation. For example, the effect of a discrete valued confounder can be removed from the p value calculation by restricting permutations to only within the levels of the confounding variable. This amounts to an application of stratified analysis of variance. Similarly, restricting permutations to within individual subjects only results in a repeated measures analysis. Notice that the test statistic under restricted permutations remains the same, but the null distribution is changed to reflect the desired comparison. Methods for W * d and these restricted permutation methods are available in our reference implementation at https://github.com/alekseyenko/WdStar.
When multiple means are compared with W * d , a statistically significant result may prompt the question about attribution of the differences to a specific group or groups. Post hoc testing procedures are used to perform that kind of analysis. There are many possible ways to design the post hoc testing procedures, but the guiding principle due to potential for loss of power to multiple testing should be to minimize the number of tests performed. For this reason, in addition to all possible pairwise (one versus one) tests, it may be interesting and relevant to test one group versus all others. In this scenario, samples from one experimental group are compared to pooled samples from the remaining groups. The statistical test for this comparison can equivalently be either T 2 w or W * d on two level factors. We illustrate the use of one versus all post hoc procedure in our application example in "Application example: colorectal cancer disparity and microbiome" section and provide corresponding computation routines in our reference implementation.

Empirical evaluation of W * d type I error
The principal evaluation that is required to assure statistical properties of W * d is demonstration of appropriate type I error control. For this purpose, we consider the univariate heteroscedastic case with three groups, , k 1 = 1 . . . , n 1 , k 2 = 1 . . . , n 2 , and k 3 = 1 . . . , n 3 , of samples to compare, where n 1 , n 2 , n 3 are the numbers of observations in each group. We let ∼ N 0, s 4 be the groups with different variance s 2 and s 4 , respectively, to introduce heteroscedasticity. In our simulation, we let s 2 = {1, 0.8, 0.2} to control the degree of heteroscedasticity in the range from none to large. Finally, we let the sample sizes n 1 , n 2 , and n 3 take values of 5, 10, 20, or 40 to generate data with varying total sample size and degree of balance. For each combination of sample sizes and variance, we have performed 1000 simulations of the data for a total of 192,000 datasets. Each dataset has been analyzed using our reference implementation of W * d , PERMANOVA (adonis function in R library vegan), and univariate Welch ANOVA (oneway.test in R library stats). For distance-based methods, Euclidean distances have been used. Details of simulation are available as a knitted R Markdown file in Additional file 1. The simulation results comprise the fraction of rejected null hypotheses at α = 0.05 by each test (Fig. 1a). A test properly controlling the type I error is expected to have the fraction of rejections equal to the nominal error rate (0.05). Notice that the proposed W * d test, in fact, produces the expected error rates over the entire range of simulation parameters. Similarly to our previous observations in the two-group case, PERMANOVA is not robust to heteroscedasticity when sample size imbalance is present. Observe that whenever the number of observations in the reference group (the one with variance equal to 1) is smaller than that in the less dispersed groups, the fraction of rejections is overly inflated, resulting in higher type I error. Also notice that when there are more observations in the reference group than in others (e.g., n 1 = 40, n 2 , n 3 < 40), it is hard for PERMANOVA to make the rejections, resulting in approximately zero type I error.
Interestingly, when we compare the raw p values obtained from W * d to those from the distribution based asymptotic Welch test, we see a good concordance between the two (Fig. 1b). The variability around the trendline is most likely due to Monte Carlo error associated with permutation testing and small sample size. On the contrary, when PERMANOVA is compared to the distribution-based asymptotic test, the fit is clearly much noisier (Fig. 1c). The concordance is much smaller for tests involving groups with larger degree of heteroscedasticity. The code used to produce the plots in Fig. 1 is available as Additional file 2. Finally, given the equivalence of the W * d to T 2 w for k = 2, and the fact that the two-level test is powered similarly to PERMANOVA, we expect the test described in this paper to be of similar power for k > 2 as well. The full empirical evaluation of power characteristics for k > 2 is hard to achieve in non-superficial setups as most realistic simulation scenarios present an infinite universe for choice of parameters.

Application example: colorectal cancer disparity and microbiome
Extensive scientific literature suggests an important, yet not fully understood role of the intestinal microbiome in the development, progression, and treatment of colorectal cancers (CRC). Several genus level bacterial taxa have been associated with CRC [7] but the role of personal characteristics in influencing the presence of CRCassociated bacteria is not well understood. A few studies have noted marked differences in the microbial environment in the gut of African-Americans (AA) versus others [7][8][9][10][11] (e.g., Caucasian (CA)) and suggested differences in microbial composition among those with and without colorectal polyps and cancer. Others found distinct differences in the microbes populating the proximal and distal colo-rectum [12,13]. Lower socioeconomic status and western diet have been associated with a lower microbial diversity, especially in the distal colon [14,15]. Microbial signature approaches have been used for development of diagnostic biomarkers [9,[16][17][18] or assessing differences in immune gene expression [13]-highlighting the increasing importance of statistical methods to analyze clusters of microbes-genes while also taking into account patient-level variables. The role of the gut microbiome in CRC disparities is likewise poorly understood [19]. Here we use a pilot CRC dataset to demonstrate the utility of W * d in uncovering signals potentially missed due to heteroscedasticity.
The Medical University of South Carolina (MUSC) Institutional Review Board approved all study activities. The Cancer Registry at Hollings Cancer Center (HCC) at MUSC was used to identify all cases of CRC. The study population was comprised of a sample of histologically confirmed cases diagnosed between January 1, 2000, and June 30, 2015. Patients were of either AA or CA descent. For each case, we obtained a formalinfixed, paraffin-embedded tissue blocks from the MUSC Department of Pathology and Laboratory Medicine. DNA was extracted following standard protocols in the laboratory. Briefly, the colonic tissue was transferred to a tube containing lysis buffer (1% SDS, 1 mg/ml Proteinase K, LTE pH 8.0). The solution was incubated at 50 • C for 1 h, followed by phenol/chloroform extraction and ethanol precipitation. The quantity and quality of DNA was then determined by running a small aliquot on a 1% agarose gel and comparing it to a set of DNA standards. The extracted DNA was stored at − 80 • C. V3 and V4 regions of the 16S rRNA gene have been amplified using 16S Amplicon PCR Forward Primer = 5 -TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGC CTACGGGNGGCWGCAG 16S Amplicon PCR Reverse Primer = 5 -GTCTCGTGGGCTCGGAGATGTGTATAAG AGACAGGACTACHVGGGTATCTAATCC using KAPA HiFi enzyme. The library has been prepared using Nextera XT index kits and sequenced using MiSeq Reagent Kit v3 in a Miseq instrument. We have analyzed the genera previously reported in a systematic review to be associated with CRC [20]. Jensen-Shannon Divergence distances have been computed between the subjects of Caucasian and African-American races with cancers in distal and proximal locations of their colons. See Additional file 4 for the list of 14 genera retained for this analysis.
We selected a convenience sample from our MUSC cancer cohort of 20 patients (10 AAs, 10 CAs) which we matched on colonic location (proximal, distal) and sex. Of the 20 cases, 6 have been removed due to low sequence count (< 100) within the genera of interest. This resulted in groups with highly unbalanced number of subjects (Table 1). The location of the cancers correlate with outcomes, whereby distal results in the worst. Hence, to examine possible microbial clues into racial disparity the interaction of race and location is important. Due to extremely small pilot-scale sample size, the group unbalance and potential for heteroscedasticity prompt caution with using PERMANOVA for these comparisons (Fig. 2). Visually, we observe the two distal AA observations segregating to the extreme left of the other specimens. On the other hand the centroid of the CA distal observations is in the diagonal quadrant from the centroid AA distal observations. The proximal specimens are not separated on the first principal axis, but a moderate segregation can be observed along the second axis. Note that if primary effects only are considered in this case, the differences in the location of the cancers would not be plausibly different, as the corresponding centroids would lie well within each others' circles of inertia. These qualitative observations are suggestive that the interaction of race and location represented by the four groups of observations harbor differential microbiota, albeit the extemely small sample size may not be sufficient to deem the observed effect sizes significant. Nonetheless, the race and location interaction model achieves significance (p < 0.05) with W * d test ( Table 2). Observe that as expected the difference between the distal and proximal cancers alone is not significant, but in combination with race suggest existence of differential interaction. These nuances are not captured by the analyses with PERMANOVA, which yields unremarkable p values. Interestingly, there is a discrepancy in test results for the primary effect of the race at 0.05 significance threshold, which is well within the gray zone of being notable. However, the discrepancy of PERMANOVA and W * d in the interaction term is a clear illustration for utility of our method. In the presence of heteroscedasticity and sample size imbalance, one might doubt the result by PERMANOVA. Next, we demonstrate the application of the post hoc procedures described in the methods section. Significance of the interaction term may dictate additional questions about which groups differ from  (Table 3). As expected, these indicate a significant difference (p < 0.05) in the microbiome of the AA distal CRC samples from the rest, and a trend for difference of the Caucasian distal samples. Note that the interpretations of these results might differ if multiple comparison issues are taken into account. Due to the pilot nature of these data, we do not perform any formal corrections, as our goal is to determine the plausibility of significant differences, which are to be evaluated in appropriately sized datasets where power is not a concern. Epidemiological literature indicates that AA and CA have notable differences in the prevalence of colorectal neoplasia in the proximal and distal colorectum at both the precancerous [21][22][23][24] and invasive stages [25]. Numerous lifestyle and dietary factors associated with dysbiosis (e.g., red-meat intake, sedentary  [26][27][28][29][30]. A recent study reported that blacks compared to whites had a greater abundance of sulfidogenic bacteria in the normal colonic mucosa which correlated with higher intakes of fat, protein, and meat per day [31].
Overall, the racial differences we observed in microbial patterns in the CRCs by colonic location may reflect differences in modifiable lifestyle and dietary factors. The data and R Markdown for this application is included in Additional files 3 and 4.

Discussion and conclusion
Community-wide analyses where the entire microbiome is modelled as a response variable of one or more factors has become a standard first line of analysis technique in the field. These techniques address the question of overall aggregate changes in the microbiome in response to explanatory variables without the need to model each individual microbiome constituent. PERMANOVA [1] has been one of the most dominant tools for such analyses, although the potential for confounding of location and dispersion effects has been recognized for a long time [32,33]. The W * d method closes the gap by explicitly accounting for the differences in multivariate dispersion in the data tested, which has been shown to be associated with adverse statistical properties in PERMANOVA [4]. Current heteroscedasticity-aware methodologies allow for modeling multi-level factors, stratification, and multiple post hoc testing scenarios. Although in many applications the differences in statistical decisions made on the basis of PERMANOVA and W * d may remain unchanged, the principled guarantees of being correct in wider range of scenarios provided by the latter might be important for practitioners. Although originally developed for discretevalued covariates, PERMANOVA remains a viable analysis option for continuous covariates as well when multivariate regression-like formula are utilized [34]. However, the effect of heteroscedasticity has not been rigorously evaluated or addressed for such analyses. To be fair, heteroscedasticity with continuous covariates is an issue that does not have a generic statistical solution applicable in most cases. A more cautious analysis involving continuous covariates may require corroboration with discretized independent variables by W * d , but has to also account for potential statistical power issues pertaining to discretization.
A major limitation of most community-wide analyses is that those often do not yield a natural unified framework for evaluation of taxon-level effects. Currently, methods that have this unifying ability are emerging [35]. None of these, however, are evaluated for robustness with heteroscedastic data yet.