A powerful microbial group association test based on the higher criticism analysis for sparse microbial association signals.

BACKGROUND
In human microbiome studies, it is crucial to evaluate the association between microbial group (e.g., community or clade) composition and a host phenotype of interest. In response, a number of microbial group association tests have been proposed, which account for the unique features of the microbiome data (e.g., high-dimensionality, compositionality, phylogenetic relationship). These tests generally fall in the class of aggregation tests which amplify the overall group association by combining all the underlying microbial association signals, and, therefore, they are powerful when many microbial species are associated with a given host phenotype (i.e., low sparsity). However, in practice, the microbial association signals can be highly sparse, and this is especially the situation where we have a difficulty to discover the microbial group association.


METHODS
Here, we introduce a powerful microbial group association test for sparse microbial association signals, namely, microbiome higher criticism analysis (MiHC). MiHC is a data-driven omnibus test taken in a search space spanned by tailoring the higher criticism test to incorporate phylogenetic information and/or modulate sparsity levels and including the Simes test for excessively high sparsity levels. Therefore, MiHC robustly adapts to diverse phylogenetic relevance and sparsity levels.


RESULTS
Our simulations show that MiHC maintains a high power at different phylogenetic relevance and sparsity levels with correct type I error controls. We also apply MiHC to four real microbiome datasets to test the association between respiratory tract microbiome and smoking status, the association between the infant's gut microbiome and delivery mode, the association between the gut microbiome and type 1 diabetes status, and the association between the gut microbiome and human immunodeficiency virus status.


CONCLUSIONS
In practice, the true underlying association pattern on the extent of phylogenetic relevance and sparsity is usually unknown. Therefore, MiHC can be a useful analytic tool because of its high adaptivity to diverse phylogenetic relevance and sparsity levels. MiHC can be implemented in the R computing environment using our software package freely available at https://github.com/hk1785/MiHC.


Background
The recent advent of next-generation sequencing has enabled unbiased microbiome profiling for all microbes inhabiting in different organs of the human body. The two major sequencing platforms for microbiome profiling are the targeted polymerase chain reaction amplicons for the 16S ribosomal RNA (rRNA) gene [1,2] and the shotgun metagenomics for the whole microbial genome [3]. These sequencing platforms produce various types of metagenomic information, such as microbial abundance, gene content, and metabolic capacity [4]. Among those, here we focus on a common type of the microbiome data for the microbial composition with relative abundances and phylogenetic relationships. We also consider the operational taxonomic unit (OTU) as a surrogate of microbial species and the smallest unit of the microbial biomarkers nested in different microbial assemblages (e.g., communities (bacteria, fungi, viruses), upper-or lower-level taxa (phyla, classes, orders, families, genera)). The roles of the microbiome on human health or disease have been intensely studied throughout all different microbial assemblages. For example, the community of bacteria has been primarily studied on the disparity in microbial diversity among different populations (e.g., diseased vs. non-diseased, treatment vs. placebo) [5][6][7][8]. While the communities of fungi or viruses have been less studied, they are gaining more and more attention [9,10]. Moreover, investigators have intensely studied the disparity in microbial taxon composition throughout a breadth of hierarchical taxonomic classifications (e.g., phyla to genera) [11,12].
Here, we refer, in general, the study on the association between any microbial group (e.g., community or clade) composition and a host phenotype (or any other health/ disease-related factor) as a microbial group association study. In response to the popularity of such studies, researchers have proposed a number of microbial group association tests while incorporating the unique features of the microbiome data (e.g., high-dimensionality, compositionality, phylogenetic relationship) into their proposed tests. The most popular approaches are the association tests using αor β-diversity indices [5,6]. αdiversity measures within-sample diversity, by which the high-dimensional microbiome information can be projected into a single diversity variable. We can then easily test the association between an α-diversity index and a host phenotype using a traditional statistical method (e.g., generalized linear models), or we can jointly consider multiple α-diversity indices and conduct an omnibus microbial diversity association analysis using the adaptive microbiome α-diversity-based association test (aMiAD) [13]. On the other hand, β-diversity measures between-sample diversity (i.e., dissimilarity or distance), by which the high-dimensional microbiome information can be projected into a full rank similarity matrix via a kernel machine framework [14]. We can then test the association using the ANOVA-type association test known as the permutational multivariate analysis of variance (PERMANOVA) [15][16][17] or the regression-type association test known as the regression-microbiome regression-based kernel association test (MiRKAT) [14], while they result in a similar performance [14]. Researchers have also proposed diverse microbial group association tests to amplify the overall group association by combining all the underlying microbial association signals (e.g., the microbiome sum of powered score tests (MiSPU) [18]).
All the above tests generally fall in the class of aggregation tests as all the underlying microbial association signals are aggregated into the αor β-diversity or the overall group association statistic [19]. Therefore, they are powerful when a large number of OTUs are associated with a host phenotype (i.e., low sparsity) [19]. However, in practice, it is possible that only few OTUs are associated with a host phenotype (i.e., high sparsity), and, as an extreme case, even a single OTU can cause human disease (e.g., a small influx of Escherichia coli O157:H7 can cause food poisoning [20]). However, it is questionable if the current methods can powerfully discover the microbial group association for the high sparsity situation. For example, it is so obvious that there is a huge disparity in a variety of host phenotypes between the normal and germ-free mice because of the huge disparity in their microbiomes (i.e., presence vs. absence of microbiome) [21], and, in this low sparsity situation, any of the current methods can powerfully discover the microbial group association with no need for any additional method development. Thus, here we instead move our focus onto the high sparsity situation, in which only a small portion of the OTUs are associated with a host phenotype and the pressing issue of powerfully discovering the disparity in a host phenotype driven by the sparse association signals.
We notice that the group association test, known as higher criticism (HC) test, is powerful at high sparsity levels because its test statistic reflects only the single largest association signal among underlying individual association signals [22]. While the use of the higher criticism test has been extended to genome-wide association studies [23,24], it has not been well-appreciated for the microbiome group association analysis. This might be because of the unique features of the microbiome data and the resulting need for more sophisticated analysis procedures. Thus, here we further tailor the higher criticism test for microbial group association analysis by incorporating phylogenetic information and modulating sparsity levels, as follows. First, we notice that phylogenetically relevant species share similar genetic components and evolutionary histories and, as a result, they are likely to have similar functional effects on a host phenotype [25]. Thus, to improve power when the OTUs associated with a host phenotype are phylogenetically relevant, we introduce a weighted higher criticism test which gives a higher weight to the OTUs whose phylogenetically relevant OTUs have larger association signals. Second, the original higher criticism test is powerful at a high sparsity level but rapidly loses power as the sparsity level decreases. Thus, to improve power for lower sparsity levels, we introduce a modulated higher criticism test which flexibly reflects the single or multiple largest association signal(s) among underlying individual association signals. In addition, we notice that the Simes test [26] is also powerful at high sparsity levels because it requires only a single strong association signal among underlying individual association signals which is significant even after the multiple testing correction. We heuristically, but not theoretically, found that the Simes test is more powerful at excessively high sparsity levels than the higher criticism test while the Simes test more rapidly loses power as the sparsity level decreases than the higher criticism test (see the "Simulation results" section).
Here, the dilemma in reality is that the OTUs associated with a host phenotype can be phylogenetically relevant or not, and they can be highly sparse or less sparse. Yet, unfortunately, we cannot presume which specific association pattern underlies our study in advance because of the lack of prior knowledge. Thus, here we introduce a data-driven omnibus test, namely, microbiome higher criticism analysis (MiHC), which robustly adapts to diverse association patterns. To achieve the robust adaptivity, we first construct multiple candidate tests by combining the principles of the original, weighted and modulated higher criticism tests, and the Simes test, in which each of the individual candidate tests suits some specific association pattern. Then, we use the minimum p value among those candidate tests as the test statistic of MiHC with the aim of closely reaching the highest power among those candidate tests. Finally, we use a residual-based permutation approach based on the minimum p value statistic to calculate the p value for MiHC. Here, the residual-based permutation approach enables to preserve OTU-by-OTU correlations [27], which are inherent in the microbiome data because of the compositional constraint (also known as unit sum constraint), phylogenetic relevance, and other potential sources.
Our extensive simulations show that MiHC robustly maintains a high power at different phylogenetic relevance and sparsity levels with correct type I error controls at the significance level of 5%. We also apply MiHC to four real microbiome datasets to test the association between respiratory tract microbiome and smoking status [28], the association between the infant's gut microbiome and delivery mode [29], the association between the gut microbiome and type 1 diabetes status [30], and the association between the gut microbiome and human immunodeficiency virus (HIV) status [31].

Methods and materials
This section is devoted to describe methodological details (i.e., models, notations, test statistics, and computational procedures) for our proposed methods. Since we use many notations, we organize them in a summary table for easy follow-up (Additional file 1: Table S1).

Generalized linear models and marginal score statistics
We suppose that the data include n samples, m OTUs in a microbial group of interest (e.g., community or clade) and l covariates (e.g., age, gender). Let y i denote a host phenotype (or any other health/disease-related factor) of interest, o ij denote an OTU in relative abundance (i.e., proportion), and x ik denote a covariate for i = 1,…, n, j = 1,…, m and k = 1,…, l. To test the association between OTUs and a host phenotype adjusting for covariates, we consider a generalized linear model [32] (Eq. 1), where g(·) is a canonical link function, μ i = E(y i | x i , o i ), and α = (α 0 , …, α l ) T and β = (β 1 , …, β m ) T are the regression coefficients for the covariates, x i = (1, x i1 , …, x il ) T , and the OTUs, o i = (o i1 , …, o im ) T , respectively. Here, y i conditional on x i and o i is assumed to follow a distribution in the exponential dispersion family with the probability density/mass function (Eq. 2).
where θ i is the natural parameter, φ is the dispersion parameter, and a(·), b(·), and c(·) are the known functions [32]. Let b ′ (θ i ) and b ′′ (θ i ) denote the first two de- Here, we are interested in testing the global null hypothesis of no association between OTUs and a host phenotype adjusting for covariates (Eq. 3).
H 0 : β j ¼ 0 for all j 0 sin f1; …; mg vs: : While we will soon address the above global hypothesis testing in the following sections, here we first delineate the marginal standardized score statistic for each OTU (Eq. 4) as it is the key component of the higher criticism test [24].

Unweighted and weighted higher criticism analyses
Donoho and Jin first derived the higher criticism test, motivated by an idea of the great statistician, John Wilder Tukey [22]. Then, the higher criticism test has been further developed by a few follow-up studies [23,24,33,34]. While there are different forms of the test statistic, we use the simplest form of (Eq.5) based on [23].
where uHC is the test statistic for the higher criticism test [23], p j is the p value for the jth OTU, and r j is the rank of p j in the ascending order of p j 's for j = 1,…, m. We denote this higher criticism test as the unweighted higher criticism (uHC) test in order to distinguish it from the forthcoming weighted higher criticism test.
Here, a relatively large observed statistic value compared with null statistic values indicates a higher chance to discover the group association. Prior studies have found that this higher criticism test sensitively detects highly sparse association signals [22][23][24]33]. The major rationale behind is that the test statistic (Eq. 5) focuses on the single largest deviation between the expected ( and observed ( p j ffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffiffi p j ð1−p j Þ=m p ) quantiles of significance among all the m tests; as such, only a small number of association signals are sufficient to get a large statistic value [22][23][24].
In microbiome association studies, phylogenetically relevant species tend to have similar effects on a host phenotype because of their similarities in genetic components and evolutionary histories [25]. Thus, to improve power when the OTUs associated with a host phenotype are phylogenetically relevant, we introduce the weighted higher criticism (wHC) test (Eq. 6).
where wHC is the test statistic for the weighted higher criticism test and w j is the weight for the jth OTU. To assign the weight to each OTU (i.e., w j for j = 1,…, m), we first partition the m OTUs into C phylogenetically close clusters based on OTUs' pairwise cophenetic distances [35], where the cophenetic distance of any two OTUs refers to the total length of the branches to their most common ancestor (i.e., the closest intersection) in the phylogenetic tree and we calculate it using the function, cophenetic, in the R package, stats. For this, we use the partitioning-around-medoids algorithm [36] based on the optimal number of clusters (C) which maximizes the average silhouette width searching up to 30 clusters [36]. Let ζ(j) denote a cluster anchored at the jth OTU among the C clusters. Then, we define w j as (Eq. 7), where D j; j 0 is the cophenetic distance between jth and j ′ th OTUs, j ∈ ζ(j) and j ′ ∈ ζ(j)\{j}. w j is designed to give a higher weight to the OTUs whose neighboring OTUs, with respect to closer phylogeny (see 1 D j; j 0 ), have larger association signals (see j Z j 0 j ). Therefore, w j amplifies the association signals from close phylogeny and hence can suit when the OTUs associated with a host phenotype are phylogenetically relevant.

Modulated higher criticism analyses for lower sparsity levels
Again, the higher criticism test is powerful for high sparsity levels, but it is underpowered for low sparsity levels [24]. In practice, the true associations are not always so highly sparse that the higher criticism can be underpowered. Thus, to improve power for lower sparsity levels, we make some modulations to the original test statistic as (Eq. 8), where uHC (h) is the test statistic for the unweighted higher criticism test for a given h value,  (1) ). uHC (1) relies on the single largest deviation and hence can suit high sparsity levels. As h increases, uHC (h) considers more deviations to the next level association signals and hence can suit lower sparsity levels. When h=m, uHC (h) becomes uHC (m) . uHC (m) considers all the m deviations and hence can suit the least sparsity level. Without loss of generality, we can apply the same modulations to the weighted higher criticism (wHC) test (Eq. 9).
where wHC (h) is the test statistic for the weighted higher criticism test for a given h value, and 's in the descending order for j = 1,…, m. We calculate the p values for the individual unweighted (uHC (h) 's) and weighted (wHC (h) 's) tests based on a permutation method (see the "p value calculation" section).

Simes test
Simes (1986) introduced a simple modification of the Bonferroni procedure for multiple hypothesis testing and a group association test, known as the Simes test, that calculates the p value as the minimum p value among the marginal p values that are corrected by the Bonferroni procedure (i.e., multiplied by the number of tests) and weighted by the inverse of their ranks (i.e., multiplied by the inverse of their ranks) [26] (Eq. 10).
where P Simes and T Simes are the p value and the test statistic for the Simes test, p j is the p value for the jth OTU, and r j is the rank of p j in the ascending order of p j 's for j = 1,…, m. To discover the group association, the Simes test requires only a single strong association signal which can produce a significant p value even after adjusting for multiple hypothesis testing. Thus, the Simes test is also powerful at highly sparsity levels. Our simulations demonstrate that the Simes test is more powerful at excessively high sparsity levels than the higher criticism test while the Simes test more rapidly loses power as the sparsity level decreases (see the "Simulation results" section).

Microbiome higher criticism analysis
In reality, the true microbial associations can be phylogenetically relevant or not, and they can be highly sparse or less sparse, yet we do not know the true underlying association pattern in advance. Thus, to robustly adapt to the unknown phylogenetic relevance and sparsity levels, we propose a data-driven omnibus test, namely, microbiome higher criticism (MiHC) analysis (Eq. 11). This kind of the minimum p value statistic approach has also been widely used in many prior association tests [13,14,18,[37][38][39]. The set (Г) can be spanned up to the union set of {1, 2,…, m-1, m}. However, it is a huge computational burden to survey all the h values in the union set because of the highdimensionality of the microbiome data. Thus, we use a candidate set of Г = {1, 3, 5, 7, 9} and it was sufficient in our simulations and real data applications. The use of the minimum p value statistic allows MiHC to closely approach the most powerful test among all the candidate tests in Г and the Simes test. Therefore, compared with the original higher criticism test (which is only for h = 1) or the Simes test, our candidate set always gives a similar or higher power. Our extensive simulation experiments demonstrate the high adaptivity of MiHC to various phylogenetic relevance and sparsity levels while robustly maintaining a high power with wellcontrolled type I error rates (see the "Simulation results" section). By the same logic, we can also consider two local omnibus tests, namely, uHC A (Eq. 12) and wHC A (Eq. 13), that are taken within each of the two sub-domains: (1) the unweighted higher criticism tests (i.e., uHC (h) tests for h's in Г) and the Simes test and (2) the weighted higher criticism tests (i.e., wHC (h) tests for h's in Г) and the Simes test.
T uHC A is the minimum p value among uHC (h) (Eq. 8) tests and the Simes test (Eq. 10) while T wHC A is the minimum p value among wHC (h) (Eq. 9) tests and the Simes test (Eq. 10). These two local omnibus tests are distinguished from the global omnibus test, MiHC, that is taken within the global domain of all the unweighted and weighted higher criticism tests (i.e., all the uHC (h) and wHC (h) tests for h's in Г) and the Simes test. T uHC A and T wHC A are the test statistics of uHC A and wHC A , respectively, and we calculate the p value based on a permutation method (see the "p value calculation" section). By the formula, we can infer that uHC A and wHC A can modulate sparsity levels through h's in Г and the Simes test for excessively high sparsity levels, while uHC A suits the low phylogenetic relevance, but wHC A suits the high phylogenetic relevance. Although the global omnibus test (i.e., MiHC) (Eq. 11) is our major proposal for microbial group association analysis, we introduce these two local omnibus tests (i.e., uHC A and wHC A ) especially because uHC A is useful to modulate sparsity levels when the phylogenetic information is not available (e.g., microbial functional studies for genetic/metabolic content).

p value calculation
There have been different approaches to calculate the p value for the higher criticism test [22-24, 33, 34]. The analytical approaches based on an asymptotic distribution proposed in [22,33,34] have the advantage of producing a closed-form p value in a computationally efficient manner. However, the analytical approaches assume independent tests and/or rely on asymptotics in m which requires m as large as a million for valid statistical inferences [23]. In microbiome association studies, the independence assumption can be easily violated because of the inherent compositional constraint and phylogenetic relevance. Furthermore, the microbiome data do not usually include a million OTUs so that the slow convergence rate to asymptotics in m can lead to invalid statistical inferences [23]. Thereafter, Barnett et al. proposed an exact p value calculation which releases the independence assumption and the huge m requirement. However, its computational burden increases exponentially as m increases; hence, it can handle only a small number of OTUs. Therefore, instead of using the asymptotic or the exact method, we use a permutation method to calculate the p value for our proposed method. In particular, we use the following procedures. Importantly, our permutation method can robustly account for any correlation structure among the m tests using the same permuted residual vectors repeatedly for each test (i.e., residual-based permutation) [27]. Moreover, since MiHC is based on the score test (Eq. 4) which is computationally efficient and the null model needs to be fitted only once, our method is computationally manageable.

Visualization
Here, we introduce simple Q-Q plots to demonstrate influential OTUs in each of the two sub-domains (i.e., uHC and wHC) for MiHC. First, we draw Q-Q plots between the expected ( tiles for wHC (Eq. 6), respectively, for j = 1,…, m. Here, we use (blue) dots to represent individual OTUs and a (red) diagonal line with intercept 0 and slope 1 to represent no influential points; as such, the OTUs that fall along the diagonal line have no influence on the host phenotype while the OTUs that have larger deviations from the diagonal line are more influential on the host phenotype. Then, we report the 10 most influential OTUs corresponding to the 10 largest deviations from the diagonal line with respect to uHC and wHC, respectively. We use darker to lighter vertical lines to represent more to less influential OTUs in rank order among the 10 most influential OTUs. Example visualizations can be found later in the "Real data applications" section.

Simulation results
We conducted simulation experiments to compared MiHC with the prior tests, Simes test [26], higher criticism (HC) test (i.e., uHC (Eq. 5)) [22], aMiAD [13], adaptive MiSPU (aMiSPU) [18], and Optimal MiRKAT (OMiRKAT) [14]. Our simulation design is based on prior studies [14]. We first estimated the proportions and dispersion for the 100 most abundant OTUs from the real respiratory tract microbiome data [28] based on the Dirichlet-multinomial model [40]. Then, we iteratively generated an OTU count table using the Dirichletmultinomial model with the estimated proportions and dispersion and a rooted phylogenetic tree with 100 leaves using the function, rtree, in the R package, ape [41]. Here, we fixed the total reads per sample as 1000 to mimic the compositional constraint and considered two different sample sizes, n = 50 and n = 100, respectively. To illustrate fits of the simulated data, we generated histograms of the relative abundances for the 100 most abundant OTUs of the real respiratory tract microbiome data and the simulated data based on the Dirichlet-multinomial model (Additional file 2: Figure  S1). For the relative abundances of the simulated data, we took averages across 100 simulated data sets that were iteratively generated based on the Dirichletmultinomial model for n = 50 and n = 100, respectively. To estimate type I error rates and powers, we generated Gaussian responses based on the linear regression model below.
where x i1 are x i2 are the covariates generated from the Bernoulli distribution with success probability 0.5 and the standard normal distribution N(0, 1), respectively, Λ is a set of OTUs that are associated with the host phenotype y i , ε i is an error term generated from the standard normal distribution N(0, 1), and scale is the standardization function to have mean 0 and standard deviation 1.
To estimate type I error rates, we assigned β = 0 to reflect the null hypothesis of no association for all OTUs (Eq. 3). To estimate powers, we assigned β = 1 for n = 50 and β = 0.5 for n = 100, while choosing the set of associated OTUs (Λ) based on two different scenarios: (1) we randomly selected 2%, 4%, 6%, 8%, 10%, or 12% of the OTUs to be associated with the host phenotype and (2) we selected 2%, 4%, 6%, 8%, 10%, or 12% of the OTUs which are phylogenetically close to be associated with the host phenotype. We regard the second scenario more realistic because the phylogenetic relevance likely to give shared functional attributes. In particular, for the second scenario, we randomly selected one OTU as a seed OTU and then included 2%, 4%, 6%, 8%, 10%, or 12% of the OTUs that are closest to the seed OTU (including the seed OTU) with respect to cophenetic distance [35]. For both of the scenarios, 2%, 4%, 6%, 8%, 10%, and 12% reflect from high to low sparsity levels.

Fits of the simulated data
Additional file 2: Figure S1 reports the histograms of the relative abundances for the real respiratory tract microbiome data [28] and the simulated data based on the Dirichlet-multinomial model [40]. We can observe that the simulated data approximate to the real data in shape while including high proportions for rare OTUs (Additional file 2: Figure S1). This indicates that the Dirichlet-multinomial model is useful to simulate microbiome data. Table 1 reports empirical type I error rates at the significance level of 5% for all the surveyed methods. We can observe correct type I error controls (i.e., the empirical type I error rates close to the significance level of 5%) for all the individual (i.e., uHC (h) 's and wHC (h) 's) and omnibus (i.e., uHC A , wHC A , and MiHC) higher criticism tests and the Simes test and also for all the other competing tests (i.e., aMiAD, aMiSPU, and OMiRKAT) ( Table 1). Therefore, all the surveyed tests are valid in hypothesis testing.
The individual (i.e., uHC (h) and wHC (h) ) tests are more powerful using a smaller h value for higher sparsity levels, while they are more powerful using a larger h value for lower sparsity levels ( Fig. 1 and Additional file 3: Figure S2), which is explained by the modulation scheme (Eq. 8 and Eq. 9). The Simes test is powerful at high sparsity levels but rapidly loses power as the sparsity level decreases ( Fig. 1 and Additional file 3: Figure  S2). The Simes test is more powerful at high sparsity levels (i.e., ≤ 2% or ≤ 4%) even than the original high criticism test (i.e., uHC (1) ), but it is less powerful at low sparsity levels (i.e., ≥ 4% or ≥ 6%) than any individual higher criticism tests ( Fig. 1 and Additional file 3: Figure  S2). uHC A closely approaches the most powerful test among the individual unweighted tests (i.e., uHC (h) 's) and the Simes test (Fig. 1a, c and Additional file 3: Figure S2:A,C), while wHC A closely approaches the most powerful test among the individual weighted tests (i.e., wHC (h) 's) and the Simes test (Fig. 1b, d and Additional file 3: Figure S2:B,D), which is explained by the adaptivity of the minimum p value statistic (Eq. 12 and Eq. 13). The unweighted tests (i.e., uHC (h) 's and uHC A ) are more powerful than the weighted tests (i.e., wHC (h) 's and wHC A ) when randomly selected OTUs are associated with the host phenotype (Fig. 1a > Fig. 1b, Additional file 3: Figure S2A > Additional file 3: Figure S2B, and Fig.  2a), while the weighted tests are more powerful than the unweighted tests when phylogenetically relevant OTUs are associated with the host phenotype (Fig. 1c < Fig. 1d, Additional file 3: Figure S2C < Additional file 3: Figure  S2D, and Fig. 2b), which is explained by the weighting scheme for phylogenetic relevance (Eq. 7). In addition, the unweighted tests (i.e., uHC (h) 's and uHC A ) are almost equally powerful when either randomly selected OTUs (Fig. 1a) or phylogenetically relevant OTUs (Fig.  1c) are associated with the host phenotype ( Fig. 1a ≅ Fig.  1c). This is because the unweighted tests do not utilize any phylogenetic information; hence, they treat either randomly selected OTUs or phylogenetically relevant OTUs all equally as randomly selected OTUs.
To facilitate easier comparison, Fig. 2 and Additional file 4: Figure S3 report estimated powers only for the local omnibus (i.e., uHC A and wHC A ) and global omnibus (i.e., MiHC) higher criticism tests. Here again, uHC A is more powerful than wHC A when randomly selected OTUs are associated with the host phenotype ( Fig. 2a and Additional file 4: Figure S3A), while wHC A is more powerful than uHC A when phylogenetically relevant OTUs are associated with the host phenotype ( Fig. 2b and Additional file 4: Figure S3B), which is explained by the weighting scheme for phylogenetic relevance (Eq. 7). Importantly, we can observe that MiHC closely approaches the most powerful test between uHC A and wHC A (Fig. 2 and Additional file 4: Figure S3), which is Table 1 Empirical type I error rates at the significance level of 5% for the individual (i.e., uHC (h) 's and wHC (h) 's for h ∈ {1, 3, 5, 7, 9}) and omnibus (i.e., uHC A , wHC A , and MiHC) higher criticism tests, the Simes test, and the other competing tests (i.e., aMiAD, aMiSPU, and OMiRKAT)

Real data applications
The association between the respiratory tract microbiome and smoking status Charlson et al. have collected swab samples from the upper respiratory tract to survey the effect of cigarette smoking on the respiratory tract microbiome [28]. The microbiome data for the OTU abundance table and phylogenetic tree are publicly available in the R package, GUni-Frac [42], where the raw sequence data had been processed using the QIIME pipeline [2] by targeting the V1-2 region of the 16S ribosomal RNA (rRNA) gene (refer to [28] for more detailed sampling/data processing procedures) and the phylogenetic tree had been constructed by using FastTree [43,44]. The microbiome data include 273 OTUs with mean relative abundance ≥ 10 −4 for 60 samples (28 smokers and 32 non-smokers). Here, we test the association between respiratory tract microbial composition and smoking status while adjusting for gender and antibiotic use within the last 3 months.
We found the significant association between respiratory tract microbial composition and smoking status throughout all the individual and omnibus higher criticism tests and the Simes test (Fig. 4a). We can also observe only a small difference between the unweighted (i.e., uHC (h) 's and uHC A ) and weighted (i.e., wHC (h) 's and wHC A ) tests (Fig. 4a), indicating that the OTUs associated with smoking status might have only mild phylogenetic relevance. We can also confirm it in visualization with similar graphical patterns between uHC and wHC (Fig. 4a). We also report the 10 most influential OTUs with respect to uHC and wHC, respectively (Fig. 4a). For the other competing methods, aMiAD and OMiRKAT find the significant association while aMiSPU does not (  Fig. 4a).
The association between the infant's gut microbiome and delivery mode Bokulich et al. have conducted a microbiome profiling study to survey the effect of the early life factors (e.g., delivery mode, infant nutrition, antibiotic use) on the infant's gut microbiome [29]. As a demonstration, we test the association between the infant's gut microbiome and delivery mode (i.e., vaginal or cesarean birth) while adjusting for gender and predominant diet (breastfeeding vs. formula). The microbiome data include 310 OTUs with mean relative abundance ≥ 10 −4 for 32 infants (11 infants by cesarean delivery and 21 infants by vaginal delivery) [29,37], where the raw sequence data had been processed using the QIIME pipeline [2] by targeting the V4 region of the 16S rRNA gene (refer to [29,37] for more detailed sampling/data processing procedures) and the phylogenetic tree had been constructed by using FastTree [43,44].
All the weighted tests (i.e., wHC (h) 's and wHC A ) except for wHC (1) found the significant association between the infant's gut microbiome and delivery mode, while none of the unweighted tests (i.e., uHC (h) 's and uHC A ) and the Simes test found it (Fig. 4b), indicating that the OTUs associated with delivery mode might have strong phylogenetic relevance. We can also confirm it in visualization with larger deviations between the expected and observed quantiles for wHC than uHC (Fig. 4b). We can also observe that the individual tests (i.e., uHC (h) and wHC (h) ) using a larger h value find smaller p values (Fig. 4b), indicating that many OTUs might be associated with delivery mode (i.e., low sparsity). We can also confirm it in visualization that many OTUs have some deviations between the expected and observed quantiles (Fig.  4b). We also report the 10 most influential OTUs with respect to uHC and wHC, respectively (Fig. 4b). For the other competing methods, OMiRKAT finds the significant association while aMiAD and aMiSPU do not ( Table 2 B). MiHC finds the significant association (p value, 0.044) ( Table 2 B and Fig. 4b).
The association between the gut microbiome and T1D status Livanos et al. have conducted a microbiome profiling study to survey the roles of the gut microbiome on T1D onset through mouse experiments [30]. As a demonstration, we test the association between gut microbial composition and T1D status. For this, 19 mice were exposed to therapeutic-dose pulsed antibiotic (PAT) treatment at 6 weeks of age and then followed up for 30 weeks. The microbiome data include 120 OTUs with mean relative abundance ≥ 10 −4 for the 19 mice at 30 weeks of the follow-up (9 T1D-free mice and 10 T1D-onset mice), where the raw sequence data had been processed using the QIIME pipeline [2] by targeting the V4 region of the 16S rRNA gene (refer to [30] for more detailed sampling/data processing procedures) and the phylogenetic tree had been constructed by using FastTree [43,44].
We found the significant association between gut microbial composition and T1D status throughout all the individual and omnibus higher criticism tests but not through the Simes test (Fig. 5a). We can also observe Fig. 4 The Q-Q plots between the expected and observed quantiles for uHC and wHC, respectively. a The association between the respiratory tract microbiome and smoking status. b The association between the infant's gut microbiome and delivery mode. Blue dots represent individual OTUs and a red diagonal line represents no influential points. Darker to lighter vertical lines represent more to less influential OTUs in rank order among the 10 most influential OTUs that correspond to the 10 largest deviations from the red diagonal line. The asterisk represents the p values for all the individual (i.e., uHC (h) 's and wHC (h) 's for h ∈ {1, 3, 5, 7, 9}) and omnibus (i.e., uHC A , wHC A , and MiHC) higher criticism tests and the Simes test only a small difference between the unweighted (i.e., uHC (h) 's and uHC A ) and weighted (i.e., wHC (h) 's and wHC A ) tests (Fig. 5a). This might indicate that the OTUs associated with T1D status have only mild phylogenetic relevance. We also report the 10 most influential OTUs with respect to uHC and wHC, respectively (Fig. 5a). For the other competing methods, aMiSPU and OMiRKAT find the significant association while aMiAD does not (  Fig. 5a).

The association between the gut microbiome and HIV status
Pinto-Cardoso et al. have conducted a microbiome profiling study to survey the effect of antiretroviral therapy (ART) on the gut microbiome of HIV-positive individuals [31]. As a demonstration, we test the association between gut microbial composition and HIV status while adjusting for age. For this, 33 HIV-infected individuals on ART and 10 HIV-uninfected individuals from Mexico were included in the analysis [31]. The microbiome data include 422 OTUs with mean relative abundance ≥ 10 −4 for the 44 individuals, where the raw sequence data had been processed using the Resphera Insight [45] by targeting the V3 and V4 regions of the 16S rRNA gene (refer to [31,46] for more detailed sampling/data processing procedures) and the phylogenetic tree had been constructed by using PyNAST [47].
We found the significant association between gut microbial composition and HIV status throughout all the individual and omnibus higher criticism tests, but not through the Simes test (Fig. 5b). We can also observe only a small difference between the unweighted (i.e., uHC (h) 's and uHC A ) and weighted (i.e., wHC (h) 's and wHC A ) tests (Fig. 5b), indicating that the OTUs associated with T1D status might have only mild phylogenetic relevance. We can also confirm it in visualization with similar graphical patterns between uHC and wHC (Fig.  5b). We also report the 10 most influential OTUs with respect to uHC and wHC, respectively (Fig. 5b). For the other competing methods, aMiAD finds the significant association while aMiSPU and OMiRKAT do not (  Fig. 5b).

Discussion and conclusions
In this paper, we introduced a data-driven omnibus test, MiHC, to evaluate the association between microbial group (e.g., community or clade) composition and a host phenotype of interest. Our simulations demonstrated that MiHC robustly maintains a high power for both random and phylogenetic association patterns at different high sparsity levels with correct type I error controls. Table 2 The p values for the individual (i.e., uHC (h) 's and wHC (h) 's for h ∈ {1, 3, 5, 7, 9}) and omnibus (i.e., uHC A , wHC A , and MiHC) higher criticism tests, the Simes test, and the other competing tests (i.e., aMiAD, aMiSPU, and OMiRKAT). (A) The association between respiratory tract microbiome and smoking status. (B) The association between the infant's gut microbiome and delivery mode. (C) The association between the gut microbiome and type 1 diabetes status. (D) The association between the gut microbiome and human immunodeficiency virus status. * represents significant p values We also applied MiHC to four different real microbiome datasets and observed that MiHC finds stably low p values while the individual (i.e., uHC (h) 's and wHC (h) 's) higher criticism tests and the Simes test find differing p values depending on the underlying phylogenetic relevance and sparsity levels. Thus, MiHC is a useful analytic tool in practice because of the unknown phylogenetic relevance and sparsity levels.
We considered the optimal number of clusters which maximizes the average silhouette width searching up to 30 clusters and the candidate set of Г = {1, 3, 5, 7, 9}, instead of the union set of Г = {1, 2,…, m-1, m}, for the Fig. 5 The Q-Q plots between the expected and observed quantiles for uHC and wHC, respectively. a The association between the gut microbiome and type 1 diabetes status. b The association between the gut microbiome and human immunodeficiency virus status. Blue dots represent individual OTUs and a red diagonal line represents no influential points. Darker to lighter vertical lines represent more to less influential OTUs in rank order among the 10 most influential OTUs that correspond to the 10 largest deviations from the red diagonal line. The asterisk represents the p values for all the individual (i.e., uHC (h) 's and wHC (h) 's for h ∈ {1, 3, 5, 7, 9}) and omnibus (i.e., uHC A , wHC A , and MiHC) higher criticism tests and the Simes test individual (i.e., uHC (h) 's and wHC (h) 's) higher criticism tests to avoid the exhaustive search and huge computation. However, any other upper limit to fine the optimal number of clusters and any other smaller or larger candidate set can alternatively be considered by the researcher's choice through the options in our software package. For example, you may believe that the candidate set of Г = {1, 3, 5, 7, 9} is too much tailored to high sparsity levels; hence, you can include larger values in the candidate set for lower sparsity levels. Moreover, a number of microbiome data normalization procedures have been proposed [48], but there is no consensus on which procedure is the best and such debate is beyond the scope of this paper. We did not survey any further normalization procedure except for using relative abundances (i.e., proportions), instead of absolute abundances (i.e., read counts), to control differing total read counts per sample. However, MiHC is compatible with any other normalization procedure (e.g., centered log-ratio transformation [49]), which can be considered by the researcher's choice. We set up all the implementation procedures described in this paper as a default in our software package, yet we do not strictly force to use it. Instead, we give researches some user options in our software package to make the best use of it.
We developed MiHC based on the generalized linear models to handle exponential family responses with the linear predictor [32]. However, its application can be much broader, and, for example, the potential extensions to survival [38,50], longitudinal [39,51], or mediation [52] analysis need to be further studied.