Suppose that for subject *i*, *i*=1,⋯,*n*, we observe the outcome value *y*_{i}, which can be continuous or binary; the vector of *p* covariates *x*_{i}=(*x*_{i1},⋯,*x*_{ip})^{⊤}, and the abundance vector of *M* OTUs *z*_{i}=(*z*_{i1},⋯,*z*_{iM})^{⊤}, after sequence processing. In matrix presentation, we have the *n*×*M* matrix of OTU abundance *Z*=(*z*_{1},*z*_{2},…,*z*_{n})^{⊤}, the *n*×*p* covariate matrix *X*=(*x*_{1},*x*_{2},⋯,*x*_{n})^{⊤}, and the outcome vector *y*=(*y*_{1},*y*_{2},⋯,*y*_{n})^{⊤}.

POST uses a kernel machine regression [10, 22] to model the relationship between OTU *m* and the outcome variable:

$$ g(\boldsymbol{\mu}) = \boldsymbol{X}\boldsymbol{\gamma}+h^{m}(\boldsymbol{Z}), $$

(1)

where *μ*=*E*(*y*|*X*,*Z*), the conditional expectation of *y* given the microbiome and covariate information; *g*(·) is the link function, which can be set to be the identity function for continuous outcomes and the logistic function for binary outcomes; *γ* is the *p*×1 vector of covariate regression coefficients; *h*^{m}(·) is a smooth function in a reproducing kernel Hilbert space generated by a kernel function *k*^{m}(*z*_{i},*z*_{j}). Function *h*^{m}(·) characterizes the effect of OTU *m* and can be specified by *k*^{m}(*z*_{i},*z*_{j}) using the dual representation of a function in kernel methods, i.e., \(h^{m}(\boldsymbol {z}_{i})=\sum _{j=1}^{n}\alpha ^{m}_{j} k^{m}(\boldsymbol {z}_{i},\boldsymbol {z}_{j})\) with \(\alpha ^{m}_{1},\cdots,\alpha ^{m}_{n}\) the unknown parameters. The association between OTU *m* and the outcome can be evaluated by testing *H*_{0}:*h*^{m}(*Z*)=*0*.

In POST, instead of using a global kernel as in the community-level test, we propose an OTU-specific kernel function *k*^{m}(·,·) for OTU *m* to permit the local OTU test to borrow information from its neighboring OTUs in the phylogenetic tree. Figure 1 describes the key steps of POST: (1) to quantify the pairwise OTU correlation based on their phylogenetic relationship, (2) to compute the subject kernel matrix *K*^{m} for OTU *m* using OTU abundance and OTU correlation data, and (3) to evaluate the association of OTU *m* with the outcome variable using the local kernel test that seeks for the optimal amount of information borrowing. The detail of each step is described below.

### Step 1: Quantifying OTU phylogenetic correlation

The first step of POST is to compute the OTU correlation matrix based on the phylogenetic relationship. To do so, we start with constructing the phylogenetic tree with *M* tips based on the sequences of the *M* OTUs using *FastTree2* [23]. We then quantify the pairwise distance between OTUs *ℓ* and *m* (denoted by *d*_{ℓm}) by taking the sum of branch lengths from OTU *ℓ* to OTU *m* through the first common node in the tree, which can be done using the R function “cophenetic.” Finally, we convert the distance to correlation (denoted by *r*_{ℓm}) using the Gaussian function, i.e., \(r_{\ell m} = \exp \left \{-\frac {d_{\ell m}^{2}}{c\times s}\right \}\), where *s* is the standard deviation (SD) of *d*_{ℓm}’s, *ℓ*,*m*∈{1,⋯,*M*};, and *c* is a positive data-adaptive parameter that controls how fast the OTU correlation decreases when the between-OTU distance increases. Parameter *c* is measured on the unit of distance SD so as to be scale-free and robust to trees generated from different tools (e.g., *FastTree2*, hclust). As detailed in step 2, neighboring OTUs with non-zero *r*_{ℓm}, *ℓ*≠*m* will contribute in the association test of OTU *m*, and *r*_{ℓm} controls the amount of information contributed from OTU *ℓ* into the test via *c*. We illustrate in Fig. 2 a typical relationship between *c* and *r*_{ℓm} using the dataset of Charlson et al. [24] considered in the simulation study. Without loss of generality, we focus on OTU3438 and its 16 neighboring OTUs in the phylogenetic tree (Fig. 2A) and show the values of *r*_{ℓ,3438} for the neighboring OTU *ℓ* under different *c*’s in Fig. 2B. We see that smaller *c* (e.g., *c*=0.02) yields a more rapid correlation decay with increasing distance, and consequently, OTU *m* can borrow a non-negligible amount of information only from a few OTUs. In contrast, larger *c* (e.g., *c*=0.08) yields a slower correlation decay with increasing distance and defines a larger “neighborhood” of OTU *m* from which OTU *m* may borrow information. When *c*=0, *r*_{mm}=1 and *r*_{ℓm}=0 for all *ℓ*≠*m*, and the local association test becomes a strict single OTU test because only information of target OTU *m* is used (e.g., Fig. 2B, panel of *c*=0). When *c*=*∞*, *r*_{ℓm}=1 for all *ℓ*, and the local association test becomes a global, community-level test because all OTUs contribute equally into the test (e.g., Fig. 2B, panel of *c*=100).

When conducting POST association test in step 3, we use data to determine the optimal *c* value. Specifically, we compute the *p*-values for a grid of *c* between 0 and *c*_{max} and use the Cauchy combination (CC) method [25, 26] to aggregate the *p*-values of different *c*’s. As detailed in the simulation studies, we set the *c*_{max}=0.05 to ensure information sharing is limited within a small, concentrated neighborhood of the target OTU. The adaptively determined *c* also ensures that only informative neighbors will make contributions to the target OTU.

### Step 2: Computing subject kernel similarity

In kernel machine regression (1), one crucial component is to specify the local kernel function *k*^{m}(*z*_{i},*z*_{j}), which consequently determines the OTU effects, i.e., \(h^{m}(\boldsymbol {z}_{i})=\sum _{j=1}^{n}\alpha ^{m}_{j} k^{m}(\boldsymbol {z}_{i},\boldsymbol {z}_{j})\). Several kernel choices are available for microbiome compositions in community-level tests [10, 22]. The key is to start with a distance metric to quantify the pairwise abundance dissimilarities, store them in the *n*×*n* abundance dissimilarity matrix *A*, and then transform *A* to the kernel similarity matrix *K* using Gower’s matrix, i.e., \(\boldsymbol {K}=-\frac {1}{2} \left (\boldsymbol {I}-\frac {\mathbf {1} \mathbf {1}^{\top }}{n}\right)\boldsymbol {A}^{2} \left (\boldsymbol {I}-\frac {\mathbf {1} \mathbf {1}^{\top }}{n}\right)\), where *A*^{2} is the element-wise square of *A*, *I* is the *n*×*n* identity matrix, and 1 is a vector of ones with length *n* [22]. For POST, we propose to construct a local, OTU-specific kernel *K*^{m} so that only local information around the target OTU is used. First, obtain the local dissimilarity matrix *A*^{m} by incorporating *r*_{ℓm} into the original dissimilarity matrix *A*. Next, perform Gower’s transformation on *A*^{m} to obtain \(\boldsymbol {K}^{m}\equiv \left \{ k^{m}({\boldsymbol {z}_{i},\boldsymbol {z}_{j})} \right \} = -\frac {1}{2} \left (\boldsymbol {I}-\frac {\mathbf {1} \mathbf {1}^{\top }}{n}\right)(\boldsymbol {A}^{m})^{2} \left (\boldsymbol {I}-\frac {\mathbf {1} \mathbf {1}^{\top }}{n}\right)\). Although such local kernel construction can be applied on arbitrary distance metrics, in POST, we choose to use the Aitchison distance, a non-phylogenetic distance measure accounting for the compositional nature of abundance data, to avoid the potential overuse of phylogenetic information. Aitchison distance is the Euclidean distance of centered log-ratio (CLR) transformed abundance data, and the “local” Aitchison distance for OTU *m* between subjects *i* and *j* can be obtained by:

$$ \boldsymbol{A}_{ij}^{m} = \sqrt{\sum_{\ell=1}^{M} r_{\ell m}\times (z_{i \ell}^{*}-z_{j\ell }^{*})^{2}}, $$

(2)

where \(z_{i\ell }^{*}=\log \left [\frac {z_{i\ell }}{\mathrm {G}(\boldsymbol {z}_{i})}\right ]\), i.e., the CLR transformation of the original abundance count *z*_{iℓ} for OTU *ℓ*, and \(\mathrm {G}(\boldsymbol {z}_{i})=\sqrt [M]{z_{i1} \cdots z_{iM}}\) is the geometric mean of OTU abundance for subject *i*. A pseudo-count 0.5 is added to the OTU counts table before doing CLR transformation. If proportion data is used, a pseudo-proportion 1e −6 can be added to the proportion table [27].

### Step 3: Performing OTU-specific association test

The association of OTU *m* can be detected by testing for *H*_{0}:*h*^{m}(*Z*)=*0* in model (1). Such a test can be constructed through the connection of kernel machine regression and generalized linear mixed models (GLMM) [28, 29]. That is, *h*^{m}(*Z*) can be viewed as subject-specific random effects with *h*^{m}(*Z*)∼*N*(*0*,*τ**K*^{m}), and then testing for *H*_{0}:*h*^{m}(*Z*)=*0* is equivalent to the variance component test of *H*_{0}:*τ*=0. Under the GLMM framework, a variance component score test can be obtained as \(T^{m,c} = \frac {1}{2\hat {\phi }} \left (\boldsymbol {y}-\hat {\boldsymbol {\mu }}_{0}\right)^{\top } \boldsymbol {K}^{m} \left (\boldsymbol {y}-\hat {\boldsymbol {\mu }}_{0}\right)\), where \(\hat {\boldsymbol {\mu }}_{0}=g^{-1}\left (\boldsymbol {X}\hat {\boldsymbol {\gamma }}\right)\) with \(\hat {\boldsymbol {\gamma }}\) the estimated covariate coefficient under the null model: *g*(*μ*)=*X**γ*, and \(\hat {\phi }\) is the estimator of the dispersion parameter under *H*_{0}. For continuous outcome, \(\hat {\boldsymbol {\mu }}_{0}=\boldsymbol {X}\hat {\boldsymbol {\gamma }}\) and \(\hat {\phi }\) equals to the estimated residual variance under *H*_{0}. For binary outcome, *ϕ*=1 and \(\hat {\boldsymbol {\mu }}_{0}=\exp \left \{\boldsymbol {X}\hat {\boldsymbol {\gamma }}\right \} / \left (1+\exp \left \{\boldsymbol {X}\hat {\boldsymbol {\gamma }}\right \}\right)\). With a fixed *c*, *T*^{m,c} asymptotically follows a weighted mixtures of \(\chi ^{2}_{(1)}\) distribution under *H*_{0}, based on which the *p*-value, denoted by *p*_{m,c}, can be computed. Because the sample size tends to be moderate in microbiome studies, we use the small-sample distribution derived in Chen et al. [30] to compute *p*_{m,c}. Finally, as the optimal *c* is unknown in reality, in POST, we consider a grid of *c*∈{*c*_{1},⋯,*c*_{J}}, use the CC method [25, 26] to combine the transformed *p*-values of different *c*’s by computing \(T^{m} = \sum _{j=1}^{J} \tan \left \{ \left (0.5-p_{m,c_{j}}\right)\pi \right \}\), and obtain the *p*-value of *T*^{m} (denoted by *p*_{m}) by \(p_{m} =\frac {1}{2}-\{\text {arctan}(T^{m}/J)\}/\pi.\) The CC method behaves like the minimum *p*-value method as *T*^{m} is dominated by the smallest *p*-values; yet, the CC method becomes a computationally fast alternative to the minimum *p*-value method because the *p*-value of *T*^{m} can be analytically obtained from a Cauchy distribution, even with correlated *p*-values [25, 26].

### Simulation study

We conducted simulation studies to evaluate the performance of POST in identifying outcome-associated OTUs. We generated OTU data based on the real dataset from the upper respiratory tract microbiome study [24], obtained from the R package *GuniFrac* [7]. The dataset consists of 856 OTU data from 60 subjects (from R object “throat.otu.tab”) and the OTU phylogenetic tree (from R object “throat.tree” constructed using *FastTree* [7]). We considered 2 different simulation settings (simulations A and B); for each simulation, we simulated OTU counts for *n*=100 individuals following Chen and Li [22] and focused on the *M*=400 most abundant OTUs in the original data. We modeled the observed OTU counts using the Dirichlet-multinomial distribution with parameters (*π*_{1},⋯,*π*_{M},*θ*), where *π*_{ℓ}’s are the means of the OTU-proportions and *θ* is the over-dispersion parameter [31, 32], and obtained their maximum likelihood estimates \((\hat \pi _{1},\cdots,\hat \pi _{M}, \hat \theta)\). Next, we generated the OTU abundance proportions (*p*_{1},⋯,*p*_{M}) from Dirichlet \((\hat \pi _{1}, \cdots,\hat \pi _{M},\hat \theta)\); we also generated the total counts of individual *i*, *N*_{i}, from a negative binomial distribution with mean 10,000 and size 25. Finally, we generated the OTU counts of individual *i* (*z*_{i1},⋯,*z*_{iM}) from multinomial (*N*_{i},*p*_{1},⋯,*p*_{M}); the OTU count matrix *Z* comprises the resulting simulated counts.

#### Simulation A

In simulation A, we adopted the simulation design of Xiao et al. and considered a case-control study with 50 cases (i.e., *y*_{i}=1) and 50 controls (i.e., *y*_{i}=0). Specifically, we used the R package *pam* to partition the 400 OTUs into 20 clusters based on the phylogenetic tree and selected causal OTUs from the 8 most abundant clusters. Given *Z* with *M*=400 and *n*=100, we then multiplied the counts of cases with a fold change vector [ exp(*β*_{1}),⋯, exp(*β*_{400})], where *β*_{m} is the effect size of OTU *m* generated from normal (*ν*,1) if OTU *m* is causal and 0 otherwise. We considered |*ν*| to be 1 (small effect size) or 2 (large effect size).

We consider five scenarios for causal OTUs as illustrated in Fig. 3. Scenarios 1–3 consider the same set of causal OTUs, which form multiple “causal OTU hubs” in the phylogenetic tree, with 1 causal hub from each of the top 8 abundant clusters and 7–10 causal OTUs per causal hub. In scenario 1, *β*_{m}’s are generated from *N*(*ν*,1) for all causal OTUs. In scenario 2, *β*_{m}∼*N*(*ν*,1) for half of the causal hubs and *β*_{m}∼*N*(−*ν*,1) for the other half of the causal hubs. In scenario 3, a random half of the causal OTUs have their *β*_{m}’s generated from *N*(*ν*,1), and the remaining half have their *β*_{m}’s from *N*(−*ν*,1); consequently, a causal hub may contain a fair number of positive-effect and negative-effect causal OTUs. Scenario 4 considers the case of smaller causal hubs with 2–3 causal OTUs per hub (i.e., less-informative trees): a random half of the causal hubs with *β*_{m}∼*N*(*ν*,1) and the other half causal hubs with *β*_{m}∼*N*(−*ν*,1). Scenario 5 considers the case where causal OTUs are randomly chosen from the entire phylogenetic tree (i.e., non-informative trees), where a random half of the causal OTUs have *β*_{m}∼*N*(*ν*,1) and the other half OTUs have *β*_{m}∼*N*(−*ν*,1).

#### Simulation B

In simulation B, we considered both continuous and binary outcomes and adopted the simulation design of Zhao et al. [10] and Koh et al. [11]. Given the simulated OTU counts *Z*={*z*_{im}} and the same set of causal OTUs as simulation A, we generated the outcome value of individual *i* by *y*_{i}∼*N*(*η*_{i},1) for continuous outcomes and *y*_{i}∼Bernoulli(*Π*_{i}) with \(\Pi _{i}= \frac {\exp (\eta _{i}) }{1+\exp (\eta _{i})}\) for binary outcomes. Value \(\eta _{i}=0.5{\omega } \times (\text {scale}(x_{1i})+\text {scale}(x_{2i})) + \sum _{m=1}^{M}\beta _{m}\times \text {scale}(z_{im})\), where *x*_{1i} and *x*_{2i} are covariates; *z*_{im} is the count of causal OTU *m* for subject *i*; and *ω*=1 or 0 is a parameter controlling if there are covariate effects. The “scale” function is to standardize the variable to mean 0 and standard deviation 1. Variable *x*_{1i} is simulated from Bernoulli(0.5) and is independent of the OTU counts, while variable *x*_{2i} is correlated with causal OTUs by letting *x*_{2i}∼*N*(*δ*_{i},1) with \(\delta _{i} =\text {scale}(\sum _{m \in \text {causal OTUs}}z_{im})\). For non-causal OTUs, *β*_{m}=0; for causal OTUs, *β*_{m}∼normal(*ν*,*ν*/5) or normal (−*ν*,*ν*/5) dependent on the scenario. We considered |*ν*| to be 0.2 (small effect size) and 0.5 (large effect size) for continuous outcomes and to be 0.3 (small effect size) and 1 (large effect size) for binary outcomes. In each setting, we considered *ω*=0 (no covariate effects) and *ω*=1 (with covariate effects).

#### Determining appropriate *c*
_{max}

To determine the appropriate *c*_{max} for POST, we conducted 100 replications under each of the 5 causal scenarios in simulations A and B and used the decision rule that an OTU is selected as important if *p*-value <0.05 and as not important otherwise. Using this decision rule, we report the true-positive rate (TPR), false-positive rate (FPR) and a composite measure by taking the harmonic mean of TPR and (1-FPR), which is referred to as the pseudo-*F* score. The TPR was obtained by first computing the fraction of detected causal OTUs among all causal OTUs in each replication and then averaging across the 100 replications; the FPR was obtained by first computing the fraction of detected non-causal OTUs among all non-causal OTUs in each replication and then averaging across the 100 replications; and the pseudo-*F* was obtained by taking the harmonic mean of the TPR and 1 −FPR in each replication and then averaging across the 100 replications.

#### Performance assessment

To examine the validity of POST, we set *β*_{m}=0 for all OTUs and report the type I error rates and the quantile-quantile (Q-Q) plots of the null *p*-values based on 4000 replications under simulations A and B. To examine the performance of selecting causal OTUs, we report the area under the receiver operating characteristic (ROC) curves (AUC in short) based on the 100 replications under each simulation setting from the 5 causal scenarios. AUC can assess the performance over various selection thresholds and account for both TPR and FPR. The AUC was obtained as follows. (1) In each replication, we computed the FDR-adjusted *p*-values of a method, based on which we then computed the FPR and TPR at varying significance thresholds. (2) We took the average FPR and average TPR across the 100 replications at each of the significant thresholds. (3) We constructed the ROC curve using the average FPRs and TPRs and computed the corresponding AUC. We used the two-stage BH (TSBH) FDR procedure [33] to compute the FDR-adjusted *p*-values for all methods except for TF, which directly gives FDR-adjust *p*-values. TSBH has been shown to yield good performance for correlated *p*-values [34] and is implemented in R function “mt.rawp2adjp” of R package *multtest*.

We compare POST with 7 baseline approaches: (1) TreeFDR (TF) of Xiao et al. [14] as implemented by the function “MicrobiomeSeqTreeFDR” in R package *StructFDR*; (2) single-OTU test (SO), by setting *c*=0 in POST; (3) DESeq2 (DE) of Love et al. [35] as implemented by the function “DESeq” with the default Wald test in the R package *DeSeq2*; (4) ANCOM-BC (AB) of Lin et al. [36] as implemented by the function “ancombc” in the R package *ANCOMBC*; (5) LinDA (LD) of Zhou et al. [37] as implemented in the function “linda” in R package *LinDA*; and (6) Wilcoxon rank-sum test on the proportional data (WR-P) and on the CLR transformed data (WR-R) as implemented in the function “wilcox.test.” The Wilcoxon test is only appropriate for binary outcomes with no covariate effects. We also consider (7) Spearman correlation test on proportional data (SC-P) and on the CLR transformed data (SC-R) as implemented by the function “cor.test” of the *stats* R package. The Spearman test is only appropriate for continuous outcomes with no covariate effects. Among these methods, POST and TF incorporate phylogenetic tree information. POST, SO, LD, WR-R, and SC-R use the CLR transformation to address the compositional nature of microbiome data.