A comprehensive evaluation of microbial differential abundance analysis methods: current status and potential solutions

Background Differential abundance analysis (DAA) is one central statistical task in microbiome data analysis. A robust and powerful DAA tool can help identify highly confident microbial candidates for further biological validation. Numerous DAA tools have been proposed in the past decade addressing the special characteristics of microbiome data such as zero inflation and compositional effects. Disturbingly, different DAA tools could sometimes produce quite discordant results, opening to the possibility of cherry-picking the tool in favor of one’s own hypothesis. To recommend the best DAA tool or practice to the field, a comprehensive evaluation, which covers as many biologically relevant scenarios as possible, is critically needed. Results We performed by far the most comprehensive evaluation of existing DAA tools using real data-based simulations. We found that DAA methods explicitly addressing compositional effects such as ANCOM-BC, Aldex2, metagenomeSeq (fitFeatureModel), and DACOMP did have improved performance in false-positive control. But they are still not optimal: type 1 error inflation or low statistical power has been observed in many settings. The recent LDM method generally had the best power, but its false-positive control in the presence of strong compositional effects was not satisfactory. Overall, none of the evaluated methods is simultaneously robust, powerful, and flexible, which makes the selection of the best DAA tool difficult. To meet the analysis needs, we designed an optimized procedure, ZicoSeq, drawing on the strength of the existing DAA methods. We show that ZicoSeq generally controlled for false positives across settings, and the power was among the highest. Application of DAA methods to a large collection of real datasets revealed a similar pattern observed in simulation studies. Conclusions Based on the benchmarking study, we conclude that none of the existing DAA methods evaluated can be applied blindly to any real microbiome dataset. The applicability of an existing DAA method depends on specific settings, which are usually unknown a priori. To circumvent the difficulty of selecting the best DAA tool in practice, we design ZicoSeq, which addresses the major challenges in DAA and remedies the drawbacks of existing DAA methods. ZicoSeq can be applied to microbiome datasets from diverse settings and is a useful DAA tool for robust microbiome biomarker discovery. Video Abstract Supplementary Information The online version contains supplementary material available at 10.1186/s40168-022-01320-0.


Fig. S1
Basic steps of the proposed semiparametric simulation framework.

Fig. S2
The fit of the estimated beta mixture prior for several representative taxa in the COMBO (n = 98) dataset in comparison to the beta prior. The parameters of the beta mixture or beta distribution are estimated using maximum likelihood estimation assuming a binomial distribution of the count given the underlying proportion.

Fig. S3
The fit of the estimated beta mixture prior for several representative taxa in the American Gut Project (≈ 10, 000) dataset. The parameters of the beta mixture or beta distribution are estimated using maximum likelihood estimation assuming a binomial distribution of the count given the underlying proportion.

Fig. S4
P-value distributions based on 10,000 simulation runs (a) when the abundance of a rare taxon (0.4% relative abundance, 25% physical absence) is the same between two groups, (b) when the abundance of the same taxon (0.4% relative abundance, 25% physical absence) increases by 25% in one group.

Fig. S5
P-value distributions based on 10,000 simulation runs (a) when the abundance of an abundant taxon (9% relative abundance, 25% physical absence) is the same between two groups, (b) when the abundance of the same taxon (9% relative abundance, 25% physical absence) increases by 25% in one group.  Table   2). In the figure, "ref.pct 50%/ excl.pct 20%" (default thresholds) refers to the method using 50% taxa with the lowest error variances as the reference set and further excluding 20% taxa with the lowest p-values in the reference set in each iteration. "ref.pct 40%/ excl.pct 10%" refers to the method using 40% taxa with the lowest error variances as the reference set and further excluding 10% taxa with the lowest p-values in the reference set in each iteration. Error bars are mean±1.96´standard error.

Fig. S7
Comparison of sample-and taxon-level characteristics between the semiparametric approach and Dirichlet-multinomial (DM) model simulated data. The assessment consists of (a) the percentage of zeros (sparsity), (b) alpha diversity (Shannon diversity index), (c) β-diversity (Bray-Curtis distance), (d) the prevalence of taxa, (e) mean relative abundance of taxa, (f) variance of the relative abundance, and (g) between-taxa Spearman correlation of the relative abundances. In (ab, d-g), the histograms show the probability distributions, and the scatter plots compare the quantiles between the simulated data and real data. In (c), the two principal coordinates (PCs) are generated via principal coordinate analysis based on the Bray-Curtis distance.     Error bars are mean±1.96´standard error. "Low", "Medium" and "High" refer to the signal densities, and "Abundant" and "Rare" refer to the differential mode (Table 1). setting, vaginal data. Performance is assessed by the observed false discovery rate (FDR) level and average true positive rate (TPR) in comparison to evaluated methods. The color of the bar indicates the FDR control performance. The blue color indicates that the method controls the FDR at the 5% target level (the 95% confidence interval covers 5%). Yellow, red and gray colors indicate the observed FDR level in (0.05-0.1], (0.1, 0.2], and (0.2, 1], respectively. The length of the bar is proportional to the TPR and the actual TPR is shown in the bar. FDR and TPR ranks are based on the average FDR and TPR score across signal densities and/or differential modes.
The order of the method is arranged based on the sum of the FDR and TPR ranks. indicates the FDR control performance. The blue color indicates that the method controls the FDR at the 5% target level (the 95% confidence interval covers 5%). Yellow, red and gray colors indicate the observed FDR level in (0.05-0.1], (0.1, 0.2], and (0.2, 1], respectively. The length of the bar is proportional to the TPR and the actual TPR is shown in the bar. FDR and TPR ranks are based on the average FDR and TPR score across signal densities and/or differential modes. The order of the method is arranged based on the sum of the FDR and TPR ranks.  Balanced and (f) unbalanced change setting (sample size = 100, taxa number = 50), visualized using bar plots corresponding to Fig. 3. Error bars are mean±1.96´standard error. "Low", "Medium" and "High" refer to the signal densities, and "Abundant" and "Rare" refer to the differential modes (Table 1). unbalanced change setting (sample size = 1000, taxa number = 500). Error bars are mean±1.96´standard error. "Low", "Medium" and "High" refer to the signal densities, and "Abundant" and "Rare" refer to the differential modes (Table 1). bars are mean±1.96´standard error. "Low", "Medium" and "High" refer to the signal densities, and "Abundant" and "Rare" refer to the differential modes (Table 1).

Fig. S19
Performance of ZicoSeq when the sequencing depth differs by 4-fold between the groups, visualized using bar plots corresponding to Fig. 5. The results are based on stool data under the balanced setting (sample size = 100, taxa number = 500). Error bars are mean±1.96´standard error. "Low", "Medium" and "High" refer to the signal densities, and "Abundant" and "Rare" refer to the differential modes (Table 1).

Fig. S20
Performance of ZicoSeq when the sequencing depth differs by 9-fold between the groups. The results are based on stool data under the balanced setting (100 samples and 500 taxa). Performance is assessed by the observed false discovery rate (FDR) level and average true positive rate (TPR) in comparison to evaluated methods. The color of the bar indicates the FDR control performance. The blue color indicates that the method controls the FDR at the 5% target level (the 95% confidence interval covers 5%). Yellow, red and gray colors indicate the observed FDR level in (0.05-0.1], (0.1, 0.2], and (0.2, 1], respectively. The length of the bar is proportional to the TPR and the actual TPR is also shown in the bar. FDR and TPR ranks are based on the average FDR and TPR score across signal densities and differential modes. The order of the method is arranged based on the sum of the FDR and TPR ranks.

Fig. S23
Box plots showing the distribution of Spearman correlation of p-values between no filtered datasets and filtered datasets (prevenance less than 40% or minimal abundance less than 0.002 are excluded for analysis) based on unbalanced change setting for vaginal data.

Fig. S24
The average percentage of significant taxa at 5% FDR of the 106 real datasets when the group labels are randomly shuffled.

Fig. S25
Ensemble methods at a consensus level of 20%, 40%, 60% and 80% (denoted as "pct20", "pct40", "pct60" and "pct80"). Performance is assessed by the observed false discovery rate (FDR) level and average true positive rate (TPR). The color of the bar indicates the FDR control performance. The blue color indicates that the method controls the FDR at the 5% target level (the 95% confidence interval covers 5%). Yellow, red and gray colors indicate the observed FDR level in (0.05-0.1], (0.1, 0.2], and (0.2, 1], respectively. The length of the bar is proportional to the TPR and the actual TPR is also shown in the bar.

Table S1
Normalization methods reviewed in this study.

Table S2
Package version and source link for the differential abundance analysis methods evaluated in this study.

Table S3
Performance scoring system.

Table S4
The evaluation metrics used in the performance summary.        Table 1 Normalization methods for microbiome data.

Methods Description
Total sum scaling (TSS) The TSS size factor is simply the total number of reads in the sample.

Trimmed mean of M values (TMM)
The TMM method first selects a reference sample, then all other samples are compared to the reference sample. The weighted trimmed mean of log-ratios between each pair of samples is then calculated as the TMM size factor.

Relative Log Expression (RLE)
The RLE method calculates the geometric means of all features as a "reference," and all samples are compared to the "reference" to produce ratios for all features. The median ratio is then taken to be the RLE size factor.

Cumulative sum scaling (CSS)
The CSS size factor is the cumulative sum of counts up to a percentile determined by a data-driven approach.
Centered log-ratio transformation (CLR) For each sample, the counts are divided by their geometric mean, followed by log transformation. Thus, the CLR size factor is the geometric mean of the counts in a sample.
Geometric mean of pairwise ratios (GMPR) For each sample, the GMPR method calculates the pairwise ratios to all other samples for each feature. The size factor is then the geometric mean of the median ratios for all features.

Wrench
Wrench models feature-wise proportion ratios to a reference sample using a hurdle log-normal model, where a compositional scale factor is included so that the log fold changes on the absolute abundance level is centered at 0 (the majority of the taxa do not change).