Normalization and microbial differential abundance strategies depend upon data characteristics

Weiss, Sophie; Xu, Zhenjiang Zech; Peddada, Shyamal; Amir, Amnon; Bittinger, Kyle; Gonzalez, Antonio; Lozupone, Catherine; Zaneveld, Jesse R.; Vázquez-Baeza, Yoshiki; Birmingham, Amanda; Hyde, Embriette R.; Knight, Rob

doi:10.1186/s40168-017-0237-y

Microbiome

Table 1 Normalization methods investigated in this study


Method	Description
None	No correction for unequal library sizes is applied
Proportion	Counts in each column are scaled by the column’s sum
Rarefy	Each column is subsampled to even depth without replacement (hypergeometric model)
logUQ	Log upper quartile: each sample is scaled by the 75th percentile of its count distribution, and the scaled counts are then log transformed
CSS	Cumulative sum scaling: this method is similar to logUQ, except that CSS uses a flexible, sample distribution-dependent threshold to determine each sample’s quantile divisor. Only the segment of each sample’s count distribution that is relatively invariant across samples is scaled by CSS. This is intended to mitigate the influence of larger count values in the same matrix column
DESeqVS	Variance stabilization (VS): for each column, a scaling factor for each OTU is calculated as that OTU’s count divided by the OTU’s geometric mean across all samples. All of the reads in each column are then divided by the median of the scaling factors for that column. The median is chosen to prevent OTUs with large count values from having undue influence on the values of other OTUs. Then, using the scaled counts for all the OTUs and assuming a Negative Binomial (NB) distribution, a mean-variance relation is fit. The fitted relation is used to adjust the matrix counts with a log-like transformation in the NB generalized linear model (GLM), such that the variance in an OTU’s counts across samples is approximately independent of its mean
edgeR-TMM	Trimmed Mean of M-values (TMM): the TMM scaling factor is calculated as the weighted mean of log-ratios between each pair of samples, after excluding the highest count OTUs and the OTUs with the largest log-fold change. This minimizes the log-fold change between samples for most OTUs. The TMM scaling factors are usually around 1, since TMM normalization, like DESeqVS, assumes that the majority of OTUs are not differentially abundant. The normalization factor for each sample is the product of its TMM scaling factor and its original library size
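
As an illustration only, the Proportion and Rarefy rows and the scaling step of the DESeqVS row can be sketched in plain Python. This is a minimal sketch under stated assumptions, not the code used in the study: the function names, the OTU-by-sample list-of-lists layout, the decision to raise an error for samples shallower than the rarefaction depth, and the decision to skip OTUs that contain a zero count when taking geometric means are all choices made for this example. The variance-stabilizing transformation itself is not implemented here.

```python
import math
import random
from statistics import median


def column(counts, j):
    """Return sample j (one column) of an OTU-by-sample count matrix."""
    return [row[j] for row in counts]


def proportion(counts):
    """Scale each column (sample) by its sum, so every column sums to 1."""
    n_samples = len(counts[0])
    totals = [sum(column(counts, j)) for j in range(n_samples)]
    return [[row[j] / totals[j] for j in range(n_samples)] for row in counts]


def rarefy(counts, depth, seed=0):
    """Subsample every column to `depth` reads without replacement."""
    rng = random.Random(seed)
    n_otus, n_samples = len(counts), len(counts[0])
    out = [[0] * n_samples for _ in range(n_otus)]
    for j in range(n_samples):
        # One list entry per read, labelled with the OTU it belongs to.
        reads = [i for i in range(n_otus) for _ in range(counts[i][j])]
        if len(reads) < depth:
            # Assumption of this sketch: fail rather than drop the sample.
            raise ValueError(f"sample {j} has fewer than {depth} reads")
        for i in rng.sample(reads, depth):
            out[i][j] += 1
    return out


def median_ratio_size_factors(counts):
    """Per-sample divisor: median over OTUs of count / geometric mean."""
    n_samples = len(counts[0])
    ratios = [[] for _ in range(n_samples)]
    for row in counts:
        if min(row) == 0:
            # Geometric mean would be 0; skipping is an assumption here.
            continue
        geo_mean = math.exp(sum(math.log(c) for c in row) / n_samples)
        for j, c in enumerate(row):
            ratios[j].append(c / geo_mean)
    return [median(r) for r in ratios]


# Hypothetical toy matrix: 4 OTUs (rows) by 2 samples (columns).
example = [
    [10, 20],
    [20, 40],
    [30, 60],
    [0, 5],
]
size_factors = median_ratio_size_factors(example)
scaled = [[c / size_factors[j] for j, c in enumerate(row)] for row in example]
```

In the toy matrix the second sample is sequenced at roughly twice the depth of the first, so its size factor comes out twice as large and the scaled counts of the three shared OTUs agree across the two samples.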


ISSN: 2049-2618
