Validation and standardization of DNA extraction and library construction methods for metagenomics-based human fecal microbiome measurements

Background Validation and standardization of methodologies for microbial community measurements by high-throughput sequencing are needed to support human microbiome research and its industrialization. This study set out to establish standards-based solutions to improve the accuracy and reproducibility of metagenomics-based microbiome profiling of human fecal samples. Results In the first phase, we performed a head-to-head comparison of a wide range of protocols for DNA extraction and sequencing library construction using defined mock communities, to identify performant protocols and pinpoint sources of inaccuracy in quantification. In the second phase, we validated performant protocols with respect to their variability of measurement results within a single laboratory (that is, intermediate precision) as well as interlaboratory transferability and reproducibility through an industry-based collaborative study. We further ascertained the performance of our recommended protocols in the context of a community-wide interlaboratory study (that is, the MOSAIC Standards Challenge). Finally, we defined performance metrics to provide best practice guidance for improving measurement consistency across methods and laboratories. Conclusions The validated protocols and methodological guidance for DNA extraction and library construction provided in this study expand current best practices for metagenomic analyses of human fecal microbiota. Uptake of our protocols and guidelines will improve the accuracy and comparability of metagenomics-based studies of the human microbiome, thereby facilitating development and commercialization of human microbiome-based products. Video Abstract Supplementary Information The online version contains supplementary material available at 10.1186/s40168-021-01048-3.


Genome sequencing and assembly
For Escherichia coli strain NBRC 3301, Akkermansia muciniphila strain JCM 30893 and Faecalibacterium prausnitzii strain JCM 31915, extraction of DNA, sequencing library construction, sequencing and genome assembly were performed as described by Tourlousse et al. (2020aTourlousse et al. ( , 2020bTourlousse et al. ( , 2020cTourlousse et al. ( , 2020d. Genome coverage of the short-and long-read libraries were in the range of 100-210× and 45-120×, respectively. For Cutibacterium acnes subsp. acnes strain NBRC 107605 T , Bacillus subtilis subsp. subtilis strain NBRC 13719 T , Streptococcus mutans strain NBRC 13955 T , Lactobacillus delbrueckii subsp. delbrueckii strain NBRC 3202 T , short-read libraries were prepared using the TruSeq DNA PCR-Free kit and sequencing performed on a MiSeq instrument. For Staphylococcus epidermidis strain NBRC 100911 T , the TruSeq DNA Nano kit was used for library construction and sequencing performed using a HiSeq instrument. Quality control of Illumina short reads was performed using CLC Genomic Workbench v8.5.1 (Qiagen). For all above strains, long-read libraries were generated with a Ligation Sequencing Kit (Oxford Nanopore technologies) and a GridION device used for sequencing. Combined genome coverage of the short-and long-read libraries exceeded 500× for all strains.
Quality of all genomes was confirmed by CheckM v1.0.18 (Parks et al., 2015) based on genome completeness, contamination, and coding density.

Description of protocol P
Protocol P is an in-house phenol-chloroform based method for DNA extraction and was performed as follows.
Fecal sample was resuspended in 300 µl of 100 mM NaH2PO4 (pH 8.0), 300 µl of lysis buffer (100 mM of NaCl, 500 mM of Tris-HCl, 10% of sodium dodecyl sulfate, pH 8.0) and 300 µl of phenol-chloroform-isoamyl alcohol (PCI,25:24:1,v:v:v), and 1.2 g of 0.1-mm autoclaved Zirconia beads added. Cell lysis was performed by beadbeating using the FastPrep-24 instrument for 40 s or 60 s at a speed of 6 m/s. Following incubation for 10 min at 60 °C, the sample was centrifuged for 5 min at 14,000 ×g and 600 µl of supernatant recovered. If applicable, 0.01 volumes of RNase A (10 mg/ml) were added, and RNA digested for 10 min at 37 °C. An equal volume of PCI was then added, the sample mixed by vortexing, centrifuged for 3 min and the supernatant recovered.
These steps were repeated two times. Subsequently, an equal volume of chloroform-isoamyl alcohol (24:1, v/v) was added, the sample vigorously mixed by vortexing and the aqueous phase recovered after centrifugation.
Precipitation of DNA was performed by addition of 0.1 volumes of 3 M sodium acetate (pH 5.2) and an equal volume of isopropanol, followed by centrifugation for 30 min at 4 °C. The DNA pellet was washed with 70% ethanol, air dried and dissolved in EB buffer (Qiagen).

Distance-based analysis of variance
To obtain estimates of intermediate precision and interlaboratory reproducibility, we performed distance-based analysis of variance (ANOVA), as outlined below. Note that in the following example, an interlaboratory study is considered but an identical approach applies to the assessment of intermediate precision with a single factor (as performed in this study).
Consider an interlaboratory study in which a single sample is analyzed by p laboratories, each performing n measurements under repeatability conditions. Note that in our study, each laboratory performed the same number of measurements such that the ANOVA model was balanced. As for traditional univariate ANOVA, variance components were obtained based on the within-and between-group sums of squares, with the difference that the Aitchison distance (d ! ) was used to obtain the required sum of squares.
More specifically, the within-and between-group total sums of squares (denoted by TSS " and TSS # ) were obtained as follows: Based on the degrees of freedom for each of the total sum of squares, the mean sums of squares (MSS) were then obtained as: MSS " = TSS " p(n − 1) with expected values of the mean sums of squares given by: where σ + $ is the residual (that is, technical) population variance and σ , $ is the population variance due to varying laboratories. The repeatability sample variance s + $ , laboratory sample variance s , $ , and reproducibility sample variance s -$ were then obtained based on the mean sums of squares as follows: Confidence intervals for the estimates of the variances were calculated using the critical values of the chisquare distribution χ . $ with the appropriate degrees of freedom ν. For mvar -, effective degrees of freedom νwere obtained using the Welch-Satterthwaite equation (Satterthwaite, 1946;Welch, 1947): (MSS # /n) $ p − 1 + 2(n − 1)MSS " /n3 $ p(n − 1) Confidential intervals for the variances are then given by:

Setting acceptable levels of error for trueness/accuracy
To guide future development of SOPs and routine quality management, we set target values for the acceptable level of errors for metagenomics-based measurement of the mock community. Here, error refers to the differences between measured compositions and the "ground truth", determined based on DNA quantification by fluorometry for the DNA mock community and total DNA content quantification by acid-catalyzed depurination and quantification of released adenine content by HPLC, as described in the Methods, for the cell mock community. In short, we used simulations to account for the "uncertainty" in value assignments for the "ground truth" and metagenomics-based measurements. The acceptable level of error was then computed as the 95 th percentile of the differences between both sets of simulated compositions, as detailed below: DNA mock community. The acceptable level of error for the DNA mock community was set by considering differences between the "ground truth" and values determined by metagenome sequencing of libraries constructed by PCR-free methods using physical DNA fragmentation (protocols A0, C0, D0 and E0 in Table S2).
For the DNA fluorometric measurements, a type B uncertainty in the value assignments was considered, by assuming that measured DNA concentrations were normally distributed around the mean of replicated fluorometric measurements, with a standard deviation of 0.25 (equivalent to a coefficient of variation of 5%). Based on the "ground truth" value and type B uncertainty, 10,000 values (that is, DNA concentrations) were simulated for 20 strains using the function rnorm in R's stat package, simulated values combined and normalized to 100%.
For the metagenome sequencing based measurements, the closed geometric mean of measurements performed by each of the protocols (averaged across three technical replicates) was assigned as value. A type A uncertainty was then considered based on the variance matrix of the means across replicates for each protocol, computed using the function var.acomp function in the R package compositions (van den Boogaart et al., 2020). Based on the mean and variance matrix (considering only its diagonal), we then simulated 10,000 random compositions using rnorm.acomp.
We then calculated the geometric mean and maximum of absolute fold differences (AFD) between 10,000 pairs from the above two simulated compositions. Finally, the acceptable level of error was defined as the 95 th percentile of the 10,000 mean and maximum values of the AFDs.

Supplementary Figures
Supplementary Figure S1. Schematic of the design of this study. In the first phase, a wide range of protocols for DNA extraction and library construction were evaluated using defined mock communities, leading to: (i) ranking of protocols based on analytical performance (that is, agreement to the "ground truth" values) and considerations related to hands-on time, cost, etc., (ii) identification of main sources of quantitative bias, and (iii) establishment of target values for key performance metrics. In the second phase, SOPs were established for highly performant protocols and evaluated with respect to variability of measurement results within a single laboratory (that is, intermediate precision) and across multiple laboratories (interlaboratory reproducibility), using mock communities and human fecal samples, leading to (i) quantification of measurement variability and (ii) establishment of target values for key performance metrics. Detailed experimental schemes for assessment of intermediate precision and interlaboratory reproducibility are provided in Fig. S13.

DNA mock community
Cell mock community  Figure S5. Relationship between sequencing library fragment size and N50 values of metagenome assemblies as evaluated using the DNA mock community. For each library, a random set of two million quality-controlled reads pairs were assembled using MEGAHIT. Fragment sizes were estimated by mapping quality-controlled reads against the reference genome sequences using bowtie2 and parsing generated SAM files using samtools and BBMap's reformat.sh script. For the x-axis, symbols represent the mean of strain-wise median fragment sizes for individual sequencing library (three per protocol) and error bars represent the range of per-library strain-wise median fragment sizes.
Supplementary Figure S6. Variation in base call error rates across protocols for sequencing library construction as evaluated using the DNA mock community. Error rates were estimated by mapping qualitycontrolled reads against the reference genome sequences using bowtie2 and parsing generated SAM files using samtools and BBMap's reformat.sh script. A total of two million read pairs per library were analyzed and data represent the mean (solid lines) and standard deviation (ribbons, if visible) of base cell error rates for technical replicates for each protocol. Per-position error rates were calculated as the mean across strains. Panels a and b show data for reads 1 (forward) and 2 (reverse), respectively. Facets represent different kits (A-K) and colors represent different DNA input amounts (X0, XL and XH). Supplementary Figure S9. Strain-wise relative abundances for the cell mock community processed using different protocols for DNA extraction. Each facet shows the results of three technical replicates. A constant bead-beating regime of 1×40 s was used, except for protocols Q (8×60 s) and protocol O (2×600 s). Facets are sorted row-wise according to the total proportion of Gram-positives and Gram-negatives.  Supplementary Figure S10. Effect of bead-beating regime on library fragment size for the cell mock community. Fragment sizes were estimated by mapping of quality-controlled reads against the reference genome sequences using bowtie2 and parsing generated SAM files using samtools and BBMap's reformat.sh script. For the y-axis, symbols represent the mean of strain-wise median fragment sizes for each library and error bars represent the range of strain-wise median fragment sizes for each library.   Figure S12. Relationship between Aitchison distance to the "ground truth" as determined based on total DNA content (x-axis) and to protocol Q (y-axis) for ranking of protocols for DNA extractions. The Spearman's rank correlation coefficient is 0.93, showing consistent ranking of protocols. Values for protocol Q were calculated as the closed geometric mean of three technical replicates.
extracted by ten laboratories (that is, central laboratory Lab A and nine industry-based laboratories, designated B -J) from the cell mock community and fecal sample S01, and DNA gathered at the Lab A for centralized library preparation and sequencing. Three of the above nine industry-based laboratories also performed DNA extraction using a custom protocol. To assess reproducibility of library construction, sequencing libraries were constructed by four laboratories (that is, the central laboratory Lab A and three industry-based laboratories, designated B, G and H) from the DNA mock community, and DNA gathered at the Lab A for centralized sequencing. Laboratories B, G and H also performed sequencing at their own facilities and shared raw sequencing data with Lab A. Finally, Labs A, B, G and H also extracted DNA, prepared libraries, and generated sequencing data for fecal samples S01, S02, S03, S06 and S13; these data were analyzed at Lab A.  Table S3). Panels a and b show data for the DNA and cell mock community, respectively. Custom protocols for DNA extraction in panel b are shown as red x symbols.  Figure S15. (a) Illustration of GC-dependent bias associated with sequencing runs/platforms as observed in the interlaboratory study. The y-axis represents the ratio of the observed strain-wise abundances in sequencing data generated by the central laboratory and external laboratories (local). Data are for the DNA mock community, with sequencing libraries constructed using protocol BL. Colors represent four participating laboratories (that is, the central laboratory Lab A and three industry-based laboratories). Note that the laboratory with the highest bias used a NovaSeq platform (greenish color) while all other laboratories used a NextSeq 500/550 instrument. Data from the central laboratory Lab A shown in bluish color represent technical sequencing replicates. Symbols represent data from individual replicates (n=2). The linear regression lines and confidence intervals were generated using ggplot2's geom_smooth function. (b) Evaluation of repeatability of sequencing by repeated sequencing (n=4) of a single DNA mock community library (generated using protocol C0) in four different NextSeq 500 sequencing runs and its relationship to GC content. The red horizontal line represents the quadratic mean of strain-wise coefficients of variation (that is, qmCV). The smoothed trend line (generated using ggplot2's geom_smooth function) was added to visualize the U-shaped relationship between GC-content and between-sequencing-run variability in strain-wise abundances. dissimilarities between (laboratory)-different DNA extractions for the interlaboratory reproducibility study; D: dissimilarities between (laboratory)-different data generation for the interlaboratory reproducibility study for samples S01-S16, with all steps from DNA extraction to sequencing performed by the participating laboratories; E: withinlaboratory dissimilarities between different fecal samples; F: dissimilarities between custom DNA extraction protocols and protocol N for sample S01. Data from laboratories for which at least one the replicated measurements was considered as an outlier are shown in red (see Fig. S17). each dataset. Laboratories yielding outlying measurement results are marked by a red circle. Panel c shows a compositional PCA ordination plot (that is, based on Euclidean distances after clr transformation of relative abundances) with measurement results from outlying laboratories highlighted with a filled grey circle.

a b
Supplementary Figure S18. Determination of the limit-of-detection (LOD) through regressing the probability of detection (POD) of each species/mOTU (y-axis) to its mean abundance (x-axis) across available measurement results for sample S01. A total of 26 datasets were included in the analysis, namely 8 data sets generated as part of the assessment of intermediate precision of DNA extraction, and 18 data sets generated as part of the assessment of interlaboratory reproducibility of DNA extraction after exclusion of outlying data (see Fig. S17). Panels a and b show data for kraken2 (species-level) and mOTUs2 (OTU level), respectively. The model was generated by regressing the POD to the mean abundance (log-transformed), using a generalized linear model with family binomial and complementary log-log (cloglog) link function using R stats' glm function. The vertical dashed red lines represent the estimated LODs, with a fitted POD of 95%. . Panels a and c and panels b and d show results based on taxonomic profiling using kraken2 (species-level) and mOTUs2 (mOTU-level), respectively. Data generated using our recommended protocol BL are shown at the bottom of each chart. Samples are ordered along the y-axis based on their Bray-Curtis dissimilarity to protocol BL, independently in panels a and b. Note that only the top-12 most abundant species/mOTUs (based on featurewise maximum abundance across all samples) are plotted.  Figure S21. Results of the MOSAIC Standards Challenge samples for fecal samples 1 -5. Bar charts show the proportion of assigned reads at the phylum level using kraken2 (a) or mOTUs2 (b). Data for protocols P, N and Q were generated in this study. Boxplots represent the distribution of species richness observed for public data sets and colored symbols show data for protocols P (circles), N (squares) and Q (diamonds) as generated in this work. Boxplots were generated using ggplot2's geom_box function. negatives and Shannon diversity, based on data from the MOSAIC Standards Challenge. Shannon diversity was estimated based on species-level taxonomic profiles generated using kraken2. Protocol X represents publicly available data. The linear regression line and confidence intervals were generated using ggplot2's geom_smooth function. Subpanels labeled 1 through 5 represent the different fecal samples included in the MOSAIC Standards Challenge.     Dispersion of measurement results (repeatability) Coefficient of variation (b) : ≤ 4%, for PCRfree protocols or protocols with low PCR cycles, e.g., starting from >50 ng of input DNA; ≤ 6%, for protocols with high PCR cycles, e.g., starting from 1 ng of input DNA Agreement with "ground truth" (trueness) Mean absolute fold-differences (c) : ≤ 1.15 Maximum absolute fold-differences (c) : ≤ 1.5 Dispersion of measurement results (repeatability) Coefficient of variation: ≤ 6% Agreement with "ground truth" (trueness) Mean absolute fold-differences: ≤ 1.55 Maximum absolute fold-differences : ≤ 3.1 Prepare at least three replicates under repeatability conditions Point of replication: DNA extraction Scenario A: Implementation SOPs reference data mock community (even, 20 strains) (a) Complete reference genome sequences should be used and estimated genome relative abundances be on par with estimates generated by kallisto (b) Computed as the quadratic mean of strain-wise coefficients of variation (that is, qmCV) (c) Computed as geometric mean or maximum of strain-wise absolute fold differences (that is, gmAFD and maximum AFD) to the known composition ("ground truth") of the mock communities  Figure S27. (a,b) Relationship between the Aitchison distance and geometric mean of strainwise absolute fold differences (gmAFD) for (a) the DNA mock community and (b) the cell mock community. Data points were generated based on 1,000 random pairs of measurements, using data from phase I (comparison of methods) of this study. The blue line represents the fitted exponential model, generating using the function nls in R's stats package. (c-e) Relationship between the square root of the metric variance (x-axis) and quadratic mean of strain-wise coefficients of variation (qmCV) for (a) the DNA mock community, (b) the cell mock community and (c) fecal sample S01. The red line represents the proportionality constant of D -1/2 , where D represents the number of strains/species considered in the analysis. Data points were generated based on 10,000 random subsets of three measurements each, using data from phase I (comparison of methods) for the DNA and cell mock communities and data from phase II (evaluation of intermediate precision and interlaboratory reproducibility) for the fecal sample.  Tables   Supplementary Table S1. Overview of the DNA and cell mock communities developed in this study.      Note: Values represent rounded one-sided 95% confidence intervals of precision estimates generated in this study. (a) Represents the number of reads after quality filtering using fastp and removal of human genomic reads using BMTagger for human fecal samples. (a) Represents the number of reads after quality filtering using fastp and removal of human genomic reads using BMTagger for human fecal samples.