Assessment framework
To assess the qualitative and quantitative performance of marker-gene survey analysis methods we developed a framework utilizing our two-sample titration dataset (Fig. 1). The qualitative assessment evaluates feature presence-absence. The quantitative assessment evaluates feature relative and differential abundance.
Assessment dataset—mixture design We developed a dataset with real-world complexity and expected values for method assessment using mixtures of environmental samples. The mixtures were generated from samples collected at multiple timepoints during an Enterotoxigenic E. coli (ETEC) vaccine trial [30] (Fig. 2). Samples from five trial participants were selected for our mixture dataset. We mixed samples collected prior to (PRE) and after (POST) ETEC exposure following a two-sample titration mixture design. We selected trial participants (individuals) and sampling timepoints based on E. coli abundance data collected using qPCR and 16S rRNA sequencing from Pop et al. [22]. For our dataset, we identified five individuals with no E. coli detected in samples collected prior to ETEC exposure (PRE). POST samples were identified as the timepoint after ETEC exposure with the highest E. coli concentration for each subject (Fig. 2a). Due to limited sample availability, for E01JH0016 the timepoint with the second highest E. coli concentration was used as the POST sample. Independent titration series were generated for each subject. POST samples were titrated into PRE samples with POST proportions of 1/2, 1/4, 1/8, 1/16, 1/32, 1/1024, and 1/32,768 (Fig. 2b). Unmixed (PRE and POST) sample DNA concentration was measured using a NanoDrop ND-1000 (Thermo Fisher Scientific Inc., Waltham, MA, USA). Unmixed samples were diluted to 12.5 ng/μL in Tris-EDTA buffer before mixing. The resulting dataset comprised 45 samples: 7 titrations and 2 unmixed samples for each of the 5 subjects.
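The titration design implies simple expected values. As a minimal sketch (object names are ours, and the mixture equation in the function body is our restatement of Eq. 1, consistent with Eq. 4 below), the POST proportions θ and the expected relative abundance of a feature in each titration can be computed in R as:

```r
# POST sample proportions for the two-sample titration series;
# titration factors 1-5, 10, and 15 give 1/2, 1/4, ..., 1/32, 1/1024, 1/32,768
titration_factor <- c(1:5, 10, 15)
theta <- 2^(-titration_factor)

# Expected feature relative abundance in a titration (Eq. 1), given the
# feature's relative abundance in the unmixed PRE and POST samples
expected_abundance <- function(theta, q_pre, q_post) {
  theta * q_post + (1 - theta) * q_pre
}

# Example: a feature at 2% relative abundance in POST and absent from PRE
expected_abundance(theta, q_pre = 0, q_post = 0.02)
```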
The 45 samples were processed using the Illumina 16S library protocol (16S Metagenomic Sequencing Library Preparation, posted date November 27, 2013, downloaded from https://support.illumina.com). This protocol specifies an initial PCR of the 16S rRNA gene, followed by a sample indexing PCR, sample concentration normalization, and sequencing.
A total of 192 16S rRNA PCR assays were sequenced across two 96-well plates, including 4 PCR replicates per sample and 12 no-template controls. The initial PCR targeted the V3–V5 region of the 16S rRNA gene using primers Bakt_341F and Bakt_806R [14]. The V3–V5 region is 464 base pairs (bp) long, with forward and reverse reads overlapping by 136 bp when using 2 × 300 bp paired-end sequencing [31] (http://probebase.csb.univie.ac.at). Primer sequences include overhang adapter sequences for library preparation (forward primer 5’- TCG TCG GCA GCG TCA GAT GTG TAT AAG AGA CAG CCT ACG GGN GGC WGC AG - 3’ and reverse primer 5’- GTC TCG TGG GCT CGG AGA TGT GTA TAA GAG ACA GGA CTA CHV GGG TAT CTA ATC C - 3’). The 16S rRNA gene was PCR amplified using Kapa HiFi HotStart ReadyMix reagents (KAPA Biosystems, Inc., Wilmington, MA, USA), and amplicon size was verified using agarose gel electrophoresis. Concentration measurements were made after the initial 16S rRNA PCR, the indexing PCR, and the normalization steps. DNA concentration was measured using the Quant-iT PicoGreen dsDNA Kit (Cat # P7589, Thermo Fisher Scientific), and fluorescence measurements were made with a Synergy2 Multi-Detection Microplate Reader (BioTek Instruments, Inc., Winooski, VT, USA).
Initial PCR products were purified using 0.8X AMPure XP beads (Beckman Coulter Genomics, Danvers, MA, USA) following the manufacturer’s protocol. After purification, the 192 samples were indexed using the Illumina Nextera XT index kits A and D (Illumina Inc., San Diego, CA, USA) and then purified using 1.12X AMPure XP beads. Prior to pooling, the purified sample concentrations were normalized using the SequalPrep Normalization Plate Kit (Catalog no. A10510-01, Invitrogen Corp., Carlsbad, CA, USA) according to the manufacturer’s protocol. Pooled library concentration was checked using the Qubit dsDNA HS Assay Kit (Part# Q32851, Lot# 1735902, ThermoFisher, Waltham, MA, USA). Due to the low pooled amplicon library DNA concentration, a modified protocol for low concentration libraries was used. The library was run on an Illumina MiSeq, and base calls were made using Illumina Real Time Analysis Software version 1.18.54. The sequence data were deposited in the NCBI SRA under BioProject PRJNA480312; individual SRA run accession numbers and metadata are provided in the Supplemental Table. The sequencing data presented in this manuscript are part of a larger study, and the SRX accession numbers should be used to access the specific data presented here. Sequencing data quality control metrics were computed using the Bioconductor Rqc package [32, 33].
Sequence data were processed using four bioinformatic pipelines: a de novo clustering method (Mothur [7]), an open-reference clustering method (QIIME [5]), a sequence inference method (DADA2 [8]), and unclustered sequences as a control. The code used to run the bioinformatic pipelines is available at https://github.com/nate-d-olson/mgtst_pipelines.
The Mothur pipeline follows the developer’s MiSeq SOP [7, 23] and was run using Mothur version 1.37 (http://www.mothur.org/). We sequenced a larger 16S rRNA region, with a smaller overlap between the forward and reverse reads, than the region the SOP was designed for; pipeline parameters were modified to account for the difference in overlap and are noted for individual steps below. The Makefile and scripts used to run the Mothur pipeline are available at https://github.com/nate-d-olson/mgtst_pipelines/blob/master/code/mothur. The pipeline included an initial preprocessing step in which the forward and reverse reads were trimmed and filtered using base quality scores and merged into a single contig for each read pair. The following parameters were used for the initial contig filtering: no ambiguous bases, a maximum contig length of 500 bp, and a maximum homopolymer length of 8 bases. Low-quality reads were then identified and filtered from the dataset based on the presence of ambiguous bases, failure to align to the SILVA reference database (V119, https://www.arb-silva.de/) [34], and identification as chimeras. Prior to alignment, the SILVA reference multiple sequence alignment was trimmed to the V3–V5 region, positions 6388 to 25,316. Chimera filtering was performed using UChime (version v4.2.40) without a reference database [25]. OTU clustering was performed using the OptiClust algorithm with a clustering threshold of 0.97 [6]. The RDP classifier implemented in Mothur was used for taxonomic classification against the Mothur-provided version of the RDP v9 training set [35].
The QIIME open-reference clustering pipeline for paired-end Illumina data was run according to the online tutorial (Illumina Overview Tutorial (an IPython Notebook): open reference OTU picking and core diversity analyses, http://qiime.org/tutorials/) using QIIME version 1.9.1 [5]. Briefly, the paired-end reads were merged using fastq-join (version 1.3.1, [36]), and open-reference clustering was performed using the Usearch algorithm [37] against the Greengenes database version 13.8 with a 97% similarity threshold [38]. The bash script used to run the QIIME pipeline is available at https://github.com/nate-d-olson/mgtst_pipelines/blob/master/code/qiime_pipeline.sh.
DADA2, an R native pipeline, was also used to process the sequencing data [8]. The forward and reverse reads were independently quality filtered, and sequence variants were inferred for each read direction using the DADA2 probability model. The independently inferred forward and reverse reads were then merged, and chimeras were filtered. Taxonomic classification was performed using the DADA2 implementation of the RDP naïve Bayesian classifier [35] and the SILVA database V123 provided by the DADA2 developers [34, https://benjjneb.github.io/dada2/training.html]. Code for running the DADA2 pipeline is available at https://github.com/nate-d-olson/mgtst_pipelines/blob/master/code/dada2_pipeline.R.
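For orientation, the steps above map onto the DADA2 API roughly as follows. This is a minimal single-sample sketch with illustrative file names and filter settings, not the study's exact script (which is linked above):

```r
library(dada2)

# Quality filter forward and reverse reads independently
# (file names and filter parameters are illustrative)
filterAndTrim("R1.fastq.gz", "filt_R1.fastq.gz",
              "R2.fastq.gz", "filt_R2.fastq.gz",
              maxEE = 2, truncQ = 2)

# Learn error rates and infer sequence variants per read direction
err_f   <- learnErrors("filt_R1.fastq.gz")
err_r   <- learnErrors("filt_R2.fastq.gz")
derep_f <- derepFastq("filt_R1.fastq.gz")
derep_r <- derepFastq("filt_R2.fastq.gz")
dada_f  <- dada(derep_f, err = err_f)
dada_r  <- dada(derep_r, err = err_r)

# Merge read pairs, tabulate counts, and remove chimeras
merged <- mergePairs(dada_f, derep_f, dada_r, derep_r)
seqtab <- removeBimeraDenovo(makeSequenceTable(merged))

# Classify with DADA2's RDP naive Bayesian classifier and the
# developer-provided SILVA training set (file name illustrative)
taxa <- assignTaxonomy(seqtab, "silva_nr_v123_train_set.fa.gz")
```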
The unclustered pipeline was based on the Mothur de novo clustering pipeline: the paired-end reads were merged, filtered, and dereplicated. Reads were aligned to the SILVA reference alignment (V119, https://www.arb-silva.de/), and reads failing alignment were excluded from the dataset. Taxonomic classification was performed using the RDP classifier implemented in Mothur and the RDP v9 training set provided by the Mothur developers. To limit the dataset size, only the 40,000 most abundant features across all samples (comparable in size to the Mothur dataset) were used.
Qualitative assessment The qualitative assessment evaluates feature presence-absence using count table sparsity and the artifactual feature proportion. Count table sparsity is the proportion of 0-valued cells in the count table. The artifactual feature proportion is the proportion of unmixed- and titration-specific features with observed abundance values not explained by sampling alone. Unmixed-specific features are features only observed in the unmixed PRE or POST samples (Fig. 1b). Titration-specific features are features only observed in the titrations. Unmixed- and titration-specific features can arise from errors in PCR/sequencing, feature inference processes, or differences in sampling depth. The artifactual feature proportion provides context for interpreting count table sparsity results: high sparsity with a low artifactual feature proportion is indicative of a high false-negative rate, whereas high sparsity with a high artifactual feature proportion indicates a high false-positive rate (Fig. 1c).
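Count table sparsity is straightforward to compute; for example, given a feature-by-sample count matrix (`count_tab` is our name, and the values are illustrative):

```r
# Toy feature-by-sample count table (illustrative)
count_tab <- matrix(c(10, 0, 3, 0, 0, 25), nrow = 3)

# Count table sparsity: proportion of 0-valued cells
sparsity <- mean(count_tab == 0)  # 0.5 for the toy table
```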
Hypothesis tests were used to determine if random sampling alone, here sequencing depth, could account for the unmixed- and titration-specific features. p values were adjusted for multiple comparisons using the Benjamini and Hochberg method [39]. For unmixed-specific features, a binomial test was used to evaluate whether the true feature relative abundance was less than the expected relative abundance. The binomial test was infeasible for titration-specific features: their count table abundance values were 0 in the unmixed samples, so the estimated probability of occurrence is 0 and the binomial test fails. Therefore, we formulated a Bayesian hypothesis test for titration-specific features, defined in Eq. (2). This Bayesian approach evaluates whether the true feature proportion is less than the minimum detected proportion. Assuming equal priors, P(π<πmin)=P(π≥πmin), Eq. (2) reduces to Eq. (3). We define π as the true feature proportion, πmin as the minimum detected proportion, C as the expected feature counts, and Cobs as the observed feature counts. Count values for C were simulated using a beta prior (with varying alpha and beta values) for π≥πmin and a uniform distribution for π<πmin. Higher values of alpha and beta skew the prior right and left, respectively. The results of our Bayesian hypothesis test (Eq. 3) were largely unaffected by the beta distribution parameterization (Fig. S4). πmin was calculated using the mixture equation (Eq. 1), where qpre,j and qpost,j are min(Qpre) and min(Qpost) across all features for a subject and pipeline. Our assumption is that π is less than πmin for features not observed in the unmixed samples. Artifacts not explained by sampling alone are likely false-positives or false-negatives due to errors in the sequence measurement and inference processes.
$$ \begin{aligned} p & = P(\pi < \pi_{\text{min}} | C \geq C_{\text{obs}}) \\ & = \frac{P(C \geq C_{\text{obs}} | \pi < \pi_{\text{min}})\,P(\pi < \pi_{\text{min}})}{P(C \geq C_{\text{obs}} | \pi < \pi_{\text{min}})\,P(\pi < \pi_{\text{min}}) + P(C \geq C_{\text{obs}} | \pi \geq \pi_{\text{min}})\,P(\pi \geq \pi_{\text{min}})} \end{aligned} $$
(2)
$$ p = \frac{P(C \geq C_{{\text{obs}}}| \pi < \pi_{{\text{min}}})}{P(C \geq C_{{\text{obs}}})} $$
(3)
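As a hedged illustration (object names, example values, and the Monte Carlo construction are ours), the binomial test for an unmixed-specific feature and a simulation-based evaluation of Eq. (3) for a titration-specific feature might look like the following in R:

```r
set.seed(1)
total_reads <- 1e5   # sample library size (illustrative)

## Unmixed-specific feature: observed in unmixed samples, 0 in a titration.
## Test whether the true proportion is less than the expected proportion
## pi_exp derived from the mixture equation (value illustrative).
pi_exp  <- 1e-4
binom_p <- binom.test(x = 0, n = total_reads, p = pi_exp,
                      alternative = "less")$p.value

## Titration-specific feature: observed count c_obs in a titration, 0 in
## both unmixed samples. Monte Carlo evaluation of Eq. (3), assuming equal
## priors; one way to place the beta prior on pi >= pi_min (our construction).
c_obs  <- 5
pi_min <- 1e-5                  # minimum detected proportion (illustrative)
a <- 1; b <- 5                  # beta prior parameters
n_sim <- 1e5
pi_lo <- runif(n_sim, 0, pi_min)                     # pi <  pi_min (uniform)
pi_hi <- pi_min + rbeta(n_sim, a, b) * (1 - pi_min)  # pi >= pi_min (beta)
c_lo  <- rbinom(n_sim, total_reads, pi_lo)
c_hi  <- rbinom(n_sim, total_reads, pi_hi)
bayes_p <- mean(c_lo >= c_obs) /        # P(C >= C_obs | pi < pi_min)
           mean(c(c_lo, c_hi) >= c_obs) # P(C >= C_obs)

## p values from all features are then adjusted with p.adjust(method = "BH")
```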
Quantitative assessment For the quantitative assessment, we compared observed relative abundances and log fold-changes to expected values derived from the titration experimental design. For the observed relative abundance values (obs), feature relative abundance was averaged across PCR replicates using a negative binomial model; averaging prevents PCR replicate outliers from biasing the assessment results. Equation (1) and the inferred θ values were used to calculate the expected relative abundance values (exp). We defined the relative abundance error rate as |exp−obs|/exp. We developed bias and variance metrics to assess feature-level performance: the feature-level bias and variance metrics were defined as the median error rate and the robust coefficient of variation (RCOV = IQR/median), respectively.
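The feature-level metrics reduce to a few lines of R; a sketch, assuming vectors of expected and observed relative abundances across titrations for a single feature (values below are illustrative):

```r
exp_ab <- c(0.0200, 0.0150, 0.0125, 0.0113)  # expected (illustrative)
obs_ab <- c(0.0220, 0.0130, 0.0120, 0.0118)  # observed (illustrative)

# Relative abundance error rate per titration: |exp - obs| / exp
error_rate <- abs(exp_ab - obs_ab) / exp_ab

# Feature-level bias (median error rate) and
# variance (robust coefficient of variation, IQR/median)
bias <- median(error_rate)
rcov <- IQR(error_rate) / median(error_rate)
```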
We assessed differential abundance estimates by comparing log fold-changes between samples in the titration series, including the PRE and POST samples, to the expected log fold-change values. Log fold-change estimates were calculated using EdgeR [40, 41]. The expected log fold-change for feature j between titrations l and m was calculated using Eq. (4), where θ is the proportion of POST bacterial DNA in a titration and q is the feature relative abundance. For features only present in PRE samples (PRE-specific), the expected log fold-change is independent of the observed counts for the unmixed samples and was calculated using Eq. (5). For features only present in POST samples (POST-specific), the expected log fold-change values can be calculated in a similar manner; however, POST-specific features were rarely observed in more than one titration and were therefore not included in this analysis.
Due to the limited number of PRE-specific features, both PRE-specific and PRE-dominant features were used in the differential abundance assessment. PRE-specific features were defined as features observed in all four PRE PCR replicates and in none of the POST PCR replicates. PRE-dominant features were also observed in all four PRE PCR replicates and in one or more of the POST PCR replicates, but with a log fold-change greater than 5 between the PRE and POST samples.
$$ {\text{logFC}}_{lm,j} = \log_{2}\left(\frac{\theta_{l} q_{{\text{post}},j} + (1 - \theta_{l}) q_{{\text{pre}},j}}{\theta_{m} q_{{\text{post}},j} + (1 - \theta_{m}) q_{{\text{pre}},j}}\right) $$
(4)
$$ {\text{logFC}}_{lm,j} = \log_{2}\left(\frac{1-\theta_{l}}{1-\theta_{m}}\right) $$
(5)
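As a sketch of how the observed and expected log fold-changes can be obtained (a generic edgeR workflow with simulated counts; object names and the simulation are ours, not the study's code):

```r
library(edgeR)

# Simulated feature-by-sample counts for two titrations (illustrative)
set.seed(1)
counts <- matrix(rpois(40, lambda = 100), nrow = 10)
group  <- factor(rep(c("T1", "T2"), each = 2))

# Observed log2 fold-changes estimated with edgeR
dge    <- DGEList(counts = counts, group = group)
dge    <- calcNormFactors(dge)
design <- model.matrix(~ group)
dge    <- estimateDisp(dge, design)
fit    <- glmFit(dge, design)
obs_lfc <- glmLRT(fit)$table$logFC

# Expected log2 fold-change between titrations l and m (Eq. 4);
# with q_post = 0 this reduces to Eq. (5): log2((1 - theta_l)/(1 - theta_m))
expected_lfc <- function(theta_l, theta_m, q_pre, q_post) {
  log2((theta_l * q_post + (1 - theta_l) * q_pre) /
       (theta_m * q_post + (1 - theta_m) * q_pre))
}
expected_lfc(theta_l = 1/2, theta_m = 1/4, q_pre = 1e-3, q_post = 0)
```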
Count table assessment demonstration
We demonstrated the assessment framework by comparing the qualitative and quantitative assessment results across the three pipelines. We first characterized overall differences in the count tables produced by the pipelines, including the total number of features, total abundance by individual, dropout rate, and taxonomic composition.
Qualitative assessment For the qualitative assessment, we compared the proportion of artifactual features across pipelines. The artifactual feature proportion was defined as the proportion of unmixed- and titration-specific features with abundance values that could not be explained by sampling alone, that is, features whose PCR replicates had p values less than 0.05 after multiple hypothesis test correction for the binomial and Bayesian hypothesis tests described in the “Assessment framework” section. We additionally used the count table sparsity values to identify potential mechanisms responsible for differences in artifactual feature proportions.
Quantitative assessment Mixed-effects models with individual as a random effect were used to compare feature-level error rate bias and variance metrics across pipelines. Extreme feature-level bias and variance metric outliers, defined as values more than 1.5×IQR from the median, were excluded from this analysis to minimize biases due to poor model fit.
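For example, outlier features might be flagged as follows (a sketch of one reading of the 1.5×IQR rule; `bias_metric` is an illustrative vector of feature-level bias values):

```r
set.seed(1)
bias_metric <- rexp(100)  # illustrative feature-level bias metrics

# Flag features more than 1.5 * IQR from the median
is_outlier <- abs(bias_metric - median(bias_metric)) > 1.5 * IQR(bias_metric)
```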
We fit the following mixed-effects model to test for differences in measurement bias across pipelines:
$$e_{ijk} = b + b_{i} + z_{j} + \epsilon_{ijk} $$
where eijk was the observed error rate for pipeline i, individual j, and feature-titration combination k; b was the overall intercept; bi was a fixed effect modeling the pipeline effect; zj was a random effect (normally distributed with mean 0) capturing overall bias differences across individuals; and εijk was the residual error. We fit a similar model for differences in error variance across pipelines.
We used the estimated terms \(\hat{b}_{i}\) from the mixed-effects model to test for pairwise differences between pipelines. These multiple comparisons were performed using Tukey’s HSD test. A one-sided alternative hypothesis was used to determine which pipelines had smaller feature-level error rates.
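One way to implement this model and the pairwise comparisons in R uses lme4 with multcomp's Tukey contrasts (a standard analog of Tukey's HSD for mixed models); a minimal sketch with a simulated data frame (column and object names are ours):

```r
library(lme4)
library(multcomp)

# Simulated feature-level bias metrics (illustrative)
set.seed(1)
feature_metrics <- data.frame(
  bias_metric = rexp(300),
  pipeline    = factor(rep(c("dada2", "mothur", "qiime"), times = 100)),
  individual  = factor(rep(1:5, each = 60))
)

# Mixed-effects model: pipeline as fixed effect,
# individual as a random intercept
fit <- lmer(bias_metric ~ pipeline + (1 | individual),
            data = feature_metrics)

# Pairwise pipeline comparisons (Tukey contrasts) with a
# one-sided alternative for smaller error rates
summary(glht(fit, linfct = mcp(pipeline = "Tukey"),
             alternative = "less"))
```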