Development of NIBSC Gut Whole Cell Reference Reagent (WC-Gut RR)
Bacterial strains included in the NIBSC Gut Microbiome Whole Cell Reference Reagents (WC-Gut RRs) were the same 20 strains used for the NIBSC DNA-Gut-Mix RR [3] to facilitate complementary usage. Strains encompass 5 phyla, 13 families, 16 genera, and 19 species ensuring a wide range of cell wall types representing what would commonly occur in the gut. They also represent a taxonomic identification challenge for downstream sequencing and bioinformatics, similar to challenges during gut microbiome analysis. Twelve strains are gram positive, and eight strains are gram negative.
For stability and distribution purposes, bacteria were grown in liquid culture and fixed with 95% acetone. Acetone was chosen as a fixative agent as it has a demonstrated ability to fix the cell and preserve the DNA [44]. To ensure that acetone fixation did not change the extraction efficiency of cells, we performed DNA extractions on fixed and unfixed pure cultures. No significant difference was observed in DNA extraction efficiency between unfixed and fixed bacteria cells (pairwise t-test, FDR-adjusted p-value, see Supplementary Table 1, Additional file 1). The absence of bacterial growth for all final reagents was confirmed through culture-based viability testing. Furthermore, confocal and electron microscopy demonstrated that bacterial cell maintained their cell form, with both cellular DNA and RNA identified within the cells, observed as fine fibrillar structures (Fig. 1). The formation of these fine fibrillar structures is caused by the sample preparation method of chemical fixation combined with freeze substitution, which results in the aggregation of DNA within the nucleoplasm [45, 46]. Freeze-drying on the reagents did not influence the consistency of the reagent (i.e. the microbiome composition) nor the DNA extraction efficiency of the kits (see Supplementary Table 2, Supplementary Fig. 1 and 2, Additional files 1, 2, and 3, respectively).
Analysis of the variability in physicochemical characteristics of DNA across different extraction methods using Gut-WC RR
WC-Gut RR aims to standardise the DNA extraction step in gut microbiome analytical pipelines and highlight where biases occur. Therefore, we required a suitable reporting framework that scientists can use to establish that their methodologies and pipelines are fit for purpose. We developed a two-step framework that firstly assessed physicochemical data to ensure DNA is fit for the purpose of downstream sequencing and secondly assessed the taxonomic composition using a four-measure reporting system as previously described [3]. By employing this framework, users can save time and costs by first assessing the DNA quality prior to sequencing to determine composition.
For reporting of DNA physicochemical quality and quantity, we proposed three measures that assess total DNA yield, DNA integrity, and purity as measured by contamination of protein. Total DNA yield was measured by fluorimetry using a Qubit. DNA integrity was measured by the DNA Integrity Number (DIN) as measured by Agilent TapeStation since this gives a clear numerical value of DNA degradation based on electrophoresis. Finally, purity was assessed using spectrophotometry, specifically absorbance at 260 nm and 280 nm, with a ratio of 260/280 used to indicate purity. This was used on the basis that a 260/280 nm ratio of ~1.8 is widely accepted as a measure of pure DNA [47].
We assessed the suitability of WC-Gut RRs and the two-step reporting framework using eight DNA extraction kits commonly used in microbiome studies. Chosen kits are produced by various manufactures and use different strategies to extract nucleic acids from bacteria (see Supplementary Table 3, Additional file 1). Significant differences were observed in the mean average DNA yield depending on the kit (Tukey HSD following ANOVA, FDR-adjusted p-value < 0.05, except for kits 1, 2, and 5, as well as Kit 5 and Kit 7 that had similar yield) with a range from 2462 ng (Kit 6) to 125 ng (Kit 3) (Fig. 2). Six of the eight kits extracted DNA of similar integrity with DIN scores ranging from 4.12 to 5.32. However, Kit 5 had a notably low DIN score of 3.35, as a result of the very low DNA concentration. Similarly, due to the consistent poor yield of Kit 3, no integrity analysis was possible, as the TapeStation’s limit of detection is 1 ng/μl. Seven of the eight kits had a consistent mean average of > 1.8 260/280 absorbance ratio with Kit 2 slightly lower at 1.69. Collectively, these results demonstrate that Gut-WC RRs can help to compare the yield, integrity, and purity of DNA extracted using different commercial DNA extraction kits. The exact threshold for what users should achieve is dependent on the downstream sequencing technology to be used, with long-read technologies favouring intact and pure DNA. A potential limitation is that extractions from stool will have inhibitors and enzymes causing degradation. However, by using this reagent, users can consistently gauge how well a DNA extraction has worked across experiments and studies and be confident that inadequate readings for the three measures on a whole cell reagent will almost certainly reflect extraction failure for their samples. Notably, all DNA extractions using the WC-Gut RR were shown to be reproducible, with average coefficient of variation (CV) values being below 100% for all measures (see Supplementary Table 4, Additional file 1).
Analysis of variability in community composition of DNA across different extraction methods using WC-Gut RRs
We next assessed whether the WC-Gut RRs could be used to evaluate how well different DNA extraction kits preserve the composition of the community in the reagent. To do this, we used a four-measure reporting system, as previously developed for evaluating next-generation sequencing and bioinformatic pipelines for NIBSC DNA-Gut-Mix RR, a complementary DNA reagent [3]. This allows for a comprehensive analysis of how well a methodology detects strains, prevents false positives, and accurately reconstitutes the community composition of the target sample. Measures calculate the number of strains detected (Diversity), the Bray-Curtis Similarity between the community composition given by the pipeline and the actual composition of the reagent (Similarity), the number of correct strains detected (Sensitivity), and the relative abundance of false-positive identifications in the final dataset (false-positive relative abundance, FPRA). Further details of the rationale behind these measures are discussed exhaustively in previous work [3].
Previous work using NIBSC DNA-Gut-Mix RR demonstrated that MetaPhlAn3 had the best analytical performance according to the four-measure reporting system [3]. We therefore initially analysed taxonomic profiles of extracted DNA using shotgun sequencing paired with MetaPhlAn3 and demonstrated significant differences across the various DNA extraction kits used (Fig. 3). Using the four-measure reporting system, we compared the MetaPhlAn3 results to the samples’ actual compositions (Fig. 3A). Differences introduced by the kits were observed for FPRA, Diversity, and Similarity, whereas Sensitivity was constant at 84% (16/19 species) for all kits. While the species Ruminococcus gauvreauii was also absent from the DNA-Gut-Mix RR, i.e. it was absent due to bioinformatics bias, the species Clostridium butyricum and Alistipes finegoldii were absent only in the WC-Gut RR. That indicates bias during the DNA extraction process for these two strains since they were present in the DNA-Gut-Mix RR. Clostridium butyricum and Alistipes finegoldii were present in the 16S amplicon sequencing results in small amounts (0–0.01%), indicating only traces of DNA were extracted and could only be identified through the PCR amplification of the 16S rRNA region but not via the shotgun sequencing methodology. FPRA ranged from 3.5 (Kit 6) to 8.2% (Kit 8). Diversity was 18 for kits 1, 2, 4, 6, 7, and 8 and 17 for Kit 3 and Kit 5, both of which had the lowest DNA concentration as assessed through physicochemical analysis. We next assessed Similarity, which was significantly different between kits (PERMANOVA, p-value < 0.05, see Supplementary Table 6, Additional file 1), with kits 1,2,6, 7, and 8 ranging between 63 and 66% similarity to the ground truth. Given equivalent levels of Sensitivity, Diversity and Similarity, these five kits performed best, with Kit 6 leading to lower levels of FPRA but decreased levels of Similarity relative to kits 1 and 7. Collectively, these results demonstrate how the NIBSC WC-Gut RR can help in determining the ability of different DNA extraction kits to accurately represent microbiome composition.
Even with the highest performing DNA extraction kit according to the Similarity measure, a 34% reduction in similarity to the known ground truth was observed. To better understand the reasons for this, we investigated the results of the complementary DNA reagent, NIBSC DNA-Gut-Mix RR, which was sequenced in parallel. Previous studies have suggested that DNA extraction kits introduce the most bias into studies [6]. However, these studies have not separated the bioinformatic tool from the DNA extraction kit. By combining the results of DNA-Gut-Mix RR with WC-Gut RR, we clearly dereplicate that 22% loss of similarity from the known composition is introduced by bioinformatic tools. Using the DNA reagents, shotgun sequencing paired with MetaPhlAn3 was only capable of achieving 78% similarity to the known ground truth. However, as the similarity of the composition of DNA extracted from the WC-Gut RR is at best 74% similar to the NIBSC DNA-Gut-Mix RR, it is likely that even with a perfect sequencing and bioinformatic pipeline, the best results that could be achieved would be 68–74% Similarity to the ground truth when using DNA extraction kits 1,2,4,7, and 8.
In order to understand the differences between the DNA extraction kits, we performed a more detailed analysis of the changes in individual species (Fig 3B). Most of the strains were significantly different in Relative Abundance in comparison with the DNA-Gut-Mix RR (Kruskal-Wallis, FDR p-value < 0.05, see Supplementary Table 5, Additional file 1). Significant differences were also observed in β-diversity (Fig. 3C), with the microbiome resulted from the extraction of the WC-Gut RR having significantly different β-diversity in comparison with the DNA-Gut-Mix RR, confirming the results observed using measures of Similarity. When grouping strains are based on their gram stain (Fig. 3D), analysis demonstrated significant differences (Tukey HSD following ANOVA, FDR-adjusted p-value < 0.001, see Supplementary Table 7, Additional file 1) in the ability of the different kits to extract DNA from gram-positive and gram-negative bacteria, supporting results of studies prior [17, 18]. Two pairs of kits, Kit 1 and Kit 2 and Kit 4 and Kit 8, had a similar ability in extracting DNA from gram-positive and gram-negative bacteria. Kits 3 and 5, which had the worst performance in the physicochemical measures and the four measures of taxonomic composition, had a poor ability to extract DNA from gram-positive bacteria (Fig. 3D and Supplementary Table 7, Additional file 1).
Previous studies have demonstrated a reagent-derived contamination present with some extraction kits [15, 23, 48]. Using negative controls, MetaPhlAn3 did not map any sequences for the negative control samples for any of the DNA extraction kits. Despite this, we would still recommend all studies incorporate negative controls into their extractions.
Our past work with NIBSC DNA-Gut-Mix RR demonstrated MetaPhlAn3 was the most specific bioinformatic tool (by the measure of FPRA) and had the highest similarity of microbiome community to the actual composition when using DNA reference reagents [3]. As a follow-up step, we tested whether individual DNA extraction kits may perform better with other bioinformatic pipelines, perhaps due to biases in pipelines correcting or masking biases introduced by different DNA extraction kits. We therefore compared different combinations of DNA extraction kits with different bioinformatic tools. Tools used included MetaPhlAn3, Centrifuge, Kaiju, Bracken, and Kraken (see Supplementary Fig. 3, Additional file 4). Using DNA extracted from the WC-Gut RR, results matched those previously described, with Sensitivity being reduced when using Centrifuge, Kraken, and Bracken, but increased when using Kaiju, in comparison with data produced by MetaPhlAn3. While the FPRA values varied between the different bioinformatics tools, the Diversity values where very different for the actual Diversity, with Centrifuge pipeline identifying 46–57 species in the reagent, Kaiju identifying 167–207 species, Kraken 68–84, and Bracken 72–95 species, in comparison with the 17–20 species identified using MetaPhlAn3 and the 19 species expected based on the actual composition of the sample. As previously shown by Amos et al. (2020), Centrifuge, Bracken, and Kraken have the lowest Similarity to the actual composition (Similarity < 60%, see Supplementary Fig. 3 A, C, and D, Additional file 4). MetaPhlAn3 and Kaiju demonstrated similarities to the actual composition of > 60%, with kits 1, 2, 6, and 7, all being between 64 and 66%. This suggests that kits 1, 2, 6, and 7 combined with MetaPhlAn3 give the most accurate analysis of the composition of a target sample, among the DNA extraction kits and bioinformatics tools tested. The observed differences between the bioinformatics tools may also be attributed to the design of the tools (e.g. MetaPhlAn3 being more conservative) and whether they report relative sequence abundance or relative taxonomic abundance [24].
Compatibility of WC-Gut RRs with 16S rRNA sequencing
Previous work using NIBSC DNA-Gut-Mix RR demonstrated the V4 region with analysis through QIIME2, and Deblur gave the highest levels of Sensitivity, FPRA, Diversity, and Similarity compared to the known composition. To ensure DNA extracted from WC-Gut RRs was compatible with 16S rRNA sequencing, we performed amplicon sequencing of the V4 region on the same DNAs extracted for shotgun sequencing with subsequent analysis with Deblur through the QIIME2 platform. Analysis at the genera level revealed differences across extraction kits in microbial diversity and composition (Fig. 4). The four taxonomic data measures (Sensitivity, FPRA, Diversity, and Similarity) were more consistent across kits than for shotgun sequencing data (Fig. 4A). This could be due to primer amplification bias offsetting extraction biases and thereby leading to a more even composition. Additionally, taxonomic identification is being set at a genera level giving a greater margin for error in taxonomic analysis by bioinformatic pipelines. Despite this, the reagents could clearly identify those kits that did not detect all genera and those reagents that led to elevated measures of Diversity. Specifically, Kit 6 had reduced Sensitivity with reference to all other kits, and all kits except Kits 3, 5, and 6 identified 16 as the number of genera present, the same as the actual composition. Similarity was consistently between 69 and 71% to the actual composition and 59–67 % when adjusting to 16S copy number, confirming previous work indicating that adjusting for 16S copy number does not improve the results [36]. Similarity between the results of the WC-Gut RR and the DNA-Gut-Mix RR ranged from 67 to 78 %. This suggests that part of the bias is due to library preparation, sequencing, and bioinformatics analysis. Improving the outcome from that part of the process, e.g. by using a bioinformatic pipeline of higher accuracy, may reveal the true level of bias introduced by the DNA extraction process, i.e. which microbes’ DNA was not efficiently extracted and therefore not appropriately represented in the sequencing results. Overall, when excluding the bioinformatics bias, by calculating the Similarity of the WC-Gut RR to the DNA-Gut-Mix RR, a 22–33% reduction in Similarity was observed. This is slightly improved Similarity in comparison with the shotgun sequencing data, most likely due to the different technologies and resolution at the genera level vs species level.
Similar to the shotgun sequencing results, analysis of the microbial composition (Fig. 4B) demonstrated nearly all of the genera were significantly different in relative abundance in comparison with the DNA-Gut-Mix RR (based on Kruskal-Wallis, FDR p-value < 0.05, see Supplementary Table 8, Additional file 1). Analysis of the β-diversity (Fig. 4C) also identified similar patterns to those found in the shotgun sequencing results. All the tested kits presented significantly different β-diversity in comparison with the DNA-Gut-Mix RR and to each other (based on PERMANOVA, p-value < 0.05, see Supplementary Table 9, Additional file 1). When the strains were grouped based on their gram stain (Fig. 4D), significant differences were identified in the ability of the kits to extract from gram-negative and gram-positive bacteria, relative to the DNA-Gut-Mix RR and each other (Tukey HSD following ANOVA, FDR-adjusted p-value, see Supplementary Table 10, Additional file 1). Post hoc testing demonstrated that two pairs of kits, Kit 1–Kit 2 and Kit 4–Kit 8, did not exhibit significant differences in the ability to extract DNA from gram-positive and gram-negative bacteria (Fig. 4D and Supplementary Table 10, Additional file 1).
In summary, these results demonstrate the ability of NIBSC WC-Gut RR to distinguish between the ability of DNA extraction kits using sequencing technologies but also highlight that using 16S rRNA sequencing and having a higher taxonomic resolution can reduce the appearance of biases in the dataset compared to shotgun sequencing.
Impact of mock community composition on kit performance
Due to our findings that kits extracted gram-positive and gram-negative bacteria with different efficiencies, we explored how community composition influenced the performance of kits. For this purpose, we used the three highest performing DNA extraction kits, from different manufacturers, based on scores of Similarity to the Actual composition. Four different commercial whole cell reagents were extracted, shotgun sequenced, and analysed with MetaPhlAn3.
Kits demonstrated variable ability in extracting DNA from the five different communities as measured by Sensitivity, FPRA, and Similarity (Fig. 5). All kits had reduced Sensitivity for the NIBSC WC-Gut RR compared to other commercial reagents. Kits also had the highest FPRA for NIBSC WC-Gut RR compared to other reagents, with Similarity being broadly comparable across ATCC reagents and the NIBSC WC-Gut RR. This clearly demonstrates that the specific composition of the NIBSC WC-Gut RR poses a strong challenge to common microbiome pipelines. This highlights the importance of using reagents that are targeted and complex. Reagents that are of low complexity, well clinically characterised, and not representative of fastidious anaerobic bacteria that are observed in the gut will give users false confidence as to how well their pipelines are working. Kits significantly differed in their ability to extract DNA from gram-positive and gram-negative bacteria (Tukey HSD following ANOVA, FDR-adjusted p-value, see Supplementary Table 11, Additional file 1), depending on the strains used in the reagent (Fig. 5D, the reagent composition can be found in the Supplementary Table 12, Additional file 1). These results highlight the necessity of reagents that include strains with different lysis ability, and strains that are relevant to the microbiome of interest, in order to recapitulate more accurately the real bias introduced by the DNA extraction processes and downstream bioinformatics analysis.