Species classifier choice is a key consideration when analysing low-complexity food microbiome data
© The Author(s). 2018
Received: 1 December 2017
Accepted: 5 March 2018
Published: 20 March 2018
The use of shotgun metagenomics to analyse low-complexity microbial communities in foods has the potential to be of considerable fundamental and applied value. However, there is currently no consensus with respect to choice of species classification tool, platform, or sequencing depth. Here, we benchmarked the performances of three high-throughput short-read sequencing platforms, the Illumina MiSeq, NextSeq 500, and Ion Proton, for shotgun metagenomics of food microbiota. Briefly, we sequenced six kefir DNA samples and a mock community DNA sample, the latter constructed by evenly mixing genomic DNA from 13 food-related bacterial species. A variety of bioinformatic tools were used to analyse the data generated, and the effects of sequencing depth on these analyses were tested by randomly subsampling reads.
Compositional analysis results were consistent between the platforms at divergent sequencing depths. However, we observed pronounced differences in the predictions from species classification tools. Indeed, PERMANOVA indicated that there were no significant differences between the compositional results generated by the different sequencers (p = 0.693, R2 = 0.011), but there was a significant difference between the results predicted by the species classifiers (p = 0.01, R2 = 0.127). The relative abundances predicted by the classifiers, apart from MetaPhlAn2, were apparently biased by reference genome sizes. Additionally, we observed varying false-positive rates among the classifiers. MetaPhlAn2 had the lowest false-positive rate, whereas SLIMM had the highest. Strain-level analysis results were also similar across platforms. Each platform correctly identified the strains present in the mock community, but accuracy was improved slightly with greater sequencing depth. Notably, PanPhlAn detected the dominant strains in each kefir sample above 500,000 reads per sample. Again, the outputs from functional profiling analysis using SUPER-FOCUS were generally accordant between the platforms at different sequencing depths. Finally, and expectedly, metagenome assembly completeness was significantly lower on the MiSeq than on either the NextSeq (p = 0.03) or the Proton (p = 0.011), and it improved with increased sequencing depth.
Our results demonstrate a remarkable similarity in the results generated by the three sequencing platforms at different sequencing depths, and, in fact, the choice of bioinformatics methodology had a more evident impact on results than the choice of sequencer did.
Next generation sequencing has revolutionised microbiological research by enabling high-throughput metagenomic analysis of mixed microbial communities from many different environments [1–3]. Briefly, metagenomics involves the culture-independent analysis of genomic DNA isolated from an entire microbial community, whereas genomics involves the culture-dependent analysis of genomic DNA isolated from a single microbial isolate. Metagenomic sequencing is an umbrella term which encompasses two distinct culture-independent sequencing approaches: amplicon sequencing and shotgun metagenomics. To date, amplicon sequencing, primarily of the 16S rRNA gene, has been the most commonly utilised metagenomic approach. 16S rRNA gene sequencing is used to investigate the bacterial composition of samples, but it is typically limited to genus-level identification, although higher resolution is sometimes possible [8, 9]. In contrast, shotgun metagenomics enables species-level, and potentially strain-level, classification [11–14] of microorganisms. Importantly, shotgun metagenomics can also be applied to determine the genetic content of samples to assess the associated functional potential. Shotgun metagenomics has been relatively underutilised, primarily because it is more expensive than 16S rRNA gene sequencing as it necessitates considerably higher sequencing depths. Indeed, desired sequencing depth is a factor that frequently dictates the choice of sequencing platform for high-throughput sequencing investigations.
A variety of sequencing platforms are currently available from several manufacturers, and these vary in sequencing chemistry, read length, and/or throughput. Presently, Illumina sequencers are the most commonly used sequencing platforms for microbiological research applications, including shotgun metagenomics. Illumina sequencing chemistry is based on sequencing-by-synthesis, wherein adaptor-ligated DNA fragments on the surface of a flow cell are amplified by bridge PCR to generate clusters, which are then sequenced via cyclic rounds of single-base extension with a mixture of fluorescently labelled dNTPs whose incorporation is detected using a high-sensitivity camera. The Illumina range of sequencers includes, in order of throughput, the MiSeq, NextSeq, and HiSeq series. Generally, the NextSeq or the HiSeq are preferred to the MiSeq for shotgun metagenomics, although there are several examples of the MiSeq also being used for this approach [20–22].
The Ion Torrent PGM from Life Technologies is another frequently utilised sequencer in microbiology, particularly for whole genome sequencing analysis of microbial isolates, although it is also used for shotgun metagenomics. In contrast, the higher-throughput Ion Proton, also from Life Technologies, is comparatively overlooked for metagenomic sequencing, whereas it is widely used for exome sequencing analysis of higher organisms [25–27]. Ion sequencing chemistry is based on semiconductor sequencing, wherein adaptor-ligated DNA fragments attached to the surface of beads are amplified using emulsion PCR. Subsequently, these beads are placed inside microwells on a semiconductor sequencing chip, where a sequencing-by-synthesis reaction occurs which is similar to the Illumina method, except that base incorporation is determined by the measurement of pH changes caused by the escape of hydrogen ions during DNA extension.
Numerous studies have compared the performances of the Illumina MiSeq and the Ion Torrent PGM to determine the relative accuracy of the sequencers, and it is now well established that the error rate of the Illumina platforms, less than 1%, is lower than that of their Ion counterparts, approximately 1.7%. Specifically, Ion reads contain a higher incidence of insertions/deletions, and they are susceptible to premature sequence truncation. Long homopolymer tracts are especially problematic for Ion sequencing.
Previous investigations have aimed to determine if the choice of sequencing platform significantly influences metagenomic analyses. Recently, Fouhy et al. compared the MiSeq with the PGM for 16S rRNA gene sequencing analysis and reported that compositional results differed depending on the platform used. However, when these platforms were compared with the HiSeq for shotgun metagenomic applications, it was apparent that compositional results were similar across platforms but varied depending on the species classification tools used. Although these studies focused on gut microbial populations, shotgun metagenomics also has enormous potential with respect to the analysis of low-complexity microbial communities, such as those in foods. Indeed, shotgun metagenomics has already vastly improved our knowledge of the microbiology of a number of fermented foods and has numerous potential applications relating to food quality and safety. Furthermore, it has been proposed that metagenomic analysis of fermented foods can yield insights into the nature of microbial interactions or microbial community formation in other, more complicated environments. However, the absence of a consensus with respect to the optimal sequencing platform or bioinformatic tools for shotgun metagenomic analysis of simple microbial communities could delay the more widespread application of the approach.
Here, we describe the first comparison of the performances of the short-read DNA sequencing platforms, the Illumina MiSeq, the Illumina NextSeq, and the Ion Proton, for shotgun metagenomic analysis of low-complexity food-associated microbial communities. This analysis was combined with an investigation of the impact of sequencing depth and downstream bioinformatic analysis, with a view to informing researchers, and especially food microbiologists, when designing shotgun metagenomic experiments.
Compositional analysis is influenced more by the choice of species classifier than the platform used
[Table 1. Bacterial strains whose genomic DNA was mixed in an equimolar ratio to construct the mock community DNA sample. Columns: strain, RefSeq assembly accession, GC content (%), genome size (bp). First row begins: Bifidobacterium adolescentis Reuter; remaining table contents not recovered.]
A number of species not present in the mock community DNA sample were detected as false positives (Additional file 2: Figure S2). With respect to platforms, the MiSeq and NextSeq gave the lowest and highest numbers of false positives, respectively. Of the species classifiers, MetaPhlAn2 and SLIMM gave the lowest and highest numbers of false positives, respectively. However, it is important to note that all of the false positives were detected at less than 1% relative abundance, and species assigned were closely related to actual mock community species.
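Counting false positives against a mock community of known composition amounts to a set comparison. The following is a minimal Python sketch of that idea, not the authors' script; the function name, the placeholder species names, and the abundance values are hypothetical.

```python
def false_positives(profile, expected_species):
    """Return the species reported by a classifier that are absent from the mock community.

    profile: dict mapping species name -> relative abundance (%)
    expected_species: set of species known to be in the mock community
    """
    return sorted(sp for sp, abundance in profile.items()
                  if abundance > 0 and sp not in expected_species)


# Hypothetical toy data: "Species C" is not part of the mock community.
expected = {"Species A", "Species B"}
reported = {"Species A": 60.0, "Species B": 39.5, "Species C": 0.5}

print(false_positives(reported, expected))
```

Applying the same function to the output of each classifier and platform would give directly comparable false-positive counts.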
Overall, our results indicate that MetaPhlAn2 is the most accurate method, since it provided the lowest number of false positives. Additionally, the relative abundances predicted by MetaPhlAn2 were not biased by reference genome sizes.
We averaged the results from each species classifier to generate a consensus taxonomic profile of the kefir samples (Additional file 7: Figure S4A), and subsequent MDS analysis verified that there was no significant dissimilarity between the sequencers (PERMANOVA: p = 0.912, R2 = 0.02) (Additional file 7: Figure S4B).
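The averaging step can be sketched in Python as follows. This is an illustrative sketch only: the function name and toy values are hypothetical, and treating a species that a classifier did not report as having zero abundance is an assumption, since the text does not state how missing species were handled.

```python
def consensus_profile(profiles):
    """Average relative abundances across classifiers.

    profiles: dict mapping classifier name -> {species: relative abundance (%)}
    Assumption: a species missing from a classifier's output counts as 0.
    """
    species = set()
    for profile in profiles.values():
        species.update(profile)
    n_classifiers = len(profiles)
    return {
        sp: sum(profile.get(sp, 0.0) for profile in profiles.values()) / n_classifiers
        for sp in sorted(species)
    }


# Hypothetical toy profiles from two classifiers for one kefir sample.
toy_profiles = {
    "classifier_1": {"Lactobacillus kefiranofaciens": 80.0,
                     "Leuconostoc mesenteroides": 20.0},
    "classifier_2": {"Lactobacillus kefiranofaciens": 70.0,
                     "Leuconostoc mesenteroides": 25.0,
                     "other": 5.0},
}

consensus = consensus_profile(toy_profiles)
for name, abundance in consensus.items():
    print(name, abundance)
```

Because each input profile sums to 100%, the averaged profile also sums to 100% under the zero-fill assumption.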
Bacterial strain identification was consistent across platforms
For the kefir samples, PanPhlAn was used to provide strain-level analysis of the two most dominant species, Lactobacillus kefiranofaciens and Leuconostoc mesenteroides. Analysis on the MiSeq, NextSeq, and Proton platforms all indicated that the Lactobacillus kefiranofaciens strain detected in the kefir samples was most closely related to L. kefiranofaciens GCF_001434195, but the MiSeq detected significantly fewer shared pangenome gene families than either the NextSeq (p = 0.01) or the Proton (p = 0.01). Similarly, analysis of data from all three platforms indicated that the Leuconostoc mesenteroides strain was most closely related to L. mesenteroides GCF_000447945 (Fig. 3b), but, again, the MiSeq detected significantly fewer shared pangenome gene families than either the NextSeq (p = 0.024) or the Proton (p = 0.024). It is likely that the decreased accuracy achieved with the MiSeq was due to its lower sequencing depth relative to the other two sequencers. The contribution of sequencing depth to the accuracy of strain-level analysis is investigated in the subsequent sections.
Metagenome assembly completeness varies significantly between platforms but functional profiles remain consistent
IDBA-UD was used to assemble the mock community and kefir metagenomes. The n50 number, which is a measure of metagenome assembly completeness, was significantly lower for MiSeq assemblies than for either NextSeq (p = 0.03) or Proton (p = 0.011) assemblies (Additional file 8: Figure S5). The mean n50 numbers for each platform were as follows: n50 = 3151 (MiSeq), n50 = 13,874 (NextSeq), and n50 = 9307 (Proton).
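For readers unfamiliar with the statistic, the n50 of an assembly is the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the total assembly length. A minimal Python sketch of the calculation, with a hypothetical function name and toy contig lengths, is:

```python
def n50(contig_lengths):
    """Return the N50 of an assembly given a list of contig lengths (bp)."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        # The first contig at which the cumulative length reaches half
        # of the assembly defines the N50.
        if running >= half_total:
            return length
    raise ValueError("no contigs supplied")


# Toy example with hypothetical contig lengths.
toy_contigs = [2, 2, 2, 3, 3, 4, 8, 8]
print(n50(toy_contigs))
```

A higher value means that more of the assembly sits in long contigs, which is why the statistic rises with sequencing depth.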
Metagenomic pathway analysis tools provide inconsistent results
The results from SUPER-FOCUS were compared to those from HUMAnN2, which is an alternative tool for functional analysis of metagenomes. MDS analysis revealed that there was a significant dissimilarity between the two tools (PERMANOVA: p = 0.808, R2 = 0.057) (Additional file 9: Figure S6), based on the relative abundances of 865 level-4 enzyme commission (EC) categories which were detected by both programs. Indeed, 749 EC categories were differentially abundant between the methods (Additional file 10: Table S4).
Sequencing depth does not significantly affect composition or functional potential of low-complexity food microbiomes
Reads from the mock community and kefir samples were randomly subsampled to assess the effects of sequencing depth on compositional and functional analysis. MiSeq reads were subsampled from 100,000 to 1,000,000 reads per sample, while NextSeq and Proton reads were subsampled from 100,000 to 7,500,000 reads per sample.
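Random subsampling to a fixed number of reads can be illustrated with reservoir sampling, which draws a uniform sample in a single pass over a read file of unknown length. The Python sketch below is an illustration of the principle and is not the seqtk implementation used in this study; the function name, the seed value, and the toy read identifiers are assumptions.

```python
import random


def reservoir_sample(records, k, seed=11):
    """Draw k records uniformly at random from an iterable in one pass.

    A fixed seed makes the draw repeatable; paired-end files would need the
    same seed for both mates so that pairs stay together.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Toy example: 1000 hypothetical read identifiers subsampled to 100.
reads = [f"read_{n}" for n in range(1000)]
subset = reservoir_sample(reads, 100, seed=11)
print(len(subset))
```

Repeating the draw with different seeds gives the kind of replicate subsamples needed to assess how reproducible results are at a given depth.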
SUPER-FOCUS analysis of subsampled kefir reads again revealed that the functional profiles were highly similar at the different sequencing depths. Indeed, MDS analysis indicated that data points did not cluster by the number of reads per sample (Fig. 6c), but instead, we identified six distinct clusters, representing each of the six kefir samples. However, we did identify 15 differentially abundant level 2 subsystems at different sequencing depths, but these functions were all present at < 0.01% relative abundance (Additional file 13: Figure S8).
The reproducibility of random subsampling improves with increased sequencing depth
The reproducibility of sequence subsampling was assessed by randomly subsampling each kefir sample 10 times at 100,000 reads, 250,000 reads, and 500,000 reads. The subsampled reads were analysed using MetaPhlAn2 and SUPER-FOCUS. For MetaPhlAn2, MDS showed that replicates clustered together at each sequencing depth (Additional file 14: Figure S9A). However, the average distance from replicates to their respective centroids significantly decreased with increased sequencing depth for each sequencer (Additional file 14: Figure S9B). Additionally, at 500,000 reads, the distance to the centroid was significantly lower for the MiSeq than for either the NextSeq or the Proton (Additional file 14: Figure S9C). Similarly, for SUPER-FOCUS, MDS showed that replicates clustered together at each sequencing depth (Additional file 15: Figure S10A). However, again, the distance to the centroid significantly decreased with increased sequencing depth for each sequencer (Additional file 15: Figure S10B). Furthermore, at all sequencing depths, the distance to the centroid was lower for the MiSeq than for either the NextSeq or the Proton, and it was also lower for the NextSeq than for the Proton (Additional file 15: Figure S10C). Overall, our results indicate that random subsampling is consistent but reproducibility does improve with sequencing depth. The MiSeq gave the most consistent results, which is perhaps because it produces longer read lengths than the other two platforms.
Currently, there is no consensus as to which next-generation sequencing platforms are most suitable for shotgun metagenomics of low-complexity microbial communities, such as those in foods. Optimised determination of food microbiota is of considerable relevance to ensuring the safety, quality, and health-promoting attributes of foods. Here, we use a variety of bioinformatic tools to benchmark the performances of three high-throughput platforms for shotgun metagenomics of food microbial communities: the Illumina MiSeq, the Illumina NextSeq, and the Ion Proton. Our results highlight a remarkable similarity in the results generated with each platform in terms of compositional, functional, and strain-level analysis. In contrast, several issues with the outputs from species classifiers were identified. Notably, the results of MetaPhlAn2 analysis differed from those of the other species classifiers. We expect that this is because MetaPhlAn2 is based on alignments with species-specific marker gene sequences, whereas the other methods, which can be categorised as taxonomic binning tools, are based on alignments with whole genome sequences. In fact, we noted that the relative abundances of mock community species, as predicted by all of the species classifiers apart from MetaPhlAn2, correlated with the size of their respective reference genomes. Thus, our results confirm previous observations that these species classifiers are biased by the size of the reference genome, in the same way that 16S rRNA gene sequencing is biased by the number of 16S rRNA genes per genome. It is important to be aware of this issue when reporting species abundances. A potential solution to the problem is to normalise relative abundances by genome size. Indeed, this solution has already been suggested elsewhere [38, 39], and we found that normalisation resulted in a more even species distribution. However, this solution is limited by the assumption that intraspecific strains share the same genome sizes, when, in fact, genome sizes often vary within a species. We noted some additional discrepancies between the species classifiers. Specifically, Corynebacterium casei was overlooked within the mock community by CLARK or Kraken, even though the species was present in their respective databases. Compositional analysis of the mock community also produced numerous probable false positive species classifications, especially in the case of SLIMM, but most of the false positives were closely related to the actual mock community species and they were present at less than 1% relative abundance. Overall, our results indicated that none of the classifiers are entirely accurate, but we suggest that MetaPhlAn2, and perhaps Kaiju, are the most suitable for compositional analysis of low-complexity communities, especially foods, since both tools identified all of the mock community species and they can additionally detect eukaryotic organisms.
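The genome-size normalisation discussed above can be sketched in Python. The sketch assumes that each classifier's relative abundance is proportional to the fraction of reads assigned to a species, so that dividing by reference genome size and rescaling to 100% approximates the relative number of genome copies. The function name, species labels, and genome sizes are hypothetical, and a single genome size per species is assumed, which is the limitation noted in the text.

```python
def normalise_by_genome_size(read_based_abundance, genome_sizes):
    """Rescale read-based relative abundances (%) by reference genome size (bp)."""
    per_bp = {
        species: abundance / genome_sizes[species]
        for species, abundance in read_based_abundance.items()
    }
    total = sum(per_bp.values())
    return {species: 100.0 * value / total for species, value in per_bp.items()}


# Toy example: two species mixed at equal genome copy number. The species with
# the 4 Mb genome contributes twice as many reads as the one with the 2 Mb genome.
raw = {"Species A": 100.0 / 3, "Species B": 200.0 / 3}
sizes = {"Species A": 2_000_000, "Species B": 4_000_000}

normalised = normalise_by_genome_size(raw, sizes)
print(normalised)
```

In this toy case the skewed read-based profile returns to an even distribution, mirroring the more even mock community profile reported after normalisation.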
Compositional analysis of kefir showed that the choice of sequencing platform did not noticeably affect the results. However, dissimilarity analysis again highlighted marked differences between the outputs generated by the species classifiers. Thus, for compositional analysis, the choice of sequencing platform had less of an influence on results than the choice of species classifier. These observations are consistent with the findings from a previous sequencing platform comparison study, where the authors demonstrated that gut metagenome samples clustered by species classifier. Such results highlight a need for consistency in bioinformatics methodologies across studies, but the issue is confounded by the increasing availability of different species classifiers. The recently developed method MetaMeta, which integrates the results from multiple species classifiers to mitigate the flaws from each individual tool, might partially address this problem. We did not use MetaMeta here because the default program employs a different combination of species classifiers to that used in our study. Instead, we averaged the predicted taxonomic profiles from each species classifier for every sample, as an alternative solution, and subsequent analysis confirmed that there was no significant dissimilarity between the sequencers. Another possible option for compositional analysis, which we did not explore here, is to use a de novo metagenome assembly approach, wherein genomes are binned using tools like CONCOCT or MetaBAT, and reads are then mapped against these bins to calculate species abundances. An advantage of such an approach is that it does not rely on a reference database for diversity analysis and it may also be able to estimate the abundances of potentially novel genomes. However, sequence alignment against a reference database is still necessary to assign taxonomy to the bins, and, additionally, the approach requires a considerably higher sequencing depth than short-read alignment-based methods.
Another important aspect of shotgun metagenomics is its ability to characterise the functional potential of metagenomes. Again, the results of functional analysis were generally consistent between all three sequencing platforms, but SUPER-FOCUS did detect significant differences in three functions which were present at greater than 1% relative abundance within the kefir metagenome. Such discrepancies suggest that some caution is warranted when comparing the abundances of individual functions across sequencers, even though the overall functional profiles were consistent.
Above, we described a considerable difference in the compositional profiles determined by different species classifiers. Hence, we also compared results from SUPER-FOCUS with those from HUMAnN2, which is an alternative tool for functional analysis of metagenomes. We observed a similarly pronounced disparity in the results generated by these methods. Specifically, 865 level-4 enzyme commission (EC) categories were detected with both tools, but 749 of these EC categories were differentially abundant between them. Our observation is not unexpected since these pipelines use inherently different approaches, but it does further emphasise that results obtained using different methods cannot be directly compared.
Next, we compared the results of strain-level analysis using PanPhlAn, and we found that all three sequencers correctly identified the analysed strains from the mock community sample. Similarly, the three platforms each indicated that the L. kefiranofaciens and L. mesenteroides strains detected in the kefir samples were most closely related to L. kefiranofaciens GCF_001434195 and L. mesenteroides GCF_000447945, respectively. PanPhlAn was significantly less accurate when utilising data generated by the MiSeq compared to either NextSeq or Proton data, suggesting that sequencing depth affected strain-level analysis. We subsequently confirmed this by randomly subsampling kefir sequencing reads, which demonstrated that PanPhlAn failed to detect L. kefiranofaciens GCF_001434195 or L. mesenteroides GCF_000447945 in a subset of kefir samples below 500,000 reads per sample using any sequencer. Similarly, and as expected, we observed that sequencing depth significantly improved metagenome assembly completeness. On the other hand, sequencing depth did not have a noticeable effect on compositional or functional analysis of the mock community or kefir, regardless of the choice of sequencer. Indeed, the results of these analyses were almost uniform at sequencing depths ranging from 100,000 reads per sample to 7,500,000 reads per sample, regardless of the choice of species classifier. It is important to note, however, that increased sequencing depth caused a slight, but significant, improvement in the reproducibility of random subsampling, which suggests that higher coverage offers more reproducible results.
Overall, our findings confirm that the Proton is on par with Illumina sequencers in terms of accuracy, but only a handful of studies have used the Proton for shotgun metagenomics to date [44, 45], even if it is widely used for human exome sequencing. On the basis of these investigations, the Proton is a viable option for metagenomic analyses.
To date, most high-throughput sequencing-based studies of microbial communities of food have relied upon 16S rRNA gene sequencing. Shotgun metagenomics can, in general, offer higher taxonomic resolution than amplicon sequencing, although the latter approach may be superior for studying poorly microbiologically characterised environments that contain few species for which there are available reference genomes. Shotgun metagenomics can also be used for the direct functional characterisation of metagenomes. Several recent studies have demonstrated the enormous potential for shotgun metagenomic analysis of foods, and indeed, we have previously used the approach to identify the cause of a pink discoloration defect in Swiss-type cheeses, link microbial species with distinct flavours during kefir fermentation, and identify pathogenic strains in nunu. However, the higher cost of shotgun metagenomics is considered prohibitive for commercial application of the technology by the food industry and, consequently, the approach has been relatively underutilised. This is partially due to a perception that shotgun metagenomics requires considerable sequencing depth per sample. Notably, our results suggest that this is not necessarily true for the low-complexity microbial communities present in foods and suggest that 750,000 to 1,000,000 reads per sample is sufficient for compositional and/or functional analysis of such simple communities.
In conclusion, analysis of low-diversity metagenomic DNA representative of food microbial communities highlighted that outputs were consistent across a variety of sequencing platforms at different sequencing depths, but there were clear disparities between the outputs of bioinformatic tools. Thus, the choice of sequencer for shotgun metagenomics can be dictated by logistical factors, like platform availability, budget, or sample size, rather than sequencing chemistry. It is hoped that this work will guide researchers, particularly food microbiologists, in designing shotgun metagenomic experiments to exploit the extensive possibilities offered by the approach.
Sources of metagenomic DNA
Metagenomic DNA representative of a low-complexity, food-based, microbial community was generated by mixing equimolar ratios of genomic DNA from 13 food-related bacteria (Table 1). Strains were selected on the basis of the availability of corresponding complete or near-complete genome sequences from RefSeq. Genomic DNA was sourced from ATCC, DSM, and LMG. Genomic DNA concentration was determined prior to pooling using the Qubit High Sensitivity DNA assay (BioSciences, Dublin, Ireland). We also analysed metagenomic DNA from six kefir milk samples which were previously isolated by Walsh et al. Briefly, the samples were produced using either the Ick grain (samples i24hd4, i24hd5, i24hd6) or the UK3 grain (samples u24hd4, u24hd5, u24hd6). Three separate kefir fermentations were done using each grain. Fermented kefir samples were collected after 24 h fermentation.
Illumina libraries were prepared using the Nextera XT kit in accordance with the Nextera XT DNA Library Preparation Guide from Illumina. MiSeq libraries were sequenced on the Illumina MiSeq sequencing platform in the Teagasc sequencing facility, using a 2 × 300 cycle v3 kit, following standard Illumina sequencing protocols. NextSeq libraries were sequenced on the Illumina NextSeq 500, with a NextSeq 500/550 High Output Reagent Kit v2 (300 cycles), in accordance with the standard Illumina sequencing protocols. Proton libraries were prepared in accordance with the Ion Xpress Plus gDNA Fragment Library Preparation User Guide. Proton libraries were enriched using the ION Proton PI template OT2 200 Kit v3, and sequenced using the Ion PI Sequencing 200 Kit v3, in accordance with the standard Ion protocols.
Raw shotgun metagenomic fastq files were converted to bam files using SAMtools, and duplicate reads were subsequently removed using Picard Tools (https://github.com/broadinstitute/picard). Next, low-quality reads were removed using SAMtools in combination with Picard Tools. Illumina reads were filtered to 200 bp, and reads with a quality score less than Q30 were discarded. Ion Proton reads were filtered to 110 bp, and reads with a quality score less than Q20 were discarded. Processed bam files were converted to fastq files using the fastq-dump option from the NCBI SRA Toolkit (https://github.com/ncbi/sratoolkit), which were then converted to fasta files using the fq2fa option from IDBA-UD. Reads were randomly subsampled using seqtk (https://github.com/lh3/seqtk).
Compositional analysis was performed using the following species classifiers: CLARK, Kaiju, Kraken, MetaPhlAn2, and SLIMM. Species detected below 0.1% relative abundance were categorised as “other” for each classifier. Note that Bowtie 2 was used to map reads against the slimmDB_5k database. Strain-level metagenomic analysis was performed using PanPhlAn, which aligns reads against a pangenome database to functionally characterise strains. See Additional file 16 for a detailed description of the settings used for each species classifier and/or PanPhlAn. Functional analysis was performed with SUPER-FOCUS, using the aligner DIAMOND, and HUMAnN2, using the --bypass-translated-search option. Briefly, SUPER-FOCUS measures the abundances of subsystems, or groups of proteins with shared functionality, by aligning sequencing reads against a reduced SEED database, whereas HUMAnN2 measures the abundances of UniRef clusters by aligning sequences against the ChocoPhlAn database. HUMAnN2 gene families were mapped to level-4 enzyme commission (EC) categories using HUMAnN2 utility mapping files. Metagenome assembly was performed using IDBA-UD.
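The 0.1% cut-off applied to each classifier's output can be sketched in Python as follows. This is an illustrative sketch rather than the scripts in Additional file 16; the function name, default arguments, and toy profile are hypothetical.

```python
def collapse_rare_species(profile, cutoff=0.1, label="other"):
    """Group species below `cutoff` % relative abundance into a single category."""
    kept = {sp: ab for sp, ab in profile.items() if ab >= cutoff}
    rare_total = sum(ab for ab in profile.values() if ab < cutoff)
    if rare_total > 0:
        # Add to any existing "other" category instead of overwriting it.
        kept[label] = kept.get(label, 0.0) + rare_total
    return kept


# Toy profile (hypothetical values, % relative abundance).
toy_profile = {"Species A": 99.85, "Species B": 0.1,
               "Species C": 0.03, "Species D": 0.02}

collapsed = collapse_rare_species(toy_profile)
print(collapsed)
```

Collapsing rather than discarding the rare species keeps each profile summing to 100%, so profiles from different classifiers stay comparable.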
Sequence data have been deposited in the European Nucleotide Archive (ENA) under the project accession number PRJEB22610.
Statistical analysis was performed in R-3.2.2. The vegan package (version 2.3.0) was used for alpha diversity analysis, as well as Bray-Curtis-based multidimensional scaling (MDS) analysis. The adonis function in vegan was used for permutational multivariate analysis of variance (PERMANOVA), and the betadisper function, also in vegan, was used to calculate the distance of points from the centroid. The Kruskal-Wallis test was used to identify significant differences, and the resultant p values were adjusted using the Benjamini-Hochberg method. The Hmisc package (version 3.16.0) was used for correlation analysis. The ggplot2 package (version 2.2.1) was used for data visualisation.
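To make the dissimilarity statistics more concrete, the Python sketch below computes Bray-Curtis dissimilarities and a one-factor PERMANOVA using the standard distance-based pseudo-F statistic with label permutation. It illustrates the principle only and is not the vegan adonis implementation used in this study; all function names, the seed, and the toy abundance vectors are hypothetical.

```python
import random


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two abundance vectors."""
    numerator = sum(abs(a - b) for a, b in zip(u, v))
    denominator = sum(a + b for a, b in zip(u, v))
    return numerator / denominator if denominator else 0.0


def distance_matrix(samples):
    n = len(samples)
    return [[bray_curtis(samples[i], samples[j]) for j in range(n)] for i in range(n)]


def pseudo_f(dist, groups):
    """Return (pseudo-F, R2) for one grouping factor from a distance matrix."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for label in labels:
        idx = [i for i in range(n) if groups[i] == label]
        ss_within += sum(dist[i][j] ** 2
                         for pos, i in enumerate(idx) for j in idx[pos + 1:]) / len(idx)
    ss_between = ss_total - ss_within
    f_stat = (ss_between / (len(labels) - 1)) / (ss_within / (n - len(labels)))
    return f_stat, ss_between / ss_total


def permanova(dist, groups, permutations=999, seed=1):
    """Permutation p value: share of label shuffles with pseudo-F >= observed."""
    rng = random.Random(seed)
    f_obs, r2 = pseudo_f(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dist, shuffled)[0] >= f_obs:
            hits += 1
    return f_obs, r2, (hits + 1) / (permutations + 1)


# Toy data: three hypothetical profiles from each of two classifiers.
toy_samples = [[80, 20, 0], [78, 22, 0], [82, 17, 1],
               [20, 70, 10], [22, 68, 10], [18, 72, 10]]
toy_groups = ["tool_1"] * 3 + ["tool_2"] * 3

toy_dist = distance_matrix(toy_samples)
f_stat, r2, p_value = permanova(toy_dist, toy_groups)
print(f_stat, r2, p_value)
```

With only six samples the number of distinct label arrangements is small, so the attainable p value has a floor; real analyses with more samples are not limited in this way.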
It is important to note that the mock community DNA sample was only sequenced once on each platform, and thus, we were unable to assess technical variation across sequencing runs. However, previous studies have already demonstrated that such variation is small, accounting for 1.3 to 2.3% variation between KEGG functional profiles. Additionally, we chose 0.1% relative abundance as an arbitrary cut-off to compare species or pathways, whereas, in reality, potentially important taxa or functions may be present below this threshold.
We would like to thank Paul Cormican for his assistance in installing the bioinformatic tools used in this study.
This research was funded by Science Foundation Ireland in the form of a centre grant (APC Microbiome Institute grant number SFI/12/RC/2273). Research in the Cotter laboratory is also funded by Science Foundation Ireland through the PI award “Obesibiotics” (11/PI/1137). Orla O’Sullivan is funded by Science Foundation Ireland through a Starting Investigator Research Grant award (13/SIRG/2160).
Availability of data and materials
Raw shotgun metagenomic data can be retrieved from the European Nucleotide Archive under the project accession number PRJEB22610. Detailed information on the bioinformatic scripts used here can be found in the Additional file 16.
AMW prepared the samples for sequencing, performed the bioinformatic/statistical analysis, generated the figures, and drafted the manuscript. FC and LF assisted in the library preparation and performed the DNA sequencing. OOS contributed to the study design. FC, MJC, and PDC contributed to the study design and helped to coordinate and edit the manuscript. All authors read and approved the final manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.