Recovering complete and draft population genomes from metagenome datasets
© Sangwan et al. 2016
Received: 29 June 2015
Accepted: 5 February 2016
Published: 8 March 2016
Assembly of metagenomic sequence data into microbial genomes is of fundamental value to improving our understanding of microbial ecology and metabolism by elucidating the functional potential of hard-to-culture microorganisms. Here, we provide a synthesis of available methods to bin metagenomic contigs into species-level groups and highlight how genetic diversity, sequencing depth, and coverage influence binning success. Despite their computational cost when applied to deeply sequenced complex metagenomes (e.g., soil), covarying patterns of contig coverage across multiple datasets significantly improve the binning process. We also discuss and compare current genome validation methods and reveal how these methods tackle the problem of chimeric genome bins (i.e., sequences from multiple species). Finally, we explore how population genome assembly can be used to uncover biogeographic trends and to characterize the effect of in situ functional constraints on genome-wide evolution.
Keywords: Metagenomics, Genotype, Assembly, Binning, Curation
Metagenomic recovery of complete or draft bacterial and archaeal genomes provides a route to analyze the “taxon-specific” potential of organisms within their community and ecosystem context. This allows insights into the ecological adaptation, trophic interactions, and metabolic versatility of uncultured and eco-genetically adapted organisms (Fig. 1; [6, 12–16]). A genome can be defined as the total gene content of a single cell, whereas a population genome or genotype is defined as the total gene content of a group of closely related organisms. Genetic variability can be extensive in many bacterial species, which creates barriers to the recovery of strain-specific genotypes from complex microbial communities. This is because genome recovery (reassembly) methods are often based on clustering genetic sequences by nucleotide composition (oligonucleotide frequency-based correlation) and subsequent alignment-free visualization of metagenome contigs. Therefore, within a population of extremely closely related strains, it is difficult to segregate the gene content of each genotype.
In this review, we discuss the existing theoretical frameworks and methodologies for reconstructing genomes from metagenomic data sets, how these methods are limited by the availability of computational resources, and provide a series of recommendations on best strategies for metagenome assembly (analysis of sequence coverage and assembly errors) and binning.
Assembling contigs from short read metagenomic data
Assembling community genomics data, especially the co-assembly of multiple samples, is a complex task. This is in part due to computational memory constraints but mainly a result of biological complexity, including genetic diversity and mobile genetic elements. Long stretches of near-identical metagenomic sequences are especially hard to assemble with the short reads from next generation platforms because such sequences could be from multiple sources: repetitive DNA of a single genome, homologous regions of closely related strains, or conserved regions of different species that coexist in the community. Failure to resolve these regions often results in rearrangement errors and chimeric assemblies [18–20].
Recent developments in assembly algorithms [21–25] and related methods [26, 27] have led to significant improvements in the accuracy and efficiency of sequence assembly. These successes are measured by more contiguous pieces (greater N50 scores), increased numbers of predicted genes, fewer break points and rearrangements, and error limits close to the expected sequencing substitution error rate. Importantly, with memory use being such a bottleneck for analysis, many of these new programs have focused on reducing memory requirements (e.g., Minia) or on parallel algorithms for distributed-memory machines (e.g., HiPMer). While metagenomic assembly is improving, metrics such as N50, designed for single-genome assembly, can be misleading; instead, combining N50 with other metrics, such as rigorous statistical metrics (e.g., ALE), fragment coverage distribution, total assembly size, and number of predicted non-redundant genes, may provide an improved measure of assembly success.
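As a point of reference for the metrics mentioned above, the following minimal Python sketch computes N50 and reports it together with contig count and total assembly size. The function names are illustrative and are not taken from any assembler or evaluation tool.

```python
def n50(contig_lengths):
    """Return the N50: the length L such that contigs of length >= L
    together contain at least half of all assembled bases."""
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0  # empty assembly


def assembly_summary(contig_lengths):
    """Report N50 next to contig count and total size, because N50 alone
    can be misleading for metagenome assemblies."""
    return {
        "n_contigs": len(contig_lengths),
        "total_bp": sum(contig_lengths),
        "n50": n50(contig_lengths),
    }
```

For example, `assembly_summary([2, 2, 2, 3, 3, 4, 8, 8])` reports an N50 of 8 for a 32-bp toy assembly, because the two longest contigs already hold half of the bases.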
In 2012, Namiki et al. outlined the most important limitation of single-genome assembly programs, namely, the inability of these algorithms to cluster sequence reads with diverse origins and heterogeneous coverage. Focusing on the “always increasing” property of the de Bruijn graph construction method, and using the k-mer frequency patterns of the input dataset, the authors presented a strategy to decompose the de Bruijn graph of multiple species into subgraphs, each representing a cluster of reads from an individual species. However, since multiple species can have similar coverage patterns, these individual subgraphs can represent population genotype bins. A similar framework was implemented in various other metagenomic assemblers. In 2012, Pell and colleagues demonstrated the use of Bloom filters as data structures for storing sparse sets as de Bruijn graphs (the predominant assembly method); these filters lower the memory requirement by 40-fold. Recently, Scholz and colleagues presented a new method, metagenomic assembly by merging (MeGAMerge), to generate an improved metagenome assembly by merging the contigs generated from multiple assemblies. Using an overlap-layout-consensus (OLC)-based assembler (for example, Minimus-2), contig bins assembled across different platforms (e.g., Velvet, SOAPdenovo2, and Ray), assembly parameters (e.g., k-mer length), and sequencing technologies (e.g., Illumina and pyrosequencing) were merged into a composite assembly. Similarly, Deng et al. highlighted the sequential use of de Bruijn graph and OLC assemblers to increase the percentage recovery of targeted genomes. Individual metagenome assemblies were generated from quality-trimmed metagenome reads (complete and partitioned datasets) using de Bruijn graph-based assemblers such as SOAPdenovo2 and ABySS. Finally, multiple assembly outputs generated across complete and partitioned sequence datasets were merged using the OLC assembler CAP3.
Optimized sequential use of different assembly platforms has demonstrated the potential to improve contig and scaffold lengths.
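To illustrate the Bloom filter idea described above, here is a minimal Python sketch of a k-mer Bloom filter that stores a de Bruijn graph implicitly: edges are never stored, and neighbours are found by querying the four possible next k-mers. The class name, bit-array size, number of hash functions, and the use of salted SHA-256 are assumptions made for illustration; this is not the implementation used by Pell and colleagues.

```python
import hashlib


class KmerBloomFilter:
    """Fixed-size bit array answering 'probably present' or 'definitely absent'."""

    def __init__(self, n_bits=1 << 20, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray((n_bits + 7) // 8)

    def _positions(self, kmer):
        # Derive n_hashes bit positions by salting a single hash function.
        for seed in range(self.n_hashes):
            digest = hashlib.sha256(f"{seed}:{kmer}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.n_bits

    def add(self, kmer):
        for pos in self._positions(kmer):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, kmer):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(kmer))


def kmers(seq, k):
    """Yield every overlapping k-mer of a sequence."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def successors(bloom, kmer):
    """Implicit de Bruijn graph edges: test the four possible next k-mers."""
    return [kmer[1:] + base for base in "ACGT" if kmer[1:] + base in bloom]
```

The memory saving comes from storing only a few bits per k-mer rather than the k-mer itself; the price is a small false-positive rate, which appears as spurious edges in the graph.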
Determining the most effective strategy requires knowledge of how parameters such as genetic diversity, k-mer length, and sequence errors influence assembly success. To quantify these impacts, several recent studies have employed simulated shotgun sequence data [18, 40, 41]. Charuvaka and Rangwala suggest that the evenness of abundance of members of the community had a significant influence on accuracy, whereby the lower the evenness (greater dominance) the greater the accuracy; also, as expected, high intra-strain level diversity significantly reduced accuracy. They also demonstrated that clustering of contigs generated from assembly across different k-mer lengths created longer but less accurate contigs. Finally, while sequencing errors did not influence the annotation of gene function, they played a significant role in reducing the assembly accuracy. However, to determine an effective strategy, one must also consider the metagenomic coverage, which is the fraction of total community diversity captured in the dataset.
Recently, Rodriguez and Konstantinidis highlighted methods for estimating metagenome coverage using real microbial metagenomic data. Accurate coverage estimates are important in comparative studies as they inform the statistical tests required for interpretation of results and are directly related to assembly quality. Rarefaction analysis is the primary qualitative method used to estimate metagenome coverage, but it is suboptimal for metagenomic coverage analysis as it is reliant on deep sequence coverage of a metagenome, high-quality assemblies, and representative reference data sets, which limits its use for complex natural communities with low sequencing depth and for species with no reference genome [45, 46]. Nonpareil addresses these problems by using singleton genes to calculate average metagenome coverage. Specifically, ungapped alignment between terminal regions of sequence reads is used to calculate the redundancy (the portion of the total reads in the dataset that shows overlap with at least one other read) values for a subset of a complete dataset. Using a binomial distribution approach, individual read abundance (the number of matches with other reads in the dataset) is processed to compute a saturation function of redundancy. Finally, the saturation function is summarized to calculate the average coverage. Nonpareil allows accurate estimation of the sequencing effort required to achieve a fixed average coverage, which can be used as a quality metric for the expected metagenome assembly.
Binning: grouping assembled contigs into taxonomic bins
Metagenomic binning has two major components: (i) clustering and (ii) data representation. Clustering involves grouping contigs, scaffolds, or genes based on their genetic characteristics, including oligonucleotide frequency or coverage, using a combination of different approaches, such as hierarchical clustering and neural networks. These clusters are then grouped with various data representation approaches into individual taxonomic bins.
Table 1. Key methodological features of three main metagenome binning approaches
Nucleotide composition (NC)
Oligonucleotide frequency matrix and %G+C-based screening.
HCL, correlation-based network graph and emergent self-organization maps (ESOM).
(i) More efficient for the genomes with skewed nucleotide composition patterns.
(ii) Less efficient in differentiating between closely related genotypes.
(iii) Depends on the visualization and manual inspection of bins and is therefore not suitable for very large assemblies representing complex environments.
(i) Individual metagenome assemblies or samples where populations do not change over time can be used.
(i) R packages: qgraph (8), igraph, pvclust
(iv) 2T-binning (http://hmp.ucalgary.ca/HMP/metagenomes/data/SCADC/454/Binning/2TBinning/)
Nucleotide composition and abundance (NCA)
A composite distance matrix from oligonucleotide frequency matrix and coverage.
K-medoids clustering, Gaussian mixture models, and the expectation-maximization algorithm.
(ii), (iv) Require multiple samples for better performance and are therefore associated with cost, time, and computational resources.
(i), (ii) Improved contig binning compared with the NC method.
Differential abundance (DA)
Differential coverage patterns across multiple samples where population changed in abundance over time.
Profile based correlation cut-off.
(iv) Must have multiple samples in which populations changed in abundance over time and is therefore associated with cost, computational time, and resources.
(ii), (iii) Strain level resolution can be achieved.
The majority of the community genomics surveys published in recent years [3, 6, 12, 13, 16, 47, 48] have used NC, mostly oligonucleotide frequency and %G+C. Mackelprang et al. used a hierarchical agglomerative clustering method to process the tetranucleotide frequency matrix and cluster metagenome contigs into genome bins, while Iverson et al. used a graph-based approach for assisting individual genome reassembly. In this latter study, a network graph was constructed where nodes (individual contigs and/or scaffolds) are connected by edges representing tetranucleotide Z-statistic correlation. Outliers were excluded from the graph using an empirically determined distance cutoff (Pearson’s correlation coefficient (PCC) >0.9). Connected nodes (scaffolds) of these graphs were later checked manually for coverage and %G+C profiles. Open-source software packages (Table 1) are available, including qgraph and igraph (https://cran.r-project.org/web/packages/igraph/index.html), to perform such clustering and network-based graph construction and visualization. However, this pairwise analysis is computationally expensive for large datasets. NC techniques have mostly been applied to communities with genotypes that possess distinct nucleotide composition patterns, such as a low %G+C and consistent oligonucleotide frequency. It is likely, though not proven, that this technique in isolation will struggle with communities that exhibit high oligonucleotide compositional variance.
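The graph-based NC approach described above can be sketched in a few lines of Python. This is a simplified illustration, not the procedure of Iverson et al.: it correlates raw relative tetranucleotide frequencies rather than Z-statistics, and the function names and toy sequences are assumptions. Only the PCC cutoff of 0.9 comes from the text.

```python
from itertools import product
from math import sqrt

TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {t: i for i, t in enumerate(TETRAMERS)}


def tetra_profile(seq):
    """Relative frequency of each of the 256 tetranucleotides in one contig."""
    counts = [0] * 256
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])  # windows with N or other symbols are skipped
        if idx is not None:
            counts[idx] += 1
    total = sum(counts) or 1
    return [c / total for c in counts]


def pearson(x, y):
    """Pearson's correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def bin_by_composition(contigs, cutoff=0.9):
    """Link contigs whose profiles correlate above the cutoff and return the
    connected components of that graph as candidate bins."""
    names = list(contigs)
    profiles = {n: tetra_profile(contigs[n]) for n in names}
    neighbours = {n: set() for n in names}
    for i, a in enumerate(names):
        for b in names[i + 1:]:  # pairwise comparison: quadratic in contig count
            if pearson(profiles[a], profiles[b]) > cutoff:
                neighbours[a].add(b)
                neighbours[b].add(a)
    bins, seen = [], set()
    for start in names:
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(neighbours[node] - component)
        seen |= component
        bins.append(sorted(component))
    return bins
```

The nested loop over contig pairs makes the quadratic cost of this pairwise analysis explicit, which is why the approach becomes expensive for large assemblies.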
In 2013, Sharon et al. demonstrated the DA approach on time series data to reconstruct six complete and two near complete bacterial genomes; these taxa had relative abundances as low as 0.05 % of the total community. The raw sequence data were assembled (de novo) using a de Bruijn graph approach, the contigs were binned according to the k-mer coverage, and the bins with greatest abundance were selected for the individual assembly. For each iteration, assembly parameters were optimized according to the selected coverage profile. Finally, the reads that mapped over the assembly were removed from the original set, and the remaining data was again binned according to the k-mer abundance to determine coverage. Size-selected (>3 kb) scaffolds were clustered into the bins using emergent self-organizing maps (ESOM) with a normalized time-series abundance profile.
A similar DA approach was used to recover high-quality population genomes from environmental samples processed using two different DNA extraction methods, which resulted in the creation of two community gene pools with different population relative abundance profiles. Size-selected scaffolds from the larger metagenome dataset were binned using coverage information, but then, tetranucleotide frequencies were employed for clustering and visualization (in a permutation on the NCA approach described below). Individual reads mapped over refined genome bins were extracted and reassembled independently. Paired-end information was further employed to identify multiple-copy genes, including rRNA operons. The authors also provided Perl scripts to facilitate the assembly visualization, including the reference-free assembly validation statistics.
Nielsen et al. used a DA method called Canopy to reconstruct microbial and phage genomes, and plasmids, using co-abundance patterns across multiple samples. Initially, an iteratively optimized Markov clustering (MCL) algorithm and a co-abundance-based correlation distance (1 − correlation coefficient) matrix were used to cluster 2 % of the total community genetic repertoire with the strongest correlation to the human type 2 diabetes phenotype. Again, due to the pairwise analysis, this method is computationally expensive, so Nielsen and colleagues used a novel approach to overcome these computational limitations. Using a global sequence identity cutoff of 95 %, a non-redundant community gene pool was created. Normalized co-abundance patterns were calculated for each gene using paired-end read mapping. Clustering was performed by randomly picking a “seed” gene from the community gene pool and clustering genes with similar co-abundance profiles using a strict pairwise correlation cutoff. Each cluster then represents a “seed canopy,” and canopies with median abundance profiles within a distance of 0.97 PCC to each other and passing the rejection criterion explained in the original paper were termed co-abundance groups (CAGs). CAGs with >700 genes were referred to as metagenome species (MGS), and reads mapped over MGSs were extracted from each sample and reassembled individually. This was referred to as MGS augmented assembly and resulted in the reconstruction of 741 genotypes including 238 microbial genomes meeting the high-quality genome standards of the Human Microbiome Project (HMP).
In early 2014, a supervised binning NCA method called metagenome binning with abundance and tetranucleotide frequency (MetaBAT) was used to reconstruct 173 highly specific genome bins from a human microbiome metagenome collection. For each pair of contigs, MetaBAT calculates two probabilities of pairwise distances using tetranucleotide frequency and abundance patterns across samples. It then integrates all pairwise probabilities into a composite distance matrix. Using information from whole genome sequencing projects in the IMG database, the authors suggest that sequencing bias can cause significant coverage variation among contigs assembled across one sequencing library. To overcome this coverage bias, a normal distribution-based approximation method was used to calculate the abundance matrix for each pair of contigs across one sample. Then, a geometric mean of all distances for all the samples was used to calculate the final abundance matrix. Finally, a modified k-medoid algorithm iteratively clusters the composite matrix into individual genome bins.
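The seed-and-recruit step described above can be sketched as follows. This is a deliberately reduced illustration under stated assumptions: the cutoff of 0.9, the function names, and the fixed random seed are hypothetical, and the sketch omits the iterative stabilization of canopy profiles, the merging of canopies within 0.97 PCC, and the rejection criterion used in the published method.

```python
import random
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def canopy_cluster(profiles, cutoff=0.9, seed=0):
    """Pick a random unassigned gene as seed, recruit every unassigned gene
    whose abundance profile correlates with it above the cutoff, and repeat
    until no genes remain."""
    rng = random.Random(seed)
    unassigned = sorted(profiles)
    canopies = []
    while unassigned:
        seed_gene = rng.choice(unassigned)
        members = [g for g in unassigned
                   if g == seed_gene
                   or pearson(profiles[seed_gene], profiles[g]) > cutoff]
        canopies.append(sorted(members))
        unassigned = [g for g in unassigned if g not in members]
    return canopies
```

Each round compares one seed against the remaining genes instead of building the full pairwise matrix, which is the source of the speed advantage over all-against-all clustering.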
Due to the theoretical superiority of NCA methods, more tools (a binning algorithm without a cool acronym (ABAWACA) (https://github.com/CK7/abawaca), clustering contigs on coverage and composition (CONCOCT), MaxBin, and GroopM) have emerged in this category to provide automated genome binning. While all these tools bear a family resemblance (e.g., some form of iterative clustering; the use of marker genes for bin delineation) to the MetaBAT algorithm described above as a representative, there are major modeling and algorithmic differences that are poorly understood. To date, our understanding of the impact of these differences comes from a small number of comparative evaluations by the method developers, and we have seen these tools give significantly different results on the same data set both in these experiments and in our own experience. Thus, we posit that the field of metagenome binning is in a similar place to where genome assembly or whole genome alignment was a few years ago, before the occurrence of comprehensive benchmarks and competitive assessment studies such as Assemblathon, Alignathon, and GAGE. An example of such comparative studies in binning (and metagenomic analysis in general) is the critical assessment of metagenome interpretation initiative (CAMI; http://www.cami-challenge.org) that is currently under way. Until we see outcomes of more external studies where unpublished, diverse, simulated, and real data sets are used for evaluating binning accuracy, it is unlikely that we will be able to conclusively recommend one tool over the others.
Key methodological features of NCA-based metagenome binning tools
Sequence composition model
Differential abundance model
Post-processing and other notable heuristics
Combined mono-, di-, and tri- nucleotide frequencies
Hierarchical clustering with iterative splitting; long scaffolds are broken into 5-kb fragments at the beginning; splitting based on a single metric that results in the best separation in each round
No separation can be made given quality score based on the extent to which the broken scaffolds are grouped correctly
Genome assessment based on marker genes and consensus taxonomic placement with reciprocal best BLAST hits; manual inspection using ggKBase; scaffold extension
Inter-assembly tetranucleotide frequency z-profiles created on 5-kb windows only in post-binning chimera detection
Abundance distance defined in terms of Pearson correlation and Spearman’s rank correlation coefficients
Canopy clustering (seed-and-recruit)
Stabilization of canopy profiles
Sample-specific augmented assemblies on two samples with most mapped reads and one with most gene containing de novo contigs
K-mer frequencies (tetranucleotide by default); uniform Dirichlet distribution prior on the relative frequencies; dimension reduction using principal component analysis to keep 90 % of joined composition and coverage variance
Combined log-transformed profile of normalized coverage and composition vectors
Gaussian mixture models; regularized expectation-maximization; cluster number determined by automatic relevance determination
Parameter convergence and maximum iteration number
Empirical variational Bayesian approach; variational approximation used to perform integral in optimizing mixing coefficients
Tetranucleotide frequencies; dimension reduction using principal component analysis to keep 80 % of compositional variance
Transformed coverage space to reduce unevenness of variability distribution
Iterative clustering in two custom steps: two-way clustering and Hough partitioning; bin refinement using self-organizing map
1:1 correspondence between bins and sub regions on the SOM surface
GC variance model for chimera detection
Tetranucleotide frequencies; Euclidean distance; empirically estimated Gaussian distributions of intra- and inter-genome distances
Expectation-maximization; cluster number estimated from single-copy genes; initial parameters inferred from the shortest marker gene
Parameter convergence and maximum iteration number
Recursive checking of all bins for median number of marker genes
Tetranucleotide frequencies; Euclidean distance; empirical posterior probability derived from different contig sizes using logistic regression
Abundance distance defined as the non-shared area of two normal distributions
Modified K-medoid clustering without the need to set the number of clusters
Progressive weighting of the relative importance of DA vs TNF based on the number of samples; optional assembly, based on CheckM assessment, of mapped reads from a single most represented sample to reduce contamination
The apparently orthogonal design considerations in the NCA tools lead us to think that the performance of these tools may depend heavily on the data and that one may achieve better results by combining multiple methods. Indeed, this is the lesson we learned in genome assembly: because there is no clear winner that suits all situations, ensemble approaches such as iMetAMOS, MeGAMerge, and GAM-NGS were developed to try multiple assemblers on the same data or improve individual results by merging them. Given the wide range of ecological diversity, sample numbers, and sequencing characteristics in metagenome data sets, we suspect that ensemble approaches will work best for genome binning as well.
There is also room for binning methods to improve in several directions. First, phylogeny information is still underexplored in automated NCA methods. Tetranucleotide frequency has been broadly adopted for its simplicity, but information theory-based studies show that relative k-mer abundance profiles may be better phylogeny signatures  and longer k-mers have higher information content . Another source of valuable phylogeny signal comes from homology relationships between genes from assembled contigs and reference genomes. Traditional supervised approaches derive contig-level taxonomic placement from the consensus of individual predicted genes based on reciprocal BLAST hits. This method can be extrapolated to uncultured, unknown genomes without a close reference sequence [65, 66]. Despite the presence of horizontal gene transfer and uneven mutation rates along different protein lineages, there is a distinguishing power in the distribution of best hits across a range of diverse reference genomes. While this kind of information is leveraged by most research groups in post-binning inspection and refinement, incorporating it into automated optimization may greatly improve binning accuracy. Second, binning results are sensitive to parameters and most automated methods have limited preset parameters. This is one of the reasons achieving high-quality bins often requires manual tweaking aided by visualization. Automated parameter search needs to be part of ensemble binning methods, and computer vision-inspired algorithms such as Hough partitioning (used in GroopM) can also help further automate the curation process. Finally, most of the research in genome binning thus far has been concentrated on accuracy rather than computational efficiency. As a result, many binning tools are too slow or too memory intensive to handle large metagenomic data sets. Recent studies have changed the expectation for the number of bins from tens to hundreds [14, 65, 67]. 
As sequencing gets deeper, scalable tools such as MetaBAT and Canopy that are several orders of magnitude faster than other tools  will be appreciated.
Curation and validation of reconstructed population genomes
Currently proposed methods for the validation of reassembled genomes rely on the same theoretical framework used for detecting misassembled regions and percentage completeness across individual genome assemblies. These include paired-end read mapping-based identification of misassembled regions (i.e., structural variations including deletions and insertions), alignment-based comparison with complete genomes of closely related reference organisms, and marker gene copy number variation analysis. However, using paired-end mapping on sequencing libraries with a multimodal insert size distribution can increase the error rate, so that the number of false positive or negative events significantly increases. Meanwhile, alignment against reference genomes is fundamentally limited by the availability of already-sequenced genomes that are closely related to the organism of interest.
Two additional methods have been proposed to deal with the problem of chimeric genome bins (sequences from multiple species) observed in the metagenome assemblies . First is identifying contigs with skewed coverage patterns; using peak detection methods, coverage subsets with more than one peak are selected and removed. Second is analyzing the nucleotide composition consistency in contigs with tetranucleotide usage patterns; a median tetranucleotide frequency z-profile can be calculated for each contig, and using an empirically determined cutoff for the Pearson’s correlation coefficient distance to this median profile, it is possible to cluster contigs into high-quality population genome bins.
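The second check described above, comparing each contig with the median composition profile of its bin, can be sketched as follows. The sketch assumes that a composition profile (for example, a tetranucleotide frequency z-profile) has already been computed for every contig; the function names and the cutoff of 0.9 are hypothetical, since the text only states that the cutoff is determined empirically.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson's correlation coefficient; 0.0 if either vector is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx and vy else 0.0


def flag_composition_outliers(profiles, min_pcc=0.9):
    """Return the contigs of one bin whose profile correlates poorly with
    the per-feature median profile of that bin."""
    names = list(profiles)
    n_features = len(profiles[names[0]])
    median_profile = [median(profiles[n][j] for n in names)
                      for j in range(n_features)]
    return sorted(n for n in names
                  if pearson(profiles[n], median_profile) < min_pcc)
```

Using the median rather than the mean keeps a small number of foreign contigs from shifting the reference profile toward themselves.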
Individual genotype fragmentation into two different bins can occur due to population level repeats, genome coverage, or sequencing %G+C bias. To assess fragmentation, it is essential to accurately quantify the genome bin completeness. The presence of single-copy genes, which mostly encode central metabolism processes (replication, translation, and transcription) or conserved core genes, is the primary target for assessing completeness. A set of 31 single-copy genes has been proposed for bacteria. This was extended to the domain Archaea, and using reciprocal BLAST-based homology searches on 112,064 proteins from 50 representative archaeal genomes, 104 universally present, single-copy genes were identified. Finally, a list of 101 hidden Markov models (HMM) from the Pfam and TIGRFAM databases has been produced that shows similarity to only one gene when compared against complete bacterial genomes (95 %; [72, 73]).
Recently, Parks et al. presented CheckM, a new method for estimating the completeness and contamination across population genomes. Using marker genes that are specific to a genome-based lineage within a reference tree, CheckM provides better estimates of genome completeness and contamination compared to the universal single-copy marker genes. Similarly, BUSCO uses lineage-specific orthologs to estimate the completeness of the draft or complete genome. However, using orthologous groups with single-copy orthologs in >90 % of species (n = 40), BUSCO provides robust estimations across lineages with rare gene duplications and evolutionary loss of conserved genes, as is frequently the case of population genomes. Overall, the probability of a universally single-copy ortholog being present in a single-copy genome is higher than a conserved marker gene, and thus, we advocate the use of CheckM.
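A simple way to turn single-copy marker genes into completeness and contamination estimates is sketched below. It assumes that marker detection (for example, by HMM search) has already been done and that each detected copy is reported by name; the function name and the four example markers are illustrative and do not correspond to any of the published marker sets.

```python
from collections import Counter


def marker_completeness(marker_hits, marker_set):
    """Estimate completeness and contamination (both in percent) of one bin.

    marker_hits: iterable of marker names found in the bin, one entry per copy.
    marker_set:  the expected single-copy markers.
    """
    copies = Counter(h for h in marker_hits if h in marker_set)
    n_expected = len(marker_set)
    # Completeness: share of expected markers seen at least once.
    completeness = 100.0 * len(copies) / n_expected
    # Contamination: additional copies of markers expected only once.
    extra_copies = sum(c - 1 for c in copies.values())
    contamination = 100.0 * extra_copies / n_expected
    return completeness, contamination
```

With the illustrative set `{"rpoB", "recA", "gyrB", "dnaK"}` and hits `["rpoB", "recA", "recA", "gyrB"]`, the bin is estimated to be 75 % complete with 25 % contamination, since one marker is missing and one is duplicated.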
Using reconstructed population genomes to advance microbial ecology
Reconstructed population genomes can reveal how environmental factors shape niche-specific adaptations between individual taxa. In addition, they can also reveal the effect of in situ functional constraints on the evolution of microbial consortia. Comparative genomics of reconstructed genomes and their reference genomes employs analytical methods that are well understood and have been extensively reviewed . However, another framework for analysis relies on variation in codon bias to determine the genome-wide influence of in situ functional constraints on individual taxa. Since percentage codon bias variation analysis is a phylogenetically independent method that directly reflects the strength of selection and the translation efficiency of expressed genes [77, 78], it circumvents the need for reference genomes and can reveal the influence of in situ functional constraints over natural selection patterns. It is important to note that for complete genome sequences, codon use patterns are influenced by nucleotide composition (mutational biases) and horizontal gene transfer. However, because each gene in a reassembled genome represents the population with an even nucleotide composition, one can assume that these clonal isolate-based limitations will not skew codon usage.
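As a concrete illustration of how codon usage can be summarized for one gene, the sketch below computes relative synonymous codon usage (RSCU) under the standard genetic code. RSCU is a commonly used, simpler measure that is swapped in here for illustration; it is not the percentage codon bias variation analysis referred to above, and the function name is an assumption.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def rscu(cds):
    """Relative synonymous codon usage for one in-frame coding sequence.

    A value of 1.0 means the codon is used as often as expected if all
    synonymous codons for that amino acid were used equally."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # drop a trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    counts = Counter(c for c in codons
                     if c in CODON_TABLE and CODON_TABLE[c] != "*")
    by_aa = {}
    for codon, aa in CODON_TABLE.items():
        if aa != "*":
            by_aa.setdefault(aa, []).append(codon)
    values = {}
    for aa, synonyms in by_aa.items():
        total = sum(counts[c] for c in synonyms)
        if total == 0:
            continue  # amino acid absent from this sequence
        for c in synonyms:
            values[c] = counts[c] * len(synonyms) / total
    return values
```

Comparing such per-gene values across orthologous genes of two populations is one simple way to look for differences in codon usage between them.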
Recently, a near complete genome sequence (contigs = 200; 1.9 Mb) of a Staphylococcus genotype, Staphylococcus SK5, was reconstructed from a sample sequenced from the floor of a public restroom. Whole genome-based average nucleotide identity (ANI) analysis revealed that SK5 shared strain level nucleotide identity (~99 %) with its ecotype from human skin, Staphylococcus lugdunensis N920143. This suggested that this organism was dispersed from a human source and had potentially been selected for on the restroom floor. Furthermore, using pairwise codon bias variation analysis across orthologous regions of both these ecotypes (N920143 and SK5), both genomes were observed to be under different environmental selection, suggesting that functional constraints dominated. Similar evidence for considerable dispersal and environmental selection was observed in a sediment metagenome, from which a complete genome sequence (2.3 Mb) of a previously uncultured taxon, Candidatus Sulfuricurvum sp. RIFRC-1, was recovered. Whole genome-based ANI analysis revealed that RIFRC-1 shares 75 % genome-wide identity with an ecotype from the oil fields of Japan, Sulfuricurvum kujiense. Interestingly, comparing the codon bias variation across the orthologous segments of these ecotypes, it was clear that both populations were under similar and strong functional constraints. Using these approaches, it is possible to infer the mode of environmental selection for given taxa in specific ecosystems and hypothesize about the potential effect of in situ functional constraints on the mutation pressure, natural selection, and genetic drift [77, 78, 80, 81].
Recovery of novel genomes from metagenomic datasets provides components to better parameterize systems biology efforts, by increasing the availability of information on taxonomically resolved, novel metabolic potential. Also, using metagenome contigs binned at species level, phylogenetically independent analysis can be used to accurately estimate the strength of selection and translation efficiency of expressed genes assembled across ecotypes. The computational challenges that limit metagenomic-derived genome reconstructions are slowly being rolled back, and with the decreasing cost of high throughput sequencing, it will soon be possible to perform integrated analysis of inter- and/or intraspecies community dynamics with transcriptomic, proteomic, and metabolomic data from the same samples. Perhaps the most important implication of this integrative multi-omic analysis will be the potential to build predictive models that can further identify specific metabolic exchanges between species.
ABAWACA: a binning algorithm without a cool acronym
CONCOCT: clustering contigs on coverage and composition
NCA: nucleotide composition and abundance
This work was supported by the US Dept. of Energy under Contract DE-AC02-06CH11357.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Ramette A, Tiedje JM. Biogeography: an emerging cornerstone for understanding prokaryotic diversity, ecology, and evolution. Microb Ecol. 2007;53:197–207.
- Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43.
- Hua Z-S, Han Y-J, Chen L-X, Liu J, Hu M, Li S-J, et al. Ecological roles of dominant and rare prokaryotes in acid mine drainage revealed by metagenomics and metatranscriptomics. ISME J. 2015;9(6):1280–94.
- Sharon I, Morowitz MJ, Thomas BC, Costello EK, Relman DA, Banfield JF. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 2013;23:111–20.
- Hess M, Sczyrba A, Egan R, Kim T-W, Chokhawala H, Schroth G, et al. Metagenomic discovery of biomass-degrading genes and genomes from cow rumen. Science. 2011;331:463–7.
- Iverson V, Morris RM, Frazar CD, Berthiaume CT, Morales RL, Armbrust EV. Untangling genomes from metagenomes: revealing an uncultured class of marine Euryarchaeota. Science. 2012;335:587–90.
- Kantor RS, Wrighton KC, Handley KM, Sharon I, Hug LA, Castelle CJ, et al. Small genomes and sparse metabolisms of sediment-associated bacteria from four candidate phyla. mBio. 2013;4:e00708–00713.
- Brown CT. Strain recovery from metagenomes. Nat Biotechnol. 2015;33:1041–3.
- Morowitz MJ, Denef VJ, Costello EK, Thomas BC, Poroyko V, Relman DA, et al. Strain-resolved community genomic analysis of gut microbial colonization in a premature infant. Proc Natl Acad Sci U S A. 2011;108:1128–33.
- Ofek-Lalzar M, Sela N, Goldman-Voronov M, Green SJ, Hadar Y, Minz D. Niche and host-associated functional signatures of the root surface microbiome. Nat Commun. 2014;5:4950.
- Gilbert JA, Dupont CL. Microbial metagenomics: beyond the genome. Ann Rev Mar Sci. 2011;3:347–71.
- Handley KM, Bartels D, O’Loughlin EJ, Williams KH, Trimble WL, Skinner K, et al. The complete genome sequence for putative H2- and S-oxidizer candidatus Sulfuricurvum sp., assembled de novo from an aquifer-derived metagenome. Environ Microbiol. 2014;16:3443–62.
- Mackelprang R, Waldrop MP, DeAngelis KM, David MM, Chavarria KL, Blazewicz SJ, et al. Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature. 2011;480:368–71.
- Wrighton KC, Thomas BC, Sharon I, Miller CS, Castelle CJ, VerBerkmoes NC, et al. Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science. 2012;337:1661–5.
- Castelle CJ, Hug LA, Wrighton KC, Thomas BC, Williams KH, Wu D, et al. Extraordinary phylogenetic diversity and metabolic versatility in aquifer sediment. Nat Commun. 2013;4:2120.
- Sangwan N, Lambert C, Sharma A, Gupta V, Khurana P, Khurana JP, et al. Arsenic rich Himalayan hot spring metagenomics reveal genetically novel predator-prey genotypes. Environ Microbiol Rep. 2015;7(6):812–23.
- Eppinger M, Daugherty S, Agrawal S, Galens K, Sengamalay N, Sadzewicz L, et al. Whole-genome draft sequences of 26 enterohemorrhagic Escherichia coli O157:H7 strains. Genome Announc. 2013;1:e0013412.
- Mende DR, Waller AS, Sunagawa S, Järvelin AI, Chan MM, Arumugam M, et al. Assessment of metagenomic assembly using simulated next generation sequencing data. PLoS One. 2012;7:e31386.
- Treangen TJ, Salzberg SL. Repetitive DNA and next-generation sequencing: computational challenges and solutions. Nat Rev Genet. 2012;13(1):36–46.
- Vázquez-Castellanos JF, García-López R, Pérez-Brocal V, Pignatelli M, Moya A. Comparison of different assembly and annotation tools on analysis of simulated viral metagenomic communities in the gut. BMC Genomics. 2014;15(1):1.
- Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40:e155.
- Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19:455–77.
- Boisvert S, Raymond F, Godzaridis E, Laviolette F, Corbeil J. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol. 2012;13:R122.
- Peng Y, Leung HCM, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28:1420–8.
- Li D, Liu C-M, Luo R, Sadakane K, Lam T-W. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics. 2015;31(10):1674–6.
- Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT. Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proc Natl Acad Sci U S A. 2012;109:13272–7.
- Scholz M, Lo C-C, Chain PSG. Improved assemblies using a source-agnostic pipeline for metagenomic assembly by merging (MeGAMerge) of contigs. Sci Rep. 2014;4:6480.
- Salikhov K, Sacomoto G, Kucherov G. Using cascading Bloom filters to improve the memory usage for de Brujin graphs. Algorithms Mol Biol. 2014;9:2.
- Georganas E, Buluç A, Chapman J, Hofmeyr S, Aluru C, Egan R, et al. HipMer: an extreme-scale de novo genome assembler. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. New York: ACM; 2015. p. 14:1–14:11 [SC’15].
- Clark SC, Egan R, Frazier PI, Wang Z. ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics. 2013;29:435–43.
- Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD. REAPR: a universal tool for genome assembly evaluation. Genome Biol. 2013;14:R47.
- Luo C, Tsementzi D, Kyrpides NC, Konstantinidis KT. Individual genome assembly from complex community short-read metagenomic datasets. ISME J. 2012;6:898–901.
- Sommer DD, Delcher AL, Salzberg SL, Pop M. Minimus: a fast, lightweight genome assembler. BMC Bioinformatics. 2007;8:64.
- Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–9.
- Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, et al. SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1:18.
- Boisvert S, Laviolette F, Corbeil J. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol. 2010;17:1519–33.
- Deng X, Naccache SN, Ng T, Federman S, Li L, Chiu CY, et al. An ensemble strategy that significantly improves de novo assembly of microbial genomes from metagenomic next-generation sequencing data. Nucleic Acids Res. 2015;43(7):e46.
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol İ. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19:1117–23.
- Huang X, Madan A. CAP3: a DNA sequence assembly program. Genome Res. 1999;9:868–77.
- Pignatelli M, Moya A. Evaluating the fidelity of de novo short read metagenomic assembly using simulated data. PLoS One. 2011;6:e19984.
- Charuvaka A, Rangwala H. Evaluation of short read metagenomic assembly. BMC Genomics. 2011;12 Suppl 2:S8.
- Rodriguez-R LM, Konstantinidis KT. Nonpareil: a redundancy-based approach to assess the level of coverage in metagenomic datasets. Bioinformatics. 2014;30:629–35.
- Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4:495–500.
- Wendl MC. A general coverage theory for shotgun DNA sequencing. J Comput Biol. 2006;13:1177–96.
- Wu D, Hugenholtz P, Mavromatis K, Pukall R, Dalin E, Ivanova NN, et al. A phylogeny-driven genomic encyclopaedia of Bacteria and Archaea. Nature. 2009;462:1056–60.
- Rodriguez-R LM, Konstantinidis KT. Estimating coverage in metagenomic data sets and why it matters. ISME J. 2014;8:2349–51.
- Ghai R, Mizuno CM, Picazo A, Camacho A, Rodriguez-Valera F. Key roles for freshwater Actinobacteria revealed by deep metagenomic sequencing. Mol Ecol. 2014;23:6073–90.
- Gibbons SM, Schwartz T, Fouquier J, Mitchell M, Sangwan N, Gilbert JA, et al. Ecological succession and viability of human-associated microbiota on restroom surfaces. Appl Environ Microbiol. 2015;81:765–73.
- Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW, Nielsen PH. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat Biotechnol. 2013;31:533–8.
- Vezzi F, Narzisi G, Mishra B. Reevaluating assembly evaluations with feature response curves: GAGE and Assemblathons. PLoS One. 2012;7:e52210.
- Nielsen HB, Almeida M, Juncker AS, Rasmussen S, Li J, Sunagawa S, et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol. 2014;32:822–8.
- Qin J, Li Y, Cai Z, Li S, Zhu J, Zhang F, et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature. 2012;490:55–60.
- Chain PSG, Grafham DV, Fulton RS, Fitzgerald MG, Hostetler J, Muzny D, et al. Genomics. Genome project standards in a new era of sequencing. Science. 2009;326:236–7.
- Kang DD, Froula J, Egan R, Wang Z. MetaBAT, an efficient tool for accurately reconstructing single genomes from complex microbial communities. PeerJ. 2015;3:e1165.
- Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11:1144–6.
- Wu Y-W, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics. 2016;32(4):605–7.
- Imelfort M, Parks D, Woodcroft BJ, Dennis P, Hugenholtz P, Tyson GW. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ. 2014;2:e603.
- Bradnam KR, Fass JN, Alexandrov A, Baranay P, Bechner M, Birol I, et al. Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience. 2013;2:10.
- Earl D, Nguyen N, Hickey G, Harris RS, Fitzgerald S, Beal K, et al. Alignathon: a competitive assessment of whole-genome alignment methods. Genome Res. 2014;24:2077–89.
- Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, et al. GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012;22:557–67.
- Koren S, Treangen TJ, Hill CM, Pop M, Phillippy AM. Automated ensemble assembly and validation of microbial genomes. BMC Bioinformatics. 2014;15:126.
- Vicedomini R, Vezzi F, Scalabrin S, Arvestad L, Policriti A. GAM-NGS: genomic assemblies merger for next generation sequencing. BMC Bioinformatics. 2013;14 Suppl 7:S6.
- Nalbantoglu OU, Way SF, Hinrichs SH, Sayood K. RAIphy: phylogenetic classification of metagenomics samples using iterative refinement of relative abundance index profiles. BMC Bioinformatics. 2011;12:41.
- Akhter S, Bailey BA, Salamon P, Aziz RK, Edwards RA. Applying Shannon’s information theory to bacterial and phage genomes and metagenomes. Sci Rep. 2013;3:1033.
- Brown CT, Hug LA, Thomas BC, Sharon I, Castelle CJ, Singh A, et al. Unusual biology across a group comprising more than 15 % of domain Bacteria. Nature. 2015;523:208–11.
- Di Rienzi SC, Sharon I, Wrighton KC, Koren O, Hug LA, Thomas BC, et al. The human gut and groundwater harbor non-photosynthetic bacteria belonging to a new candidate phylum sibling to Cyanobacteria. eLife. 2013;2:e01102.
- Baker BJ, Lazar CS, Teske AP, Dick GJ. Genomic resolution of linkages in carbon, nitrogen, and sulfur cycling among widespread estuary sediment bacteria. Microbiome. 2015;3(1):14.
- Wu M, Eisen JA. A simple, fast, and accurate method of phylogenomic inference. Genome Biol. 2008;9:R151.
- Wu M, Scott AJ. Phylogenomic analysis of bacterial and archaeal sequences with AMPHORA2. Bioinformatics. 2012;28(7):1033–4.
- Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, et al. The Pfam protein families database. Nucleic Acids Res. 2012;40(Database issue):D290–301.
- Haft DH, Selengut JD, White O. The TIGRFAMs database of protein families. Nucleic Acids Res. 2003;31:371–3.
- Dupont CL, Rusch DB, Yooseph S, Lombardo M-J, Alexander Richter R, Valas R, et al. Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage. ISME J. 2012;6:1186–99.
- Davidsen T, Beck E, Ganapathy A, Montgomery R, Zafar N, Yang Q, et al. The comprehensive microbial resource. Nucleic Acids Res. 2010;38(Database issue):D340–345.
- Parks DH, Imelfort M, Skennerton CT, Hugenholtz P, Tyson GW. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25(7):1043–55.
- Simão FA, Waterhouse RM, Ioannidis P, Kriventseva EV, Zdobnov EM. BUSCO: assessing genome assembly and annotation completeness with single-copy orthologs. Bioinformatics. 2015;31(19):3210–2.
- Edwards DJ, Holt KE. Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data. Microb Inform Exp. 2013;3:2.
- Roller M, Lucić V, Nagy I, Perica T, Vlahovicek K. Environmental shaping of codon usage and functional adaptation across microbial communities. Nucleic Acids Res. 2013;41:8842–52.
- Botzman M, Margalit H. Variation in global codon usage bias among prokaryotic organisms is associated with their lifestyles. Genome Biol. 2011;12:R109.
- Heilbronner S, Holden MTG, van Tonder A, Geoghegan JA, Foster TJ, Parkhill J, et al. Genome sequence of Staphylococcus lugdunensis N920143 allows identification of putative colonization and virulence factors. FEMS Microbiol Lett. 2011;322:60–7.
- Kodama Y, Watanabe K. Sulfuricurvum kujiense gen. nov., sp. nov., a facultatively anaerobic, chemolithoautotrophic, sulfur-oxidizing bacterium isolated from an underground crude-oil storage cavity. Int J Syst Evol Microbiol. 2004;54(Pt 6):2297–300.
- Zhang Z, Li J, Cui P, Ding F, Li A, Townsend JP, et al. Codon deviation coefficient: a novel measure for estimating codon usage bias and its statistical significance. BMC Bioinformatics. 2012;13:43.
- Suzuki R, Shimodaira H. Pvclust: an R package for assessing the uncertainty in hierarchical clustering. Bioinformatics. 2006;22:1540–2.
- Dick GJ, Andersson AF, Baker BJ, Simmons SL, Thomas BC, Yelton AP, et al. Community-wide analysis of microbial genome sequence signatures. Genome Biol. 2009;10:R85.
- Ultsch A, Mörchen F. ESOM-Maps: tools for clustering, visualization, and classification with emergent SOM. Germany: Data Bionics Research Group, University of Marburg; 2005.
- Saeed I, Tang S-L, Halgamuge SK. Unsupervised discovery of microbial population structure within metagenomes using nucleotide base composition. Nucleic Acids Res. 2011;40(5):e34.
- Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, Loman NJ, Andersson AF, Quince C. CONCOCT: clustering contigs on coverage and composition. Nat Methods. 2014;11(11):1144–6.
- Wu Y-W, Tang Y-H, Tringe SG, Simmons BA, Singer SW. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014;2:26.