Recovering complete and draft population genomes from metagenome datasets

Sangwan, Naseer; Xia, Fangfang; Gilbert, Jack A.

doi:10.1186/s40168-016-0154-5

Microbiome

Table 2 Key methodological features of NCA-based metagenome binning tools

Binning software	Sequence composition model	Differential abundance model	Clustering algorithm	Stopping criteria	Post-processing and other notable heuristics
ABAWACA	Combined mono-, di-, and trinucleotide frequencies		Hierarchical clustering with iterative splitting; long scaffolds are broken into 5-kb fragments at the beginning; each round splits on the single metric that gives the best separation	No further separation can be made, as judged by a quality score based on the extent to which the fragments of broken scaffolds are grouped correctly	Genome assessment based on marker genes and consensus taxonomic placement with reciprocal best BLAST hits; manual inspection using ggKBase; scaffold extension
Canopy^a	Inter-assembly tetranucleotide frequency z-profiles created on 5-kb windows only in post-binning chimera detection	Abundance distance defined in terms of Pearson correlation and Spearman’s rank correlation coefficients	Canopy clustering (seed-and-recruit)	Stabilization of canopy profiles	Sample-specific augmented assemblies on two samples with most mapped reads and one with most gene-containing de novo contigs
CONCOCT	K-mer frequencies (tetranucleotide by default); uniform Dirichlet distribution prior on the relative frequencies; dimension reduction using principal component analysis to keep 90 % of joined composition and coverage variance	Combined log-transformed profile of normalized coverage and composition vectors	Gaussian mixture models; regularized expectation-maximization; cluster number determined by automatic relevance determination	Parameter convergence and maximum iteration number	Empirical variational Bayesian approach; variational approximation used to perform integral in optimizing mixing coefficients
GroopM	Tetranucleotide frequencies; dimension reduction using principal component analysis to keep 80 % of compositional variance	Transformed coverage space to reduce unevenness of variability distribution	Iterative clustering in two custom steps: two-way clustering and Hough partitioning; bin refinement using a self-organizing map (SOM)	1:1 correspondence between bins and subregions on the SOM surface	GC variance model for chimera detection
MaxBin	Tetranucleotide frequencies; Euclidean distance; empirically estimated Gaussian distributions of intra- and inter-genome distances	Poisson distribution	Expectation-maximization; cluster number estimated from single-copy genes; initial parameters inferred from the shortest marker gene	Parameter convergence and maximum iteration number	Recursive checking of all bins for median number of marker genes
MetaBAT	Tetranucleotide frequencies; Euclidean distance; empirical posterior probability derived from different contig sizes using logistic regression	Abundance distance defined as the non-shared area of two normal distributions	Modified K-medoid clustering without the need to set the number of clusters	Medoid convergence	Progressive weighting of the relative importance of differential abundance (DA) vs tetranucleotide frequency (TNF) based on the number of samples; optional assembly, based on CheckM assessment, of mapped reads from the single most represented sample to reduce contamination

^a We have also included the differential abundance (DA) method Canopy because it uses sequence composition in post-binning refinement
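
Two quantities recur across the table: tetranucleotide frequencies compared by Euclidean distance (the composition model listed for MaxBin and MetaBAT) and a correlation-based abundance distance (listed for Canopy). The Python sketch below illustrates both in their simplest form. It is not the implementation of any tool in the table. The function names, the lack of strand merging, the use of 1 minus the Pearson coefficient as a distance, and the handling of constant profiles are all assumptions made for illustration.

```python
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides, in a fixed order (illustrative layout).
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tetranucleotide_frequencies(seq):
    """Return the 256 relative tetranucleotide frequencies of seq.

    Windows containing characters other than A, C, G, T are skipped.
    Assumption: strands are not merged (no reverse-complement pooling).
    """
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        j = INDEX.get(seq[i:i + 4])
        if j is not None:
            counts[j] += 1
    total = sum(counts)
    return [c / total for c in counts] if total else [0.0] * len(KMERS)


def euclidean_distance(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pearson_distance(x, y):
    """Return 1 - Pearson correlation of two per-sample coverage profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        # Assumption: a constant profile is treated as uncorrelated.
        return 1.0
    return 1.0 - cov / (sx * sy)
```

Under these assumptions, two contigs from the same genome would be expected to have a small `euclidean_distance` between their frequency vectors and a `pearson_distance` near 0 across samples. The tools in the table layer further modelling on top of such raw distances, for example the empirical distance distributions and logistic regression noted for MaxBin and MetaBAT.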

ISSN: 2049-2618