From: Recovering complete and draft population genomes from metagenome datasets
Binning software | Sequence composition model | Differential abundance model | Clustering algorithm | Stopping criteria | Post-processing and other notable heuristics |
---|---|---|---|---|---|
ABAWACA | Combined mono-, di-, and trinucleotide frequencies | | Hierarchical clustering with iterative splitting; long scaffolds are broken into 5-kb fragments at the beginning; each round splits on the single metric that gives the best separation | No further separation can be made, as judged by a quality score based on the extent to which the broken scaffolds are grouped correctly | Genome assessment based on marker genes and consensus taxonomic placement with reciprocal best BLAST hits; manual inspection using ggKBase; scaffold extension |
Canopy | Inter-assembly tetranucleotide frequency z-profiles created on 5-kb windows only in post-binning chimera detection | Abundance distance defined in terms of Pearson correlation and Spearman’s rank correlation coefficients | Canopy clustering (seed-and-recruit) | Stabilization of canopy profiles | Sample-specific augmented assemblies on two samples with most mapped reads and one with most gene-containing de novo contigs |
CONCOCT | K-mer frequencies (tetranucleotide by default); uniform Dirichlet distribution prior on the relative frequencies; dimension reduction using principal component analysis to keep 90% of joined composition and coverage variance | Combined log-transformed profile of normalized coverage and composition vectors | Gaussian mixture models; regularized expectation-maximization; cluster number determined by automatic relevance determination | Parameter convergence and maximum iteration number | Empirical variational Bayesian approach; variational approximation used to perform integral in optimizing mixing coefficients |
GroopM | Tetranucleotide frequencies; dimension reduction using principal component analysis to keep 80% of compositional variance | Transformed coverage space to reduce unevenness of variability distribution | Iterative clustering in two custom steps: two-way clustering and Hough partitioning; bin refinement using self-organizing map (SOM) | 1:1 correspondence between bins and subregions on the SOM surface | GC variance model for chimera detection |
MaxBin | Tetranucleotide frequencies; Euclidean distance; empirically estimated Gaussian distributions of intra- and inter-genome distances | Poisson distribution | Expectation-maximization; cluster number estimated from single-copy genes; initial parameters inferred from the shortest marker gene | Parameter convergence and maximum iteration number | Recursive checking of all bins for median number of marker genes |
MetaBAT | Tetranucleotide frequencies; Euclidean distance; empirical posterior probability derived from different contig sizes using logistic regression | Abundance distance defined as the non-shared area of two normal distributions | Modified K-medoid clustering without the need to set the number of clusters | Medoid convergence | Progressive weighting of the relative importance of differential abundance (DA) vs tetranucleotide frequency (TNF) based on the number of samples; optional assembly, based on CheckM assessment, of mapped reads from a single most-represented sample to reduce contamination |
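
Several of the tools in the table build their sequence composition model on tetranucleotide frequencies, and MaxBin and MetaBAT list Euclidean distance as the starting point for comparing them. The following is a minimal sketch of that shared first step only, not the implementation used by any of these tools. The names `tnf` and `euclidean` are hypothetical. The sketch assumes an unambiguous A/C/G/T alphabet, skips windows containing other characters, and does not merge reverse-complementary tetranucleotides. The probability models that the tools place on top of the raw distance are not reproduced.

```python
from itertools import product
from math import sqrt

# All 256 tetranucleotides over A/C/G/T, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tnf(sequence):
    """Return the 256-element tetranucleotide frequency vector of a sequence.

    Hypothetical helper for illustration. Windows that contain anything
    other than A, C, G, or T (for example N) are skipped.
    """
    counts = [0] * len(KMERS)
    seq = sequence.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        # Sequence too short or fully ambiguous: return an all-zero profile.
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


if __name__ == "__main__":
    # Toy sequences, made up for this example.
    contig_a = "ACGTACGTTTGACCA"
    contig_b = "GGGCGCGCCCGGGCG"
    print(euclidean(tnf(contig_a), tnf(contig_b)))
```

Contigs from the same genome are expected to lie closer together in this space than contigs from different genomes, which is the signal the clustering algorithms in the table exploit, either alone or combined with differential abundance.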