Table 2 Key methodological features of NCDA-based metagenome binning tools

From: Recovering complete and draft population genomes from metagenome datasets

| Binning software | Sequence composition model | Differential abundance model | Clustering algorithm | Stopping criteria | Post-processing and other notable heuristics |
| --- | --- | --- | --- | --- | --- |
| ABAWACA | Combined mono-, di-, and trinucleotide frequencies | | Hierarchical clustering with iterative splitting; long scaffolds are broken into 5-kb fragments at the beginning; each round splits on the single metric that gives the best separation | No further separation can be made, given a quality score based on the extent to which the broken scaffolds are grouped correctly | Genome assessment based on marker genes and consensus taxonomic placement with reciprocal best BLAST hits; manual inspection using ggKBase; scaffold extension |
| Canopy [a] | Inter-assembly tetranucleotide frequency z-profiles created on 5-kb windows; used only in post-binning chimera detection | Abundance distance defined in terms of Pearson correlation and Spearman's rank correlation coefficients | Canopy clustering (seed-and-recruit) | Stabilization of canopy profiles | Sample-specific augmented assemblies on the two samples with the most mapped reads and the one with the most gene-containing de novo contigs |
| CONCOCT | K-mer frequencies (tetranucleotide by default); uniform Dirichlet distribution prior on the relative frequencies; dimension reduction using principal component analysis to keep 90% of joined composition and coverage variance | Combined log-transformed profile of normalized coverage and composition vectors | Gaussian mixture models; regularized expectation-maximization; cluster number determined by automatic relevance determination | Parameter convergence and maximum iteration number | Empirical variational Bayesian approach; variational approximation used to perform integral in optimizing mixing coefficients |
| GroopM | Tetranucleotide frequencies; dimension reduction using principal component analysis to keep 80% of compositional variance | Transformed coverage space to reduce unevenness of variability distribution | Iterative clustering in two custom steps: two-way clustering and Hough partitioning; bin refinement using self-organizing map | 1:1 correspondence between bins and subregions on the SOM surface | GC variance model for chimera detection |
| MaxBin | Tetranucleotide frequencies; Euclidean distance; empirically estimated Gaussian distributions of intra- and inter-genome distances | Poisson distribution | Expectation-maximization; cluster number estimated from single-copy genes; initial parameters inferred from the shortest marker gene | Parameter convergence and maximum iteration number | Recursive checking of all bins for median number of marker genes |
| MetaBAT | Tetranucleotide frequencies; Euclidean distance; empirical posterior probability derived from different contig sizes using logistic regression | Abundance distance defined as the non-shared area of two normal distributions | Modified K-medoid clustering without the need to set the number of clusters | Medoid convergence | Progressive weighting of the relative importance of DA vs TNF based on the number of samples; optional assembly, based on CheckM assessment, of mapped reads from a single most represented sample to reduce contamination |

[a] We have also included the DA method Canopy because it uses sequence composition in post-binning refinement.
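
The composition model listed most often in the table, tetranucleotide frequencies compared by Euclidean distance (as given for MaxBin and MetaBAT), can be illustrated with a minimal Python sketch. This is not the implementation of any tool in the table: the helper names `tnf_vector` and `euclidean` and the toy contig sequences are hypothetical, the code counts overlapping 4-mers on one strand only, and it leaves out the probabilistic models the tools build on top of the raw distance.

```python
from itertools import product
from math import sqrt

# All 256 possible tetranucleotides, in a fixed order.
KMERS = ["".join(p) for p in product("ACGT", repeat=4)]
INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def tnf_vector(seq):
    """Relative frequencies of overlapping 4-mers in seq (hypothetical helper)."""
    counts = [0] * len(KMERS)
    seq = seq.upper()
    for i in range(len(seq) - 3):
        idx = INDEX.get(seq[i:i + 4])
        if idx is not None:  # skip windows containing N or other symbols
            counts[idx] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * len(KMERS)
    return [c / total for c in counts]


def euclidean(u, v):
    """Euclidean distance between two equal-length frequency vectors."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy contigs (assumed data): a and b share a repeat unit, c does not.
contig_a = "ATGCGC" * 50
contig_b = "ATGCGC" * 40
contig_c = "TTAAAT" * 50

d_ab = euclidean(tnf_vector(contig_a), tnf_vector(contig_b))
d_ac = euclidean(tnf_vector(contig_a), tnf_vector(contig_c))
print(f"d(a, b) = {d_ab:.4f}, d(a, c) = {d_ac:.4f}")
```

In this toy setting, contigs built from the same repeat unit end up much closer in tetranucleotide space than contigs built from a different one, which is the signal that composition-based binning relies on.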