Table 1 Key methodological features of three main metagenome binning approaches

From: Recovering complete and draft population genomes from metagenome datasets

Columns: Method; Starting point; Clustering methods; Negatives; Positives; Computational resources

Method: Nucleotide composition (NC)

Starting point: Oligonucleotide frequency matrix and %G+C-based screening.

Clustering methods: Hierarchical clustering (HCL), correlation-based network graphs, and emergent self-organizing maps (ESOM).

Negatives:
(i) More efficient for genomes with skewed nucleotide composition patterns.
(ii) Less efficient in differentiating between closely related genotypes.
(iii) Depends on visualization and manual inspection of bins, and is therefore not suitable for very large assemblies representing complex environments.

Positives:
(i) Individual metagenome assemblies, or samples where populations do not change over time, can be used.

Computational resources:
(i) R packages: qgraph (8), igraph, pvclust [82]
(ii) tetramerFreqs [83] (https://github.com/tetramerFreqs/Binning)
(iii) Databionic ESOM tools [84] (http://databionic-esom.sourceforge.net/)
(iv) 2T-binning [85] (http://hmp.ucalgary.ca/HMP/metagenomes/data/SCADC/454/Binning/2TBinning/)

Method: Nucleotide composition and abundance (NCA)

Starting point: A composite distance matrix built from the oligonucleotide frequency matrix and coverage.

Clustering methods: K-medoids clustering, Gaussian mixture models, and the expectation-maximization algorithm.

Negatives:
(ii), (iv) Require multiple samples for better performance, and are therefore associated with added cost, time, and computational resources.

Positives:
(i), (ii) Improved contig binning compared with the NC method.

Computational resources:
(i) MetaBAT [54] (https://bitbucket.org/berkeleylab/metabat)
(ii) CONCOCT [86] (https://github.com/BinPro/CONCOCT)
(iii) MaxBin [87] (http://downloads.jbei.org/data/microbial_communities/MaxBin/MaxBin.html)
(iv) GroopM [57] (https://github.com/minillinim/GroopM)
(v) Databionic ESOM tools [84] (http://databionic-esom.sourceforge.net/)

Method: Differential abundance (DA)

Starting point: Differential coverage patterns across multiple samples in which populations changed in abundance over time.

Clustering methods: Profile-based correlation cut-off.

Negatives:
(iv) Requires multiple samples in which populations changed in abundance over time, and is therefore associated with added cost, computational time, and resources.

Positives:
(ii), (iii) Strain-level resolution can be achieved.

Computational resources:
(i) Multi-metagenome [49] (https://github.com/MadsAlbertsen/multi-metagenome)
(ii) MGS Canopy algorithm [51] (https://github.com/fplaza/mgs-canopy-algorithm)
(iii) Databionic ESOM tools [84] (http://databionic-esom.sourceforge.net/)
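
The three starting points in Table 1 can be illustrated with a minimal Python sketch. It is not the implementation of any tool listed above. The function names, the equal default weighting in `composite_distance`, the 0.9 correlation cut-off, the greedy seed-based grouping, and the example coverage values are all assumptions chosen for illustration.

```python
from itertools import product
from math import sqrt


def tetramer_frequencies(seq, k=4):
    """NC starting point: normalized k-mer frequencies for one contig."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:  # skip windows containing N or other ambiguity codes
            counts[word] += 1
    total = sum(counts.values())
    return [counts[w] / total if total else 0.0 for w in kmers]


def pearson(x, y):
    """Pearson correlation of two equal-length profiles (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0
    return cov / (sd_x * sd_y)


def composite_distance(freq_a, freq_b, cov_a, cov_b, weight=0.5):
    """NCA starting point (hypothetical weighting): composition plus coverage distance."""
    composition = sqrt(sum((a - b) ** 2 for a, b in zip(freq_a, freq_b)))
    coverage = 1.0 - pearson(cov_a, cov_b)
    return weight * composition + (1.0 - weight) * coverage


def bin_by_coverage_correlation(profiles, cutoff=0.9):
    """DA clustering (greedy sketch): a contig joins the first bin whose seed
    coverage profile correlates with it at or above the cut-off."""
    bins = []  # list of (seed contig name, member names)
    for name, profile in profiles.items():
        for seed, members in bins:
            if pearson(profiles[seed], profile) >= cutoff:
                members.append(name)
                break
        else:
            bins.append((name, [name]))
    return [members for _, members in bins]


# Made-up coverage of four contigs across four samples (time points).
example_coverage = {
    "contig_1": [10, 20, 40, 80],
    "contig_2": [11, 19, 42, 79],
    "contig_3": [80, 40, 20, 10],
    "contig_4": [75, 42, 18, 12],
}
example_bins = bin_by_coverage_correlation(example_coverage)
```

In this toy data, contigs whose coverage rises together across samples end up in one bin and those whose coverage falls together in another. This is why the DA approach needs multiple samples in which population abundance actually changes.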