VizBin - an application for reference-independent visualization and human-augmented binning of metagenomic data
© Laczny et al.; licensee BioMed Central. 2015
Received: 15 September 2014
Accepted: 18 December 2014
Published: 20 January 2015
Metagenomics is limited in its ability to link distinct microbial populations to genetic potential due to a current lack of representative isolate genome sequences. Reference-independent approaches, which exploit for example inherent genomic signatures for the clustering of metagenomic fragments (binning), offer the prospect to resolve and reconstruct population-level genomic complements without the need for prior knowledge.
We present VizBin, a Java™-based application which offers efficient and intuitive reference-independent visualization of metagenomic datasets from single samples for subsequent human-in-the-loop inspection and binning. The method is based on nonlinear dimension reduction of genomic signatures and exploits the superior pattern recognition capabilities of the human eye-brain system for cluster identification and delineation. We demonstrate the general applicability of VizBin for the analysis of metagenomic sequence data by presenting results from two cellulolytic microbial communities and one human-borne microbial consortium. The superior performance of our application compared to other analogous metagenomic visualization and binning methods is also presented.
VizBin can be applied de novo for the visualization and subsequent binning of metagenomic datasets from single samples, and it can be used for the post hoc inspection and refinement of automatically generated bins. Due to its computational efficiency, it can be run on common desktop machines and enables the analysis of complex metagenomic datasets in a matter of minutes. The software implementation is available at https://claczny.github.io/VizBin under the BSD License (four-clause) and runs under Microsoft Windows™, Apple Mac OS X™ (10.7 to 10.10), and Linux.
Keywords: Metagenomics, Machine learning, Visualization, Binning
Mixed microbial communities are ubiquitous and play fundamental roles in the Earth’s biogeochemical cycles, as well as in human health. Shotgun sequencing of extracted DNA from microbial consortia allows for culture-independent analysis of their composition and/or genetic potential. So far, metagenomics has been applied to a panoply of microbial communities of differing complexities [1-6].
In metagenomic analyses, the characterization of constituent populations is typically carried out using reference-based approaches whereby sequence fragments, e.g., filtered sequence reads or reconstructed genomic fragments (contigs), are aligned to previously characterized isolate genomes [7,8]. However, disparities between the genomes of isolate strains and natural populations, as well as the lack of a comprehensive set of representative reference genomes, result in the need for reference-independent analysis approaches.
Reference-independent deconvolution of metagenomic datasets from single samples generally relies on the use of data-inherent characteristics, e.g., oligonucleotide composition, to group metagenomic fragments into clusters (bins) comprising sequence fragments derived from distinct microbial populations. To determine oligonucleotide sequence composition, distinct k-mers are counted over sequence fragments and the counts are normalized to represent frequency distributions, resulting in vectors (genomic signatures) of fixed dimensionality, 4^k. So far, the exploration of the signature space has been hampered by the comparably high dimensionality of genomic signatures: for k=5, the vectors are embedded in a 1,024-dimensional space. To reduce the dimensionality of the data, approaches based on self-organizing maps (SOMs) have been used for the visualization and delineation of population-specific sequence clusters, e.g., emergent SOMs (ESOMs) [12,13].
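The construction of a genomic signature described above can be sketched in a few lines of Python. This is a minimal illustration, not the VizBin implementation: the function name `genomic_signature` is hypothetical, k-mers are counted on the given strand only, and windows containing ambiguous bases are simply skipped, which are all assumptions made for the example.

```python
from itertools import product


def genomic_signature(sequence, k=5):
    """Return the normalized k-mer frequency vector (length 4**k) of a DNA sequence."""
    # Fix the order of all 4**k possible k-mers so that every fragment
    # is described along the same axes.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    index = {kmer: i for i, kmer in enumerate(kmers)}

    counts = [0] * len(kmers)
    seq = sequence.upper()
    for i in range(len(seq) - k + 1):
        pos = index.get(seq[i:i + k])
        if pos is not None:  # skip windows with N or other ambiguity codes
            counts[pos] += 1

    total = sum(counts)
    if total == 0:
        # Fragment shorter than k or without any unambiguous window.
        return [0.0] * len(kmers)
    # Normalize counts to a frequency distribution.
    return [c / total for c in counts]
```

For k=5 the returned vector has 4^5 = 1,024 entries, matching the dimensionality stated above.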
We have recently demonstrated that nonlinear dimension reduction of centered log-ratio transformed genomic signatures via Barnes-Hut stochastic neighbor embedding (BH-SNE) results in improved performance in terms of decreased input sequence lengths, decreased computation time, increased homogeneity of clusters, and more intuitive interpretation compared to the more traditional ESOM-based approaches. Here, we present VizBin, a cross-platform software implementation of the method for the rapid and reliable reference-independent visualization and subsequent human-augmented binning of metagenomic datasets from single samples based on a parallelized version of BH-SNE (https://claczny.github.io/VizBin).
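The centered log-ratio transformation applied to the genomic signatures before dimension reduction can be illustrated as follows. This is a sketch under stated assumptions: the function name `centered_log_ratio` and the `pseudocount` used to avoid taking the logarithm of zero frequencies are choices made for this example and are not taken from the text. The BH-SNE embedding step itself is not shown.

```python
import math


def centered_log_ratio(frequencies, pseudocount=1e-6):
    """Centered log-ratio transform: log of each component over the geometric mean."""
    # Assumption: a small pseudocount keeps log() defined for k-mers
    # that never occur in a fragment.
    logs = [math.log(f + pseudocount) for f in frequencies]
    # The mean of the logs equals the log of the geometric mean.
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]
```

By construction the transformed components sum to zero, and a perfectly uniform signature maps to the zero vector.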
VizBin is a graphical user interface (GUI)-based desktop application for Microsoft Windows™, Apple Mac OS X™ (10.7 to 10.10), and Linux. The GUI is written in Java™ and makes use of different Java™ libraries as described in https://github.com/claczny/VizBin. The only runtime requirements of VizBin are a working Java™ installation and the Java™ Standard Edition Runtime Environment (JRE; http://www.oracle.com/technetwork/java/javase/index.html). This work implements parallelization into the original C++ source code for BH-SNE (http://homepage.tudelft.nl/19j49/t-SNE.html) by integrating the OpenMP® Application Programming Interface (API). VizBin incorporates this parallelized version of BH-SNE for the computation of the two-dimensional embedding, which is then visualized by the GUI. A copy of the VizBin software can be downloaded from http://claczny.github.io/VizBin.
Results and discussion
In this section, we compare the performance of our approach to a commonly used method for visualizing metagenomic data, i.e., ESOM-based visualization, as well as to a state-of-the-art fully automated binning method (MaxBin). The performances of VizBin and MaxBin were quantitatively assessed by inferring the homogeneity and completeness of bins using a collection of 107 single-copy marker genes. These genes, referred to in the following as ‘essential genes’, are conserved in 95% of all sequenced bacteria and have been previously used to assess the performance of different binning methods. When using this essential gene set, increased homogeneity (fewer essential genes present in multiple copies) and increased completeness (a higher fraction of essential genes recovered) indicate better binning performance.
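The two quality measures described above can be computed from a list of essential gene hits per bin. The following is a minimal sketch, assuming that the hits have already been obtained by some gene-detection step not shown here; the function name `bin_quality` and the input format (one gene identifier per detected copy) are hypothetical.

```python
from collections import Counter


def bin_quality(essential_gene_hits, n_essential=107):
    """Summarize completeness and multi-copy essential genes for one bin.

    essential_gene_hits: one identifier per detected essential gene copy.
    """
    copies = Counter(essential_gene_hits)
    recovered = len(copies)  # distinct essential genes found in the bin
    # Essential genes seen more than once hint at a mixture of populations.
    multi_copy = sum(1 for n in copies.values() if n > 1)
    return {
        "recovered": recovered,
        "completeness": recovered / n_essential,
        "multi_copy": multi_copy,
    }
```

A bin with high completeness and no multi-copy essential genes would be considered both complete and homogeneous under this simple reading of the criteria.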
Comparison against the ESOM-based approach
Comparison against MaxBin
MaxBin is a state-of-the-art fully automated reference-independent binning approach that uses coverage information in addition to oligonucleotide frequencies. Coverage information is obtained by mapping the sequencing reads back onto the assembled contigs. Clusters are identified automatically via expectation maximization.
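To make the notion of coverage concrete, the sketch below derives a mean per-base coverage for one contig from read alignment intervals. It is an illustration only and does not reproduce how MaxBin computes coverage; the function name `mean_coverage` and the half-open `(start, end)` interval convention are assumptions made for this example.

```python
def mean_coverage(contig_length, alignments):
    """Mean per-base coverage of a contig.

    alignments: iterable of (start, end) half-open intervals, in contig
    coordinates, one per mapped read.
    """
    if contig_length <= 0:
        raise ValueError("contig_length must be positive")
    aligned_bases = 0
    for start, end in alignments:
        # Clip alignments that overhang the contig ends.
        start = max(0, start)
        end = min(contig_length, end)
        if end > start:
            aligned_bases += end - start
    # Total aligned bases divided by contig length gives the average depth.
    return aligned_bases / contig_length
```

For example, three reads of 50 bases each on a 100-base contig give a mean coverage of 1.5.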
We used VizBin to inspect the original MaxBin-based bins of two cellulolytic microbial communities (37A and 37B) and one human-borne microbial consortium (SRS013705; tongue dorsum). For individual bin visualizations, see Additional file 2. The MaxBin-based bins often comprise single clusters in the VizBin plots (Figure 2; Additional file 1: Figure S2). However, numerous exceptions exist where multiple subclusters are apparent for single MaxBin-based bins (Figure 2; Additional file 1: Figure S3). Given the previously demonstrated ability of our approach for resolving population-level genomic complements [14], these MaxBin-based bins likely represent mixtures of sequences derived from originally distinct microbial populations, thus suggesting heterogeneity in the automatically generated bins. This suggestion is supported by the occurrence of essential genes in multiple copies in the original MaxBin-based bins (Additional file 1: Tables S1-S3).
Table 1. Statistics of subclusters identified using VizBin for MaxBin-based bins 37B.out.024, 37B.out.026, and SRS013705.out.029 [table body not preserved; surviving column header: Number of contigs]
Due to the pronounced heterogeneity observed for bin SRS013705.out.029, we also re-analyzed it using VizBin. The original SRS013705.out.029 bin separates into five distinct subclusters (Additional file 1: Figure S4) and the number of essential genes in multiple copies per subcluster is markedly reduced when separating these (Table 1; Additional file 1: Table S3). In particular, all but one subcluster are homogeneous, with SRS013705.out.029.001 being almost completely homogeneous.
Overall, the presented results demonstrate the potential of VizBin for the post hoc inspection and refinement of automatically generated bins.
Here, we present VizBin, an easy-to-use, stand-alone software application for the visualization, inspection, and human-augmented binning of metagenomic datasets from single samples. The presented results demonstrate that the VizBin software can be applied to metagenomic data from various environments and leverages the superior pattern recognition capabilities of the human eye-brain system for cluster identification and delineation. In addition to its use for de novo human-augmented binning of metagenomic data, VizBin holds great potential for the post hoc inspection and subsequent refinement of automatically generated bins. This is illustrated by the increased homogeneity and increased completeness in the case of metagenomic data derived from cellulolytic or human-borne microbial consortia. Furthermore, the herein-presented software application improves on the computational efficiency of the previously described approach through the integration of parallelization. Two-dimensional embeddings are obtained in less than 3 min for datasets of ≈30,000 fragments on a common, multi-core desktop computer. Moreover, our results demonstrate that despite recent advances in automated unsupervised binning of individual samples, as represented by MaxBin, improved results can be obtained through efficient visualization of the entire community and/or of automatically generated bins. MetaWatt and GroopM are two recent approaches which involve human input for the definition or refinement of metagenomic bins. However, MetaWatt has been demonstrated for microbial communities with relatively small numbers of binnable populations, and GroopM is a representative of a set of recent approaches for automated unsupervised binning which rely on abundance information across several samples [23,24].
Abundance-based approaches may not be generally applicable for metagenomic analysis of microbial consortia for various reasons, such as limited sample quantities or the prohibitive cost of analysing the number of samples required (e.g., a suggested minimum of 18 samples). While a minimum of three related samples is suggested for GroopM, VizBin allows for the characterization of single samples. We are currently exploring ways to integrate coverage information (from single samples) into the dimension reduction step, as it is expected to provide another important and likely informative feature for the visualization and subsequent binning of metagenomic data. At the present time, as described above, sequence coverage from the metagenomic assembly of a single sample as well as other information may optionally be provided to VizBin to enhance scatter plot visualization.
Availability and requirements
Project name: VizBin
Project home page: https://claczny.github.io/VizBin
Operating system(s): Platform independent
Programming language: Java™ version 7 or greater
Other requirements: Java™ Standard Edition Runtime Environment (JRE); local installation of BLAS/LAPACK for maximum performance (detailed information is provided in the project’s wiki)
License: BSD License (4-clause). Detailed licensing information is available at https://github.com/claczny/VizBin.
Restrictions: None
BH-SNE: Barnes-Hut stochastic neighbor embedding
GUI: Graphical user interface
API: Application programming interface
JRE: Java™ Standard Edition Runtime Environment
The authors thank Emilie Muller, Joëlle Fritz, Susanne Reinsbach, Shaman Narayanasamy, Anna Heintz-Buschart, Yohan Jarosz, and the participants of the AllBio 2014 workshop at SciLifeLab in Stockholm (Sweden) for testing the application and fruitful discussions. The authors thank Andreas Keller for insightful suggestions. The present work was supported by an ATTRACT programme grant (A09/03) and a European Union Joint Programming in Neurodegenerative Diseases grant (INTER/JPND/12/01) to PW and an Aide à la Formation Recherche grant (AFR PHD/4964712) to CCL, all funded by the Luxembourg National Research Fund (FNR). This research includes results from GeneGrabber/ACD developed by the Banfield Laboratory at UC Berkeley with funding provided by the Subsurface Biogeochemistry and Genomic Sciences Programs, Biological and Environmental Research (BER), Office of Science, U.S. Department of Energy. The ACD metagenome was collected and developed with the support of the Integrated Field Research Challenge (IFRC) site at Rifle, Colorado. The Rifle IFRC Project is a multidisciplinary, multi-institutional project managed by Lawrence Berkeley National Laboratory, Berkeley, California for the U.S. Department of Energy under Contract No. DE-AC02-05CH11231.
- Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, et al. Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004; 304(5667):66–74. doi:10.1126/science.1093857.
- Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004; 428(6978):37–43. doi:10.1038/nature02340.
- Warnecke F, Luginbühl P, Ivanova N, Ghassemian M, Richardson TH, Stege JT, et al. Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite. Nature. 2007; 450(7169):560–5. doi:10.1038/nature06269.
- Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010; 464(7285):59–65. doi:10.1038/nature08821.
- Arumugam M, Raes J, Pelletier E, Le Paslier D, Yamada T, Mende DR, et al. Enterotypes of the human gut microbiome. Nature. 2011; 473(7346):174–80. doi:10.1038/nature09944.
- Howe AC, Jansson JK, Malfatti SA, Tringe SG, Tiedje JM, Brown CT. Tackling soil diversity with the assembly of large, complex metagenomes. Proc Natl Acad Sci USA. 2014; 111(13):4904–9. doi:10.1073/pnas.1402564111.
- Sunagawa S, Mende DR, Zeller G, Izquierdo-Carrasco F, Berger SA, Kultima JR, et al. Metagenomic species profiling using universal phylogenetic marker genes. Nat Methods. 2013; 10(12):1196–9. doi:10.1038/nmeth.2693.
- Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014; 15(3):R46. doi:10.1186/gb-2014-15-3-r46.
- Wilmes P, Simmons SL, Denef VJ, Banfield JF. The dynamic genetic repertoire of microbial communities. FEMS Microbiol Rev. 2009; 33(1):109–32. doi:10.1111/j.1574-6976.2008.00144.x.
- Karlin S, Burge C. Dinucleotide relative abundance extremes: a genomic signature. Trends Genet. 1995; 11(7):283–90.
- Gori F, Mavroedis D, Jetten MSM, Marchiori E. Genomic signatures for metagenomic data analysis: exploiting the reverse complementarity of tetranucleotides. In: Chen L, Zhang X-S, Wu L-Y, Wang Y, editors. 2011 IEEE Int Conf Systems Biol (ISB). Zhuhai, China: IEEE; 2011. p. 149–54. doi:10.1109/ISB.2011.6033147.
- Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, Ikemura T. Informatics for unveiling hidden genome signatures. Genome Res. 2003; 13(4):693–702. doi:10.1101/gr.634603.
- Dick GJ, Andersson AF, Baker BJ, Simmons SL, Thomas BC, Yelton AP, et al. Community-wide analysis of microbial genome sequence signatures. Genome Biol. 2009; 10(8):R85. doi:10.1186/gb-2009-10-8-r85.
- Laczny CC, Pinel N, Vlassis N, Wilmes P. Alignment-free visualization of metagenomic data by nonlinear dimension reduction. Sci Rep. 2014; 4:4516. doi:10.1038/srep04516.
- Van Der Maaten L. Barnes-Hut-SNE. arXiv. 2013; arXiv:1301.3342.
- Wu Y-W, Tang Y-H, Tringe SG, Simmons BA, Singer SW. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014; 2(1):26. doi:10.1186/2049-2618-2-26.
- Dupont CL, Rusch DB, Yooseph S, Lombardo M-J, Richter RA, Valas R, et al. Genomic insights to SAR86, an abundant and uncultivated marine bacterial lineage. ISME J. 2012; 6(6):1186–99. doi:10.1038/ismej.2011.189.
- Albertsen M, Hugenholtz P, Skarshewski A, Nielsen KL, Tyson GW, Nielsen PH. Genome sequences of rare, uncultured bacteria obtained by differential coverage binning of multiple metagenomes. Nat Biotechnol. 2013; 31(6):533–8. doi:10.1038/nbt.2579.
- The Human Microbiome Project Consortium. Structure, function and diversity of the healthy human microbiome. Nature. 2012; 486(7402):207–14. doi:10.1038/nature11234.
- Wrighton KC, Thomas BC, Sharon I, Miller CS, Castelle CJ, VerBerkmoes NC, et al. Fermentation, hydrogen, and sulfur metabolism in multiple uncultivated bacterial phyla. Science. 2012; 337(6102):1661–5. doi:10.1126/science.1224041.
- Strous M, Kraft B, Bisdorf R, Tegetmeyer HE. The binning of metagenomic contigs for microbial physiology of mixed cultures. Front Microbiol. 2012; 3:410. doi:10.3389/fmicb.2012.00410.
- Imelfort M, Parks D, Woodcroft BJ, Dennis P, Hugenholtz P, Tyson GW. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ. 2014; 2:e603. doi:10.7717/peerj.603.
- Nielsen HB, Almeida M, Juncker AS, Rasmussen S, Li J, Sunagawa S, et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol. 2014; 32(8):822–8. doi:10.1038/nbt.2939.
- Alneberg J, Bjarnason BS, de Bruijn I, Schirmer M, Quick J, Ijaz UZ, et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014. doi:10.1038/nmeth.3103.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.