METABOLIC: High-throughput profiling of microbial genomes for functional traits, biogeochemistry, and community-scale metabolic networks

Background Advances in microbiome science are being driven in large part due to our ability to study and infer microbial ecology from genomes reconstructed from mixed microbial communities using metagenomics and single-cell genomics. Such omics-based techniques allow us to read genomic blueprints of microorganisms, decipher their functional capacities and activities, and reconstruct their roles in biogeochemical processes. Currently available tools for analyses of genomic data can annotate and depict metabolic functions to some extent, however, no standardized approaches are currently available for the comprehensive characterization of metabolic predictions, metabolite exchanges, microbial interactions, and contributions to biogeochemical cycling. Results We present METABOLIC (METabolic And BiogeOchemistry anaLyses In miCrobes), a scalable software to advance microbial ecology and biogeochemistry using genomes at the resolution of individual organisms and/or microbial communities. The genome-scale workflow includes annotation of microbial genomes, motif validation of biochemically validated conserved protein residues, identification of metabolism markers, metabolic pathway analyses, and calculation of contributions to individual biogeochemical transformations and cycles. The community-scale workflow supplements genome-scale analyses with determination of genome abundance in the community, potential microbial metabolic handoffs and metabolite exchange, and calculation of microbial community contributions to biogeochemical cycles. METABOLIC can take input genomes from isolates, metagenome-assembled genomes, or from single-cell genomes. Results are presented in the form of tables for metabolism and a variety of visualizations including biogeochemical cycling potential, representation of sequential metabolic transformations, and community-scale metabolic networks using a newly defined metric ‘MN-score’ (metabolic network score). METABOLIC takes ∼3 hours with 40 CPU threads to process ∼100 genomes and metagenomic reads within which the most compute-demanding part of hmmsearch takes ∼45 mins, while it takes ∼5 hours to complete hmmsearch for ∼3600 genomes. Tests of accuracy, robustness, and consistency suggest METABOLIC provides better performance compared to other software and online servers. To highlight the utility and versatility of METABOLIC, we demonstrate its capabilities on diverse metagenomic datasets from the marine subsurface, terrestrial subsurface, meadow soil, deep sea, freshwater lakes, wastewater, and the human gut. Conclusion METABOLIC enables consistent and reproducible study of microbial community ecology and biogeochemistry using a foundation of genome-informed microbial metabolism, and will advance the integration of uncultivated organisms into metabolic and biogeochemical models. METABOLIC is written in Perl and R and is freely available at https://github.com/AnantharamanLab/METABOLIC under GPLv3.


135
we considered both the score type (full length or domain) and the score value to assign whether 136 an individual protein hit is significant or not. Methods on the manual curation of these databases 137 are described in the next section.

139
Curation of cutoff scores for metabolic HMMs 140 Two curation methods for adjusting NC or TC of TIGRfam/Pfam/Custom databases were used 141 for a specific HMM profile. First, we parsed and downloaded representative protein sequences 142 according to either the corresponding KEGG identifier or UniProt identifier [28]. We then 143 randomly subsampled a small portion of the sequences (10% of the whole collection if this was 144 more than 10 sequences, or at least 10 sequences) as the query to search against the 145 representative protein collections [29]. Subsequently, we obtained a collection of hmmsearch 146 scores by pair-wise sequence comparisons. We plotted scores against hmmsearch hits and 147 selected the mean value of the sharpest decreasing interval as the adjusted cutoff. Second, we 148 downloaded a collection of proteins that belong to a specific HMM profile and pre-checked the 149 quality and phylogeny of these proteins by constructing and manually inspecting phylogenetic 150 trees. We applied pre-checked protein sequences as the query search against a set of training 151 metagenomes (data not shown). We then obtained a collection of hmmsearch scores of resulting 152 hits from the training metagenomes. By using a similar method as described above, the cutoff 153 was selected as the mean value of the sharpest decreasing interval.

155
The following example demonstrates how the method above was used to curate the 156 hydrogenase enzymes. We then expanded this method to all genes using a similar method. We

262
A recent metagenomic-based study of the microbial community from an aquifer adjacent to

327
Given the ever-increasing number of microbial genomes from microbiome studies, we

379
Outputs consist of six different results that are reported in an Excel spreadsheet (Additional file 380 1: Figure S1). These contain details of protein hits (Additional file 1: Figure S1A) which include 381 both presence/absence and protein names, presence/absence of functional traits (Additional file 382 1: Figure S1B), presence/absence of KEGG modules (Additional file 1: Figure S1C),

441
The thickness of edges represents gene coverages (measured as the average of these two steps).

442
The color of the edge corresponds to the taxonomic group, and at the whole community level,

499
The microbial community dataset of 98 MAGs from a deep-sea hydrothermal vent was used as 500 a test to demonstrate this workflow. After running the bioinformatic analyses described above,

558
To compare the prediction performance ( Figure 7B), we used two representative bacterial 559 genomes as the test datasets. We randomly picked 100 protein sequences from individual 599 Figure S7B). Nevertheless, deep-sea hydrothermal vent has a higher fraction of genomes 600 (59/98) and a higher relative abundance (73%) of these genomes involving sulfur oxidation 601 compared to the freshwater lake (Additional file 7: Figure S7B). This indicates that the deep-