A multi-source domain annotation pipeline for quantitative metagenomic and metatranscriptomic functional profiling

Background Biochemical and regulatory pathways have until recently been thought and modelled within one cell type, one organism and one species. This vision is being dramatically changed by the advent of whole microbiome sequencing studies, revealing the role of symbiotic microbial populations in fundamental biochemical functions. The new landscape we face requires the reconstruction of biochemical and regulatory pathways at the community level in a given environment. In order to understand how environmental factors affect the genetic material and the dynamics of the expression from one environment to another, we want to evaluate the quantity of gene protein sequences or transcripts associated to a given pathway by precisely estimating the abundance of protein domains, their weak presence or absence in environmental samples. Results MetaCLADE is a novel profile-based domain annotation pipeline based on a multi-source domain annotation strategy. It applies directly to reads and improves identification of the catalog of functions in microbiomes. MetaCLADE is applied to simulated data and to more than ten metagenomic and metatranscriptomic datasets from different environments where it outperforms InterProScan in the number of annotated domains. It is compared to the state-of-the-art non-profile-based and profile-based methods, UProC and HMM-GRASPx, showing complementary predictions to UProC. A combination of MetaCLADE and UProC improves even further the functional annotation of environmental samples. Conclusions Learning about the functional activity of environmental microbial communities is a crucial step to understand microbial interactions and large-scale environmental impact. MetaCLADE has been explicitly designed for metagenomic and metatranscriptomic data and allows for the discovery of patterns in divergent sequences, thanks to its multi-source strategy. MetaCLADE highly improves current domain annotation methods and reaches a fine degree of accuracy in annotation of very different environments such as soil and marine ecosystems, ancient metagenomes and human tissues. Electronic supplementary material The online version of this article (10.1186/s40168-018-0532-2) contains supplementary material, which is available to authorized users.

MetaCLADE runs on High Performance Computing. Parallel computation is exploited in MetaCLADE first main step and in the first two sub-steps of the second main step. For each MG/MT sample, MetaCLADE run time is reported in Supplemental Table S7. For example, for the largest dataset analysed in this publication, the Puerto Rico Rainforest dataset, we used 3 354.35 hours to accomplish domain hit identification (MetaCLADE step 1). On 64 cores, this step takes approximately 52 hours. This step is the most time consuming and it is dependent of the time complexity of the search tools HMMer and PSI-BLAST. Domain filtering (MetaCLADE step 2) is faster. The first filter took 183.24 hours, while the second took 48.76 hours. The third and final step in domain selection cannot run in parallel but took only 19.66 minutes. On 64 cores, domain selection (the entire step 2) can be realised in approximately 3.94 hours.
The pre-computed step (the Naive Bayes training step) was realised in a total of 67 637 135 hours, considering a single core. In reality, each probabilistic model analysis was run in parallel on several cores. It is important to notice that this step was executed once, for all datasets analysed in this article. Note that the training spaces constructed in this step can be improved by the addition of new negative sequences to existing ones. Figure S1: Distribution of Pfam domain model sizes. Distribution of the size of the SCMs associated to Pfam domains used in metaCLADE. Figure S2: Distribution of potential negative sequences, generated with Markov models of order 3. Distribution of all potential negative sequences, all domain models confounded, generated with Markov models of order 3. They make a total of 39 241 830 000 sequences. Note that the identification of these sequences was possible after the generation of about 16 million sequences by Markov models, for each one of the 14 831 domains in Pfam.

A.
B.  Figure S4: Distribution of negative sequences generated or not by Markov models. The impact of the construction of Markov models is illustrated by the distribution of negative sequences in the training sets of CCMs (bottom) and SCMs (pHMMs; top) constructed by either reshuffling of 2-mers or inversion only (A) compared to the generation of negative sequences constructed with the three methodological approaches, i.e. reshuffling, inversion and Markov models (B). Note that about 50% of the domains are characterised by training sets containing at least the 50% of negative sequences.
GA score -mean(top 5 bit-scores of negative sequences) GA score -mean(top 5 bit-scores of negative sequences) Figure S5: GA vs negative bit-scores distribution. For each Pfam domain and its associated SCM (pHMM), we computed the difference between the GA threshold associated to the SCM (pHMM) and the mean of the 5 best negative sequences bit-scores identified by the SCM (pHMM). The distribution of the differences shows a small standard deviation.

A.
B.    scale. Note that this distribution cannot be immediately compared to Figures S7D-S10D and Figure 5D because it corresponds only to domain identifications that were not identified before. The detection of these hits constitutes a challenge. A.

B.
C. D.    x-axis using a − log 10 scale. D. Distribution of probabilities associated to those exclusive UProC domain annotations that have been detected by MetaCLADE but discarded because of the probability threshold 0.9. E Venn diagram as in A of Figure S10, but where MetaClade is replaced by MetaClade+UProC. F As in C, but where MetaClade is replaced by MetaClade+UProC.   Table 3 of the article. Also, for each curve, the associated AUC is reported in parenthesis on the inset legend. The ranking score is computed based on the significance of the bit-score (value between 0 and 1), for CCMs (left) and SCMs (right). Values on the table are the ones giving best results on the dataset simulated from 56 fully sequences genomes.         Note that the total length of the fragments in a dataset corresponds to 1.5 * #CDS * CDS average length. Table S10: Species and Models used to perform the analysis described in Figure 6 Domain Species Models    Table S13: Comparison of HMM-GRASPx and MetaCLADE against a simulated 200-bp marine data set with uneven coverage. MetaCLADE/HMM-GRASPx-Assembly is the annotation obtained by applying MetaCLADE on the assembled contigs of HMM-GRASPx and transferring it to the reads that map against annotated contigs. In the "strict" domain annotation table, only hits having the same pfam domain with respect to the ground-truth are counted as true positives. In the "clan-based" annotation, however, domain hits that belong to the same clan with respect to the ground-truth are also counted as true positives. The assembly of the KO00680 dataset would have required more than 128GB of RAM and we were not able to finish its construction.