MetaEuk algorithm
Code and resources availability
The MetaEuk source code, compilation instructions, and a brief user guide are available from https://github.com/soedinglab/metaeuk under the GNU General Public License v3.0. The resources produced during this study are available from http://wwwuser.gwdg.de/~compbiol/metaeuk/.
Putative exons compatibility
In the first two stages of the MetaEuk algorithm, all possibly coding protein fragments are translated from the input contigs. We scan each contig in six frames and extract the fragments between stop codons. These fragments are queried against the reference target database using MMseqs2. A set of fragments from the same contig and strand that have local matches to the same specific target T define a set of putative exons. We say two putative exons Pi and Pj from the same set are compatible with each other if they can be joined together to a multi-exon protein.
Each Pi is associated with four coordinates: the amino-acid position on T from which the match to Pi starts (\( {P}_i^{\mathrm{ST}} \)) and ends (\( {P}_i^{\mathrm{ET}} \)); the nucleotide position on the contig from which the translation of Pi starts (\( {P}_i^{\mathrm{SC}} \)) and ends (\( {P}_i^{\mathrm{EC}} \)). We require a match of at least 10 amino acids (a minimal exon length). We consider putative exons Pi and Pj with \( {P}_i^{\mathrm{ST}}<{P}_j^{\mathrm{ST}} \) as compatible on the plus strand if:
-
(1)
their order on the contig is the same as on the target: \( {P}_i^{\mathrm{SC}}<{P}_j^{\mathrm{SC}} \);
-
(2)
the distance between them on the contig is at least the length of a minimal intron but not more than the length of a maximal intron: \( 15\le \left({P}_j^{\mathrm{SC}}-{P}_i^{\mathrm{EC}}\right)\le \mathrm{10,000} \);
-
(3)
their matches to T should not overlap. In practice, we allow for a short overlap to account for alignment errors: \( \left({P}_j^{\mathrm{ST}}-{P}_i^{\mathrm{ET}}\right)\ge -10 \).
In case Pi and Pj are on the negative strand, we modify conditions (1) and (2) accordingly:
-
(1)
$$ {P}_i^{\mathrm{SC}}>{P}_j^{\mathrm{SC}}; $$
-
(2)
$$ 15\le \left({P}_i^{\mathrm{EC}}-{\mathrm{P}}_j^{\mathrm{SC}}\right)\le \mathrm{10,000}. $$
Since the adjustment of conditions to the minus strand is straightforward, in the interest of brevity, we focus solely on the plus strand in the following text.
We say a set of k > 1 putative exons is compatible if, when ordered by their \( {P}_i^{\mathrm{ST}} \) values, each pair of consecutive putative exons is compatible. (A set of a single exon is always compatible.)
Bit-score and E-value computation
A set of k compatible putative exons defines a pairwise protein alignment to the target T. This alignment is the concatenation of the ordered local alignments of all putative exons to T. Between each consecutive putative pair of exons Pi and Pi+1 there might be unmatched amino acids in T or there might be a short overlap of their matches to T. We denote the number of unmatched amino acids between Pi and Pi+1 as li, which can take a negative value in case of an overlap. MetaEuk computes the bit-score of the concatenated pairwise alignment S(Pset, T) by summing the individual Karlin-Altschul [58] bit-scores S(Pi, T) of the putative exons to T and penalizing for unmatched or overlapping amino acids in T as follows:
-
(3)
$$ S\left({P}_{\mathrm{set}},T\right)=\sum \limits_{i=1}^kS\left({P}_i,T\right)+\sum \limits_{i=1}^{k-1}C\left({l}_i\right)+{\log}_2\left(k!\right) $$
where the penalty function is C(li) = − |li| for li ≠ 1 and 0 if li = 1. The last term rewards the correct ordering of the k exons.
An E-value is the expected number of matches above a given bit-score threshold. Since for each contig, at most one gene call is reported per strand and target in the reference database, the E-value takes into account the number of amino acids in the reference database D and the search on two strands:
-
(4)
$$ E-\mathrm{Value}\left({P}_{\mathrm{set}},T\right)=2\times D\times {2}^{-S\left({P}_{\mathrm{set}},T\right)} $$
Dynamic programming
Given a set of n putative exons and their target, MetaEuk finds the set of compatible exons with the highest combined bit-score. First, all putative exons are sorted by their start on the contig, such that \( {P}_1^{\mathrm{SC}}\le \dots \le {P}_n^{\mathrm{SC}} \). The dynamic programming computation iteratively computes vectors S, k, and b from their first entry 1 to their nth. The entry Si holds the score of the best exon alignment ending in exon i and ki holds the number of exons in that set. Once the maximum score is found, the exon alignment is back traced using b, in which entry bi holds the index of the aligned exon preceding exon i (0 if i is the first aligned exon). Using the following values:
-
(5)
$$ {S}_0=0;{k}_0=0;{b}_0=0 $$
all putative exons Pj are examined according to their order and the score vector is updated:
-
(6)
$$ {S}_j=\underset{i}{\max}\left({S}_i+S\left({P}_j,T\right)+C\left({l}_j^i\right)+{\log}_2\left({k}_i+1\right)|0\le i<j,i\ \mathrm{compatible}\ \mathrm{with}\ j\right) $$
kj and bj are updated accordingly. The terms log2(ki + 1) add up to the score contribution \( \sum \limits_{i=1}^k{\log}_2(i)={\log}_2\left(k!\right) \) and the transition 0 to j is defined as compatible with \( C\left({l}_j^0\right)=0 \) for all j. The optimal exon set is then recovered by tracing back from the exon with the maximal score. This dynamic programming procedure has time complexity of O(n2).
Clustering gene calls to reduce redundancy
MetaEuk assigns a unique identifier to each extracted putative protein fragment (stage 1 in Fig. 1). A MetaEuk exon refers to the part of a fragment that matched some target T (stage 2 in Fig. 1, tinted background) and has the same identifier as the fragment. Two calls that have the same exon identifier in their exon set are said to share an exon. MetaEuk reduces redundancy by clustering calls that share an exon (stage 4 in Fig. 1) and selecting a representative call as the gene prediction of each cluster. To that end, all N MetaEuk calls from the same contig and strand combination are ordered according to the contig start position of their first exon. Since this order can include equalities, they are sub-ordered by decreasing number of exons. The first cluster is defined by the first call, which serves as its tentative representative. Let m be the last contig position of the last exon of this representative. Each of the following calls is examined so long as its start position is smaller than m (i.e., it overlaps the representative on the contig). If that call shares an exon with the representative, it is assigned to its cluster. In the next iteration, the first unassigned call serves as representative for a new cluster and the following calls are examined in a similar manner, adding unassigned calls to the cluster in case they share an exon with the representative. The clustering ends with the assignment of all calls. At this stage, the final prediction is the call with the highest score in each cluster. This greedy approach has time complexity of O(N × log(N) + N × A), where A is the average number of calls that overlap each representative on the contig. Since in practice, A ≪ N, the expected time complexity is O(N × log(N)).
Resolving same-strand overlapping predictions
After the redundancy reduction step, MetaEuk sorts all predictions on the same contig and strand according to their E-value . It examines the sorted list and retains predictions only if they do not overlap any preceding predictions on the list.
Benchmark datasets
The UniRef90 database was obtained on March 2018. The annotated information of Schizosaccharomyces pombe (GCA_000002945.2), Acanthamoeba castellanii str. Neff (GCA_000313135.1), Babesia bigemina (GCA_000981445.1), Phytomonas sp. isolate EM1 (GCA_000582765.1), Nocleomorph of Lotharella oceanica (GCA_000698435.2), Phaeodactylum tricornutum (GCA_000150955.2), and Aspergillus nidulans (GCA_000149205.2) were downloaded from the NCBI genome assembly database (March–September 2018). This information included the genomic scaffolds, annotated protein sequences, and GFF3 files containing information about the locations of annotated proteins and other genomic elements. MetaEuk (Github commit 4714106, MMseqs2 submodule version ebb16f3) was run with the following parameters: -e 100 (a lenient maximal E-value of a putative exon against a target protein), --metaeuk-eval 0.0001 (a stricter maximal cutoff for the MetaEuk E-value after joining exons into a gene call), --metaeuk-tcov 0.6 (a minimal cutoff for the ratio between the MetaEuk protein and the target), and --min-length 20, requiring putative exon fragments of at least 20 codons and default MMseqs2 search parameters.
Mapping benchmark predictions to annotated proteins
For each annotated protein, we listed the contig start and end coordinates of the coding part (CDS) of each of its exons. The lowest and highest of these coordinates were considered as the boundaries of the annotated protein, and the stretch between them as its “global” contig length. Similarly, we listed these coordinates and computed the boundaries and global contig length for each MetaEuk prediction. A MetaEuk prediction was globally mapped to an annotated protein if the overlap computed based on their boundaries was at least 80% of the global contig length of either of them and if, in addition, the alignment of their protein sequences mainly consisted of identical amino acids or gaps (i.e., less than 10% mismatches). These criteria allow mapping MetaEuk predictions to an annotated protein, even if they miss some of its exons. Next, we computed the exon level mapping for all globally mapped pairs of MetaEuk predictions and annotated proteins. To that end, we compared their lists of exon contig coordinates. If an exon predicted by MetaEuk covered at least 80% of the contig length of an annotated protein’s exon, we considered the annotated exon as “covered” by the MetaEuk prediction.
Generating typical metagenomic contig lengths
In order to evaluate MetaEuk’s performance on contigs with a length distribution typical for assemblies from metagenomic samples, we recorded the lengths of the assembled contigs used for the analysis described in the “Tara Oceans dataset” section. The 1,351,204 contigs had a minimal length of 5002 bps, 1st quartile of 5661 bps, median of 6763 bps, 3rd quartile of 9020 bps, and a maximal length of 1,524,677 bps. We divided each annotated scaffold into contigs of lengths that were randomly sampled from these recorded lengths. This resulted in 1392, 5061, 1816, 2095, 80, 3153, and 3273 contigs for S. pombe, A. castellanii, Phytomonas sp. isolate EM1, nucleomorph of L. oceanica, P. tricornutum, and A. nidulans, respectively. MetaEuk was run on these contigs in the same way as on the original scaffolds. Since each of the new contigs corresponded to specific locations on the original scaffolds, we could carry out all benchmark assessments, which relied on mapping between MetaEuk predictions and annotated proteins.
Tara Oceans dataset
The 912 metagenomic SRA experiments associated with accession number PRJEB4352 were downloaded from the SRA (August–September 2018). The reads of each experiment were trimmed to remove adapters and low quality sequences using trimmomatic-0.38 [59] with parameters ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36 (SE for single-end samples). The resulting reads were then assembled with MEGAHIT [38] with default parameters. Contigs of at least 5 kbp in length were classified as eukaryotic/non-eukaryotic using EukRep [27], which is trained to be highly sensitive to detecting eukaryotic contigs. MetaEuk was run on the contigs classified as eukaryotic with parameters: -e 100, --metaeuk-eval 0.0001, --min-ungapped-score 35, --min-exon-aa 20, --metaeuk-tcov 0.6, --min-length 40, --slice-search (profile mode), and default MMseqs2 search parameters.
Taxonomic assignment to predictions and contigs
We used MMseqs2 to query the MetaEuk marine protein collection against two taxonomically annotated datasets: Uniclust90 and the MMETSP protein dataset. Taxonomic labels associated with each of the MMETSP identifiers were downloaded from the NCBI website (BioProject PRJNA231566). We retained the hit with the highest bit-score value for each prediction if it had an E-value smaller than 10-5. In addition, we examined the sequence identity between the MetaEuk prediction and the target in order to determine the rank of the taxonomic assignment. Similarly to [20], we used the following sequence identity cutoffs: > 95% (species), > 80% (genus), > 65% (family), > 50% (order), > 40% (class), > 30% (phylum), > 20% (kingdom). Lower values were assigned at the domain level. The predictions on each contig were examined and the best-scoring one was used to confer taxonomic annotation to that contig. The assignment was visualized using Krona [60].
Phylogenetic tree reconstruction
We constructed the tree using the large subunit of RNA polymerases as a universal marker. This subunit contains five RNA_pol_Rpb domains (Pfam IDs: pf04997, pf00623, pf04983, pf05000, pf04998). As detailed below, protein sequences that contained all five domains in the right order were obtained in January–November 2019 from six sources to construct the multiple sequence alignment and tree. The sources were as follows: (1) 75 sequences of the OrthoMCL [61] group OG5_127924. The four-letter taxonomic codes of these sequences were converted to NCBI scientific names, based on information from the OrthoMCL website (http://orthomcl.org/orthomcl/getDataSummary.do). (2) 36 reviewed eukaryotic sequences were downloaded from UniProt [36]. These were used to distinguish between eukaryotic RNA polymerase I (8 sequences), eukaryotic RNA polymerase II (16 sequences), and eukaryotic RNA polymerase III (12 sequences). We then ran an MMseqs2 profile search against the Pfam database (with parameters: -k 5, -s 7) with several query sets and retained results in which all five domains were matched in the right order with a maximal E-value of 0.0001. This allowed us to add the following sources: (3) 674 MMETSP proteins, (4) 100 archaeal proteins, and (5) 100 bacterial proteins. For datasets (4, 5), we first downloaded candidate proteins from the UniProt database by searching for the five domains and restricting taxonomy: archaea (bacteria). We then ran the previously described search procedure and randomly sampled exactly 100 proteins from each group that matched the criterion. (6) 1076 MetaEuk predictions. The joint set of 2061 sequences was aligned using MAFFT v7.407 [44] and a phylogenetic tree was reconstructed by running RAxML v8 [45]. Tree visualization was performed in iTOL [62].