Xander graph structure
Xander requires two sets of input sequences: a set of reference sequences of the targeted genes to build a protein profile HMM and one or more metagenomic read files to build a de Bruijn graph (DG). An HMM can be considered as a directed probabilistic graph with transition and emission probabilities between states. A novel graph structure was created to combine the DG and HMM together into a single combined weighted assembly graph (CAG) (Fig. 1).
For a graph G, let V(G) and E(G) be the vertex set and edge set, respectively. A vertex w in CAG is created for every pair of vertices u and v, where u belongs to V(DG) and v to V(HMM). The total number of vertices in CAG will be |V(DG)| * |V(HMM)|. The edge \( \overrightarrow{w_i{w}_j} \) between vertices w
i
and w
j
in the edge set E(CAG) is made by combining vertices v
i
with u
i
and v
j
with u
j
, respectively:
\( \overrightarrow{w_i{w}_j\ }\in\ E\left(\mathrm{C}\mathrm{A}\mathrm{G}\right)\ \leftrightarrow\ \overrightarrow{v_i{v}_j\ }\in\ E\left(\mathrm{H}\mathrm{M}\mathrm{M}\right)\kern0.5em \mathrm{and}\kern0.5em \overrightarrow{u_i{u}_j}\ \in\ E(DG) \) if v
j
is an insert or match state.
\( \overrightarrow{w_i{w}_j\ }\in\ E\left(\mathrm{C}\mathrm{A}\mathrm{G}\right)\leftrightarrow\ \overrightarrow{v_i{v}_j\ }\in\ E\left(\mathrm{H}\mathrm{M}\mathrm{M}\right)\kern0.5em \mathrm{and}\ {u}_i={u}_j \) if v
j
is a delete state.
The weight of an edge \( \overrightarrow{w_i{w}_j} \) in CAG is defined as the log of the product of the transition and emission probabilities taken from the HMM:
\( \mathrm{weight}\left(\overrightarrow{w_i{w}_j}\right) \) = log [ℙtransition(v
i
→v
j
)] + log [ℙemission(v
j
)] if v
j
is a match state, \( \mathrm{weight}\left(\ \overrightarrow{w_i{w}_j}\ \right)\kern0.5em =\kern0.5em \log \left[{\mathrm{\mathbb{P}}}_{\mathrm{transition}}\left({v}_i\ \to\ {v}_j\right)\right] \) otherwise.
The emission symbol is the terminal character(s) in the substring of length k (kmer) contained in u. The underlying DG graph is constructed in nucleotide space regardless of whether the HMM is modeling protein or nucleotide sequences. When searching with a protein HMM, the DG graph is converted to protein space by walking three vertices in any one direction at a time. The emission symbol then becomes the last three nucleotide characters of the kmer translated to a single amino acid (aa). The vertex chosen to begin graph traversal thus determines the translation reading frame.
By default, Xander builds a graph using all the kmers from the reads regardless of the abundance of the kmers. Xander implements a counting Bloom filter to store the counts of each kmer. This allows optionally filtering out kmers failing to meet the minimum abundance cutoff during the graph-building step.
Assembly approach
Xander assembly starts at a vertex corresponding to a kmer contained in the target gene. Since starting vertices can be in any model position, not just the beginning of the model, a second HMM is built from the reverse of the seed alignment used to build the forward HMM. Using this reverse model, Xander can traverse paths in both directions from a starting vertex. From a starting vertex, Xander searches for the best path through the CAG to a goal vertex, a vertex corresponding to one end of the HMM. The contigs generated by each search direction are output separately and then combined into a single contig.
To identify starting kmers, Xander uses a set of aligned reference sequences from the target genes and the read files. The reference set used at this step can be larger than the one used to build HMMs and can include partial and lower quality sequences to provide broader coverage of the organisms. All overlapping kmers from the reference sequences are stored in a hash table. Each read is decomposed into kmers to search for matches in the hash. If a perfect match is found, the kmer, alignment position, and corresponding implied HMM match states form a starting vertex for the next assembly step. For use with a protein HMM, a peptide kmer length of one third of the input kmer size is used and input reads are translated into all six reading frames. When assembling multiple target gene families, the reference sets can be combined together into a single hash so that search starts can be identified in a single pass over the reads.
To assemble contigs, Xander searches from the starting vertices corresponding to the identified starting kmers using the A* search algorithm [19] to find the best paths through the CAG. The set of goal vertices is defined as any vertex constructed from the terminal HMM model position match or delete state. The scoring function for a path P is defined as follows: \( S(P)={\displaystyle {\sum}_{i=1}^{\left|P\right|}}\mathrm{weight}\left(\overrightarrow{w_{i-1}{w}_i}\right) \) where the weight function returns the weight of the edge between two vertices in P.
The A* algorithm maintains an “open set” of examined partial paths from the starting kmer ending in a vertex w. The open set is sorted by the expected score for the best path through w to a goal vertex, which is equal to the sum of the score of the partial path to w and of the score from w to a goal vertex using a heuristic cost function h(w). Let vertex v be the HMM component of w and m
i
be the match state HMM vertex following v. h(w) is defined as follows:
$$ h(w)= \log \left[\mathrm{\mathbb{P}}\left(v\to {m}_i\right)\right]+{\displaystyle \sum_i^{L-1}} \log \left[\mathrm{\mathbb{P}},\left({m}_i,\to, {m}_{i+1}\right)\right] $$
That is, the sum of the most likely state transitions from v to the end of the model, where ℙ is the probability of the given transition and emission of the most likely amino acid, and L is the length of the HMM. Since this is the best scoring path from v to the goal for any possible sequence, it meets the admissibility criteria for A*.
The emission probabilities produced by HMMER are normalized to a null model, meaning that some normalized emission values are greater than one. To ensure the log-odds edge weights used by the heuristic score and the scoring function were both monotonic, the log of the maximum normalized match emission probability was subtracted from the edge weights for the preceding match → match, insert → match, match → delete, delete → match, and delete → delete edges. These modified weights are used except where we explicitly refer to the unmodified weights below.
The A* algorithm proceeds by successively examining the best scoring vertices from the open set. If a best scoring vertex w is not a goal vertex, the adjacent vertices are added to the open set and w is removed from the open set and added to a closed set. We cache the best paths found from previous searches to speed up subsequent searches. If w is on a cached shortest path, we only open the path to the next vertex on the cached best path.
A path-pruning heuristic modification to the A* search is implemented in Xander to terminate paths that are unlikely to yield contigs that match the model well. In addition to the path score calculated from the monotonic (modified) edge weights, a standard HMM score is calculated from the unmodified weights and maintained for each partial path, along with the position of the maximum HMM scoring vertex w
m
in the path to w. When a vertex w reaches the top of the open set, it is discarded (pruned) if the HMM score along the path to w has not improved within a user-specified number of vertices. In the event a search terminates (the open set is empty) before reaching the end of the model, the path to w
m
, the intermediate vertex with the highest HMM score is returned.
The A* algorithm only finds one single shortest path from a starting vertex. To explore microheterogeneity, Xander implements an option to find multiple high-scoring paths from a single starting vertex using Yen’s K shortest path algorithm [20]. This algorithm iteratively finds the shortest path then the second shortest path to the kth shortest path. This is sped up by the observation that the ith shortest path in the sequence must branch from one of the i − 1 shortest paths already identified. Yen’s algorithm can be further improved by the observation that the ith shortest path must branch from its parent j after the point j branched from its parent [21]. In Xander, we have modified Yen’s algorithm to find the next best path P
i
containing at least one edge not present in the previously found i − 1 best paths. This avoids a combinatorial explosion of best paths from a small number of minor variations. A contig sequence may still be common for multiple paths, representing variations in the alignment between the HMM and the sequence, so additionally, the implementation does not return paths that do not contain any previously unseen kmers. In this way, each of the K paths returned contains new sequence information.
HMP-defined community data
The Human Microbiome Project (HMP)-defined community consists of 22 human-gut-associated microorganisms (Additional file 1: Table S1). Only the 20 bacterial organisms were used in this analysis. The reference genomes were downloaded [22], and the annotations for each organism were downloaded from GenBank. Two whole-genome shotgun sequence datasets for the HMP-defined community were downloaded from NCBI’s Short Read Archive (SRR172902, SRR172903). The two datasets were combined for analysis in this study. The combined set consists of a total of 1037 Mbp of length 75 bp Illumina reads. Quality filtering was performed by trimming reads at a quality score of “B” as recommended by Illumina (CASAVA 1.7 User Guide) using the RDPTools ReadSeq package [23].
Rhizosphere soil data
Metagenomic sequence was produced by the Joint Genome Institute (JGI) as part of the Great Lakes Bioenergy Research Center’s (GLBRC) sustainability research theme. The samples derive from rhizosphere soil collected at Kellogg Biological Station intensive sites [24] in October 2012 from three biofuel crops: Zea mays (corn), Miscanthus × giganteus (Miscanthus), and Panicum virgatum (switchgrass). For each crop, there are seven biological replicates (Additional file 1: Table S2). Sequence reads and metadata are available from JGI [25]. Each read file contains paired-end reads of length 150 bp from one lane of Illumina HiSeq, one file, or lane, per rhizosphere sample. We downloaded the bulk assembly contigs (minimum length 300 bp) from MG-RAST [26]. In brief, these bulk assemblies were created by first merging the paired-end reads, then all seven replicates of each crop were pooled together before assembly using the diginorm/partitioning assembly pipeline [27] (see Additional file 1). To be consistent with the bulk assembly, we used the merged reads as the starting data for Xander assembly.
HMM construction
HMMs were built using the “seed sequences” for the corresponding genes from RDP’s FunGene site [23]. These seed sequences were used to build a forward HMM and a reverse HMM for each gene using a modified version of HMMER3 [28]. Since HMMER3’s default settings are tuned for detecting remote paralogs [29] whereas Xander is targeting close homologs, we used the “-- enone” option to disable sequence weighting. The default priors sometimes caused extensive searching of nonproductive insert and delete paths. HMMER3’s source code was modified to change the prior probabilities for the delete → match and insert → match transitions to 95 % probability and delete → delete and insert → insert transitions to 5 % probability. The modifications to HMMER3 are available as a patch file against version 3.0.
Reference sets
From FunGene release version 7.5.3, we downloaded a set of 263 unique near full-length nitrite reductase (NirK) protein and corresponding nucleotide sequences using the following filters: minimum aa 300; minimum HMM coverage 80; and minimum HMM score 300. The average length of the protein sequences was 410 aa. A set of 1734 near full-length ribosomal protein L2 (RplB) protein and corresponding nucleotide reference sequences (average length of 280 aa) was selected from the same site with these filters: minimum aa: 250; minimum HMM coverage 90; and minimum HMM score of 440. For nitrogenase reductase (NifH), we used the set of 782 near full-length reference sequences with average length of 300 aa that was used as references for the FrameBot tool in a previous publication [30]. These protein reference sets were used with Xander to identify starting kmers and by FrameBot to find the closest matches to the contigs. The nucleotide reference sets were used with UCHIME [31] to detect chimeras. All these reference sets are available from the Xander_assembler package (see Availability and requirements below).
Bulk assembly of rhizosphere soil data
Bulk assemblies were provided by Jiarong Guo. The following section briefly describes the bulk assembly steps [Jiarong Guo, personal communication] for the 21 rhizosphere soil samples deposited in MG-RAST [26]. The raw reads were trimmed starting at bases with the quality score of “#” near the distal end and the paired-end reads were merged using Flash [32] with parameters “-m 10 -M 120 -x 0.20 -r 140 -f 250 -s 25 -t 1.” The sizes of the merged files in FASTA format range from 27 to 57 gigabytes (GB) (Additional file 1: Table S2). The merged reads from all seven replicates of each crop were pooled together before bulk assembly using the diginorm/partitioning assembly pipeline [27]. The minimum length of contigs is 300 bp. The reads were mapped to the contigs using BWA [33].
Targeted gene identification in bulk assembly
Protein sequences were translated from the bulk assembly contigs downloaded from MG-RAST using FragGeneScan [34]. We used HMMER3 to search for NirK, NifH, and RplB from these protein sequences using the forward HMM models used by Xander. Only hits with a minimum HMM score of 50 were retained. Hits were clustered at 99 % aa identity. The closest matches to the representatives were identified using the RDP AlignmentTool package [23] against the protein reference sets.
Xander processing steps for rhizosphere data
For each of the three genes and for each sample (either individual or pooled), we used Xander to assemble one best contig from each starting kmer of length 45. To be comparable to the metagenomic assembly contigs, assembled contigs shorter than 300 nucleotides or with an HMM score less than 50 were discarded. A few post-assembly steps were included as part of the analysis (Fig. 2). We clustered the assembled contigs at 99 % aa identity and chose a set of representative nucleotide and protein contigs (the longest contig from each cluster). The 99 % aa identity cutoff was used for contig clustering throughout the analysis unless otherwise noted. Chimeras were identified using UCHIME against the nucleotide reference set. The closest matching reference sequences to these representative contigs were identified using FrameBot. Read mapping and kmer coverage estimates were also performed as described below.
Gene abundance and kmer coverage estimate
We used the single copy core rplB gene to normalize gene abundance. The relative abundance for a particular gene, e.g., nirK, was estimated as the ratio of the number of reads covering at least one kmer in the nirK gene contigs divided by the number of reads covering at least one kmer in the rplB gene contigs. The same kmer length used for the assembly step was used to estimate coverage.
The mean nucleotide kmer coverage for each representative nucleotide contig was calculated from the reads using a tool included in the Xander package. When a read had a kmer present in multiple contigs, the counts for that kmer were equally divided among the contigs to avoid overcounting. The kmer coverage of a contig is then calculated as the mean counts of all the kmers included in that contig.
Ordination analysis
For each of the nirK and rplB genes, the representative protein contigs from each of the 21 rhizosphere soil samples were aligned using HMMER3 and all samples clustered together using RDP mcClust [23] with the complete linkage algorithm. The operational taxonomic unit (OTU) abundances at 95 % aa identity were corrected using the mean nucleotide read coverage of each contig before principal component analysis (PCA). Vegan package version 2.0-3 in R 2.15 was used to perform PCA.
SAT-Assembler processing
SAT-Assembler [15] was run with the default settings. The HMMs were created with the unmodified HMMER 3.0, with the seed sequences used for creating Xander-specific HMMs.
Computer resources
The MSU High Performance Computing Center (HPCC) [35] cluster was used for the majority of experiments. SAT-Assembler failed to complete assembly of the HMP + corn 1 dataset after 7 days, the maximum allowed runtime on the HPCC (Xander completed all steps in 12 h). We assembled these data with SAT-Assembler by using a faster machine (iMac with 3.2-GHz Intel Core i5 processor with local drive) in 124 h.