Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation
© Celaj et al.; licensee BioMed Central Ltd. 2014
Received: 30 June 2014
Accepted: 17 September 2014
Published: 28 October 2014
Microbiome-wide gene expression profiling through high-throughput RNA sequencing (‘metatranscriptomics’) offers a powerful means to functionally interrogate complex microbial communities. Key to successful exploitation of these datasets is the ability to confidently match relatively short sequence reads to known bacterial transcripts. In the absence of reference genomes, such annotation efforts may be enhanced by assembling reads into longer contiguous sequences (‘contigs’), prior to database search strategies. Since reads from homologous transcripts may derive from several species, represented at different abundance levels, it is not clear how well current assembly pipelines perform for metatranscriptomic datasets. Here we evaluate the performance of four currently employed assemblers including de novo transcriptome assemblers - Trinity and Oases; the metagenomic assembler - Metavelvet; and the recently developed metatranscriptomic assembler IDBA-MT.
We evaluated the performance of the assemblers on a previously published dataset of single-end RNA sequence reads derived from the large intestine of an inbred non-obese diabetic mouse model of type 1 diabetes. We found that Trinity performed best as judged by contigs assembled, reads assigned to contigs, and number of reads that could be annotated to a known bacterial transcript. Without assembly, only 15.5% of RNA sequence reads could be annotated to a known transcript, in contrast to 50.3% with Trinity assembly. Paired-end reads generated from the same mouse samples resulted in modest performance gains. A database search estimated that the assemblies are unlikely to erroneously merge multiple unrelated genes sharing a region of similarity (<2% of contigs). A simulated dataset based on ten species confirmed these findings. A more complex simulated dataset based on 72 species found that more assembly errors were introduced than would be expected from sequencing quality alone. Through the detailed evaluation of assembly performance, the insights provided by this study will help drive the design of future metatranscriptomic analyses.
Assembly of metatranscriptome datasets greatly improved read annotation. Of the four assemblers evaluated, Trinity provided the best performance. For more complex datasets, reads generated from transcripts sharing considerable sequence similarity can be a source of significant assembly error, suggesting a need to collate reads on the basis of common taxonomic origin prior to assembly.
Innovations in culture-independent microbiology, coupled with rapid advances in high-throughput sequencing (HTS), are beginning to profoundly transform our understanding of the relationships between microbial communities and their environments. For example, it is becoming increasingly clear that the composition of the human gut microbiome plays a major role in the development of many human diseases including obesity, type 1 diabetes, inflammatory bowel disease, and autism [1–7]. To date, studies on the human microbiome have largely focused on the use of 16S rRNA surveys which examine shifts in the composition of microbial communities [8–10]. However, such studies lend only limited insight into microbiome function. Recently, we and others have pioneered the use of microbiome-wide gene expression profiling via RNA sequencing (RNA-Seq) or ‘metatranscriptomics’ as a route to functionally interrogate a microbiome [11–14]. Key to exploiting the full potential of these datasets is the ability to accurately assign and annotate sequence reads to known transcripts, a challenge that is complicated by the inherent complexity associated with microbial communities as well as the lack of a comprehensive set of reference genomes.
In typical RNA-Seq applications, sequence reads are mapped onto a reference genome to yield expression profiles for each gene. In the absence of reference genomes, sequence annotation is typically performed through sequence similarity searches against databases of previously annotated genes or proteins [15, 16]. However, for sequencing technologies capable of generating the hundreds of millions of reads required for metatranscriptomic analyses, resultant read lengths tend to be short (e.g., 75–150 bp), reducing our ability to identify meaningful sequence matches with confidence. Since many reads may derive from the same transcript, assembling reads into longer contiguous sequences (‘contigs’) offers a useful avenue for improving read annotation. However, unlike reads generated from a single organism, RNA-Seq analysis of complex microbial communities poses the additional complication that multiple species may be represented at significantly different levels of abundance. To date, several tools, based on the use of de Bruijn graphs, have been developed to assemble sequence data de novo: Metavelvet was originally developed to assemble metagenomic datasets, while Oases and Trinity were developed to specifically assemble RNA sequence data. More recently, a dedicated metatranscriptomics assembler has also been described that relies on the use of paired-end reads. Due to the absence of suitable datasets, it is not clear how assemblers, previously developed for assembling other types of sequence data, compare with a dedicated tool for assembling metatranscriptomic datasets and, furthermore, what types of error each may introduce.
One potential source of error in transcript assembly is the incorporation of reads from several similar transcripts such as members of the same gene family or the merging of orthologs from different species. While such errors may impact taxonomic assignments, they may have minimal impact on functional assignments. A more serious source of error is that unrelated transcripts sharing a region of high sequence similarity but distinct abundance and/or function may be merged into a single erroneous contig. In such cases, the expression value of the rarer transcript can be masked by the more abundant transcript and/or, depending on the annotation pipeline, only a single function may be ascribed. Our aim was to undertake a systematic evaluation of current assembly tools across multiple metatranscriptomic datasets to assess their performance and determine if the incidence of contig reconstruction errors is likely to impact downstream analyses.
We focused on metatranscriptomic data comprising previously published 76-bp single-end RNA sequence reads, as well as a new dataset of 76-bp paired-end reads, from a microbial consortium isolated from the large intestine (cecum) of inbred non-obese diabetic (NOD) mice, a model of spontaneous type 1 diabetes. Here we show how different approaches to sequence assembly impact transcript annotation and how complex datasets may be more prone to annotation error.
Results and discussion
Assembly significantly improves the number of annotated reads
Comparison of assembly algorithms on single-end sequence data
[Table: Overlap in assemblies. Rows and columns: Metavelvet (k = 27, 39, 51) and Oases (k = 27–35, 39–45, 51–53). Panels: all single-end contigs; contigs with BLAST score >50.]
Comparison of assembly algorithms on paired-end sequence data
In the previous sections, we examined the performance of assemblers applied to single-end reads. To examine how paired-end sequencing for metatranscriptomic studies could augment assembly and annotation, we generated a set of 29.8 million 76-bp paired-end reads (average insert size 273 bp) from the same rRNA-depleted samples used to generate the single-end reads (Additional file 1). We observed a high degree of consistency between single- and paired-end reads in terms of: 1) the relative proportion of sample represented in the entire dataset, 2) reads assigned to ribosomal transcripts for each sample, and 3) reads assigned to mouse host transcripts for each sample (Additional file 4). In contrast, the proportion of reads assigned to putative bacterial mRNA transcripts was lower for the paired-end reads and was associated with a rise in frequency of reads filtered on the basis of vector contamination. This is likely related to the processing step that discarded both reads in a pair even if only one member contained significant vector contamination. Interestingly, we note discrepancies between single- and paired-end read data in the phylogenetic distribution of reads (Additional file 4). However, overall there is reasonable correlation between samples (r² values from 0.75 to 0.99). Between the two ends of the paired-end reads, this correlation was even higher (r² from 0.95 to 1), suggesting that differences arise from biases introduced in sample preparation prior to sequencing, rather than bioinformatics processing.
Next, we evaluated potential performance gains through the use of paired-end reads. The paired-end reads allowed an evaluation of a dedicated metatranscriptomic assembly algorithm, IDBA-MT, as it is compatible only with paired-end reads. Compared to the single-end read data, we were able to assemble more contigs (Figure 3), potentially reflecting the greater number of the paired-end reads (553,115 pairs of reads vs. 516,881 single reads). For example, after dividing the paired ends into two separate datasets of 553,115 reads, Trinity generated 51,244 and 52,549 contigs, respectively. These data contrasted with 48,469 contigs for an assembly based on the 516,881 single-end reads and 58,361 contigs for an assembly based on combining both ends of the 553,115 pairs. However, compared to the single-end assembly, we were able to assemble a higher proportion of reads into contigs, resulting in a concomitant gain in the proportion of reads assembled into annotatable contigs (from 50.3% to 64.4% for the Trinity-based assembly; Figure 3). Greater performance gains were observed for the Metavelvet and Oases algorithms. Again, imposing a 150-bp contig size cutoff on the Trinity assembly resulted in a proportion of mapped reads comparable to Oases and Metavelvet despite a smaller number of overall contigs (Figure 3). In contrast, IDBA-MT did not perform as well as the other methods in either contigs produced or reads mapped to annotatable contigs. These findings are contrary to a previous report that Trinity generated only ~5% more contigs than IDBA-MT. This discrepancy might arise because this latter study combined reads from all 12 paired-end mouse samples resulting in potential coverage saturation that may have produced a convergence in the number of contigs. However, assemblies based on the entire set of sequences from all 12 samples also resulted in a greater number of contigs using Trinity (127,511 contigs) compared to IDBA-MT (16,582 contigs). Instead, this discrepancy likely arises from the use of a default parameter in the Trinity software, which reports contigs only in excess of 200 bp (resulting in 10,823 contigs). To avoid differential biases between the assemblers, we reduced this stipulation to 51 bp (the minimum allowed read length after filtering) in our analyses. The overlap profile of the paired-end read assemblies was similar to the single-end data, again suggesting little benefit in combining results from different assemblers (Additional file 5).
Database evaluation of transcript reconstruction accuracy across assemblers
Applying this procedure to our assemblies, we found that those based on single-end reads contained a low incidence of contigs composed of multiple genes (from 0.56% to 0.24% of contigs for Oases with k = 27–35 and Metavelvet with k = 39, respectively; Figure 4). Perhaps surprisingly, assemblies based on paired-end reads contained a higher incidence of misassembled contigs (from 2.08% to 0.5% of contigs for Oases with k = 27–35 and Metavelvet with k = 51, respectively). Both Trinity- and IDBA-MT-based assemblies gave comparable outcomes (1.31% and 1.03% of contigs, respectively). This increase in misassembled contigs associated with the paired-end reads is likely related to the increased read coverage provided by the dataset, resulting in longer contigs. Whether these misassembled contigs arise from the reconstruction of polycistronic mRNAs or assembly errors remains to be resolved. In any event, it is clear that while the higher coverage of the paired-end reads improves both the number of contigs and number of reads assigned to an annotatable contig, it has not improved accuracy of contig assembly.
Given the low incidence of misassembled contigs independent of assembler as determined by the above heuristic, we next looked for further evidence of potential misassembly through the identification of contigs with only partial matches to known proteins. For contigs (and fragments) possessing significant sequence similarity to a single known protein (as defined by a BLASTX match with a bit score cutoff of 50), we identified those for which the alignment with the protein covered less than 90% of the length of the contig or fragment. Such contigs or fragments may indicate either a gene fusion event which could not be resolved into discrete regions through the initial heuristic, a misassembly involving reads from two or more potentially related transcripts (e.g., members of the same gene family) that result in the generation of a hybrid transcript, or simply a novel gene that has yet to be captured by the non-redundant protein database. This latter type of sequence may be considered as a false-positive misassembly. Due to high rates of sequence divergence in microbiome samples, we expect such events to be a significant source of false-positive misassemblies. For example, a previous study suggests that ~8%–10% of genes associated with a newly sequenced genome are novel (i.e., lack sequence similarity to any known gene). Accordingly, this second heuristic yielded a higher estimate of potential misassemblies for both single- and paired-end datasets (13%–16% vs. 14%–27%, respectively; Figure 4). Interestingly, we note that increasing the k parameter in both Metavelvet and Oases increased reconstruction accuracy through both metrics.
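The partial-match heuristic described above can be sketched in a few lines. This is a minimal illustration and not the pipeline used in the study: the `Hit` record, its field names, and the example values are hypothetical, and treating the cutoff as "bit score of at least 50" is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Best BLASTX hit for one contig or fragment (hypothetical record layout)."""
    contig_id: str
    contig_len: int   # length of the contig or fragment in nucleotides
    q_start: int      # 1-based alignment start on the contig
    q_end: int        # 1-based alignment end on the contig (inclusive)
    bit_score: float


def aligned_fraction(hit):
    """Fraction of the contig covered by the alignment."""
    # Reverse-strand BLASTX hits report q_start > q_end, so order them first.
    lo, hi = sorted((hit.q_start, hit.q_end))
    return (hi - lo + 1) / hit.contig_len


def flag_partial_matches(hits, min_bits=50.0, min_cov=0.90):
    """Return IDs of contigs with a significant hit covering < min_cov of their length."""
    return [
        h.contig_id
        for h in hits
        if h.bit_score >= min_bits and aligned_fraction(h) < min_cov
    ]


hits = [
    Hit("contig_1", 300, 1, 297, 120.0),  # 99% of the contig is aligned
    Hit("contig_2", 300, 10, 159, 85.0),  # 50% of the contig is aligned
    Hit("contig_3", 300, 1, 150, 35.0),   # below the bit score cutoff, ignored
]
flagged = flag_partial_matches(hits)
```

As noted in the text, a flagged contig is only a candidate: it may equally represent a novel gene that is absent from the protein database.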
While the incidence of contigs possessing non-overlapping alignments with two or more proteins was relatively low, we nonetheless propose the implementation of a post-assembly processing step such as that outlined above to convert contigs into discrete fragments associated with distinct sequence alignments. Note also that such a processing step also has the advantage of separating individual ORFs from assemblies of polycistronic mRNA moieties.
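One way such a post-assembly step could work is sketched here. It is an illustration under stated assumptions, not the procedure used in the study: alignments are given as hypothetical `(start, end, protein_id, bit_score)` tuples with 0-based, half-open contig coordinates, and overlapping alignments are resolved greedily in favor of the higher bit score.

```python
def split_contig(seq, alignments):
    """Cut a contig into fragments, one per non-overlapping protein alignment.

    alignments: list of (start, end, protein_id, bit_score) tuples using
    0-based, half-open coordinates on the contig (an assumed layout).
    """
    chosen = []
    # Consider the strongest alignments first and keep those that do not
    # overlap anything already kept.
    for aln in sorted(alignments, key=lambda a: a[3], reverse=True):
        start, end = aln[0], aln[1]
        if all(end <= c[0] or start >= c[1] for c in chosen):
            chosen.append(aln)
    # Report fragments in their order along the contig.
    chosen.sort(key=lambda a: a[0])
    return [(protein, seq[start:end]) for start, end, protein, _ in chosen]


contig = "A" * 100 + "C" * 20 + "G" * 100
alignments = [
    (0, 100, "protA", 150.0),
    (120, 220, "protB", 90.0),
    (90, 130, "protC", 40.0),   # overlaps protA and protB, so it is dropped
]
fragments = split_contig(contig, alignments)
```

Each resulting fragment can then be annotated and quantified separately, which also separates the individual ORFs of a polycistronic mRNA.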
Assembly of simulated metatranscriptomic datasets reveals transcript accuracy
The availability of a reference metatranscriptome also allows the evaluation of assembly performance through the DETONATE software package, a transcriptome assembly evaluator. In brief, we used the combined transcriptome of the ten-species dataset to train a probabilistic model which is subsequently used to evaluate the assemblies from the NOD mouse datasets. Our results suggest that Trinity gave the worst performance according to this measure (Additional file 6), consistent with the previously reported N50 values. However, while this might suggest that the Trinity assembly least captures the properties of a real metatranscriptome, depending on the annotation pipeline, favoring inclusiveness of assembly may be preferred if it does not introduce errors which confound the interpretation of the final assembly results.
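For reference, the N50 statistic mentioned above is the contig length at which contigs of that length or longer contain at least half of the total assembled bases. A minimal sketch of the standard definition follows; the function name and example lengths are illustrative only.

```python
def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


example_n50 = n50([100, 200, 300, 400])
```

Because N50 rewards long contigs, an assembler that reports many short contigs, as Trinity did here with a 51-bp minimum, will tend to score lower on it.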
We have shown that assembly of metatranscriptomic reads considerably improves short-read annotation. While only 15.5% of single-end reads obtained from the large intestine could be confidently assigned function, this percentage increases to 50.3 after assembly with Trinity. Furthermore, the number of contigs resulting from the fusion of two unrelated genes during the assembly process was rare in simple (8–10 species) experimental and simulated datasets. In a more complex simulated dataset composed of sequences from 72 species, there were many more assembly errors than expected by the sequence read quality. However, such errors appeared confined to relatively minor sequence variants rather than the merging of two unrelated genes that share a region of sequence similarity. While Trinity did not assemble the most accurate contigs, it significantly outperformed the other three assemblers in terms of the number of reads that could be aligned to a contig with known function. Future work will focus on improving the accuracy of reconstructed contigs in complex metatranscriptomic samples by first grouping reads into taxonomically defined bins, thereby reducing sample complexity prior to assembly. This algorithmic development can be expected to reduce assembly errors that arise from the merging of homologous transcripts from different species and subsequently improve taxonomic assignment and functional annotation of assembled contigs. Furthermore, while the focus of this study was on metatranscriptome assembly as well as the types of assembly errors that could impact downstream functional analyses, future work could focus on a systematic analysis of types of potential misassemblies and how assembler parameters may be optimized to differentiate between gene fusions, transcripts cotranscribed in operons, and genuine misassemblies.
Source and processing of sequencing reads
Single-end sequence reads from a previously published mouse gut metatranscriptome study were obtained from the Sequence Read Archive (SRA051354). This dataset includes 12 samples generated from two different body sites and four different mice, using a variety of purification procedures described elsewhere. Paired-end sequences were generated from the same barcoded libraries used to generate the single-end reads following standard Illumina protocols. Sequencing was performed with the Illumina Genome Analyzer II (GAII) platform at The Centre for Applied Genomics (TCAG, Hospital for Sick Children). After deconvolution of the barcodes used for multiplexing, 29,780,781 pairs of 76-bp reads were generated on a single flow cell. This paired-end dataset, supporting the results of this article, is available from the Sequence Read Archive (SRA051354 - http://www.ncbi.nlm.nih.gov/sra/?term=SRA051354).
Compared with the previous publication, we applied a stricter protocol for removal of adaptor contaminants to optimize assembly performance; reads with adaptor or partial adaptor sequences at their ends may interfere with extension of transcripts during assembly. Adaptor sequences were identified using Cross_match (http://www.phrap.org) to search a database of Illumina adaptor sequences. We subsequently ran a more stringent screen focusing on the specific adaptors: AGATCGGAAGAGCACACGTCTGAACTCCAG and AGATCGGAAGAGCGTCGTGTAGGGAAAGA (minmatch 10, minscore 5). Poor-quality bases were removed by iterating a 5-nt window across the 5′ and 3′ ends of each sequence and removing nucleotides in windows with a mean quality score less than 20; iteration was stopped when the mean quality score was greater than 20. After trimming, reads less than 50 bp in length were discarded; for paired-end reads, if either read of a pair was less than 50 bp in length, then both reads were discarded. Putative rRNA reads were identified through BLAT sequence similarity searches (bit score ≥50) against an in-house database of rRNA sequences. Again, for paired-end reads, if either read of a pair matched a ribosomal sequence, both reads were annotated as being of rRNA in origin. Putative mouse transcript sequences were identified through BLAT sequence similarity searches (bit score ≥50) against a database of mouse genome and transcriptome sequences obtained from ENSEMBL release 67. Depending on sample, 3%–29% of the reads could be annotated as being of putative bacterial mRNA origin (Additional file 1). Phylogenetic annotations were performed by running BLASTX sequence similarity searches against the non-redundant protein database, using the highest scoring alignment (bit score >50). Resulting species were assigned to larger taxonomic groups with reference to the National Center for Biotechnology Information (NCBI) taxonomy tree.
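The quality trimming and length filtering described above can be sketched as follows. This is an illustration rather than the script used in the study, and it rests on two assumptions that the text leaves open: a failing window is removed whole (5 nt at a time), and trimming stops as soon as a window has a mean quality of 20 or more. All function names are hypothetical.

```python
def trim_end_windows(quals, window=5, min_mean=20):
    """Return (start, end) bounds of the read region that survives trimming."""
    start, end = 0, len(quals)
    # Trim from the 5' end while the leading window has a low mean quality.
    while end - start >= window and sum(quals[start:start + window]) / window < min_mean:
        start += window
    # Trim from the 3' end while the trailing window has a low mean quality.
    while end - start >= window and sum(quals[end - window:end]) / window < min_mean:
        end -= window
    return start, end


def trim_read(seq, quals, min_len=50):
    """Trim one read; return None if it falls below the minimum length."""
    start, end = trim_end_windows(quals)
    if end - start < min_len:
        return None
    return seq[start:end], quals[start:end]


def trim_pair(read1, read2, min_len=50):
    """Trim a read pair; discard both mates if either one is too short."""
    t1 = trim_read(read1[0], read1[1], min_len)
    t2 = trim_read(read2[0], read2[1], min_len)
    if t1 is None or t2 is None:
        return None
    return t1, t2


seq = "A" * 76
good_quals = [10] * 5 + [30] * 66 + [5] * 5   # poor bases at both ends only
bad_quals = [10] * 30 + [30] * 46             # long poor-quality 5' stretch
trimmed = trim_read(seq, good_quals)
pair = trim_pair((seq, good_quals), (seq, bad_quals))
```

In this example the first read keeps its central 66 bases, while the second read is reduced to 46 bases, so the pair as a whole is discarded.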
Generation of simulated metatranscriptomic datasets
Simulated metatranscriptomic datasets were generated based on sequence abundance data previously generated from the stool of a female participant of the Human Microbiome Project (HMP) (Biosample identifier: SRS011061). From an original list of 180 species, 72 were associated with a reference genome available from the Human Microbiome Reference Genome database (HMREFG; Additional file 7). For each of these 72 species, annotation files were converted from GenBank format to gtf format for use in FluxSimulator using a custom script. A total of 1.75 million 76-bp single-end and paired-end reads were generated with the proportion of each of the 72 species obtained with reference to the HMP sample. To generate a less complex dataset consisting of ten species, a single species representative was selected from each of the ten most abundant genera identified in the HMP sample. Again, 1.75 million single-end and paired-end reads were generated with the proportion of each of the ten species obtained with reference to the HMP sample. To generate a gold standard assembly for the simulated datasets that takes into account read errors introduced by FluxSimulator, we used a parallelised version of BLAT (https://code.google.com/p/pblat/) to align simulated reads to the set of reference genomes originally used to generate the reads. Since FluxSimulator includes sequence upstream of start codons in the generation of simulated transcripts, it can occasionally result in the generation of reads representing a fusion of two neighboring genes. For the purposes of defining misassemblies, these were ignored.
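To illustrate how a fixed read budget can be divided among species in proportion to their abundance, the sketch below uses largest-remainder rounding so that the counts sum exactly to the target. It is not the custom script used in the study; the species names and abundance values are hypothetical.

```python
def allocate_reads(abundance, total_reads):
    """Split total_reads among species in proportion to their abundance."""
    total = sum(abundance.values())
    exact = {sp: total_reads * a / total for sp, a in abundance.items()}
    # Start from the integer part of each share.
    counts = {sp: int(x) for sp, x in exact.items()}
    leftover = total_reads - sum(counts.values())
    # Give the remaining reads to the species with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda sp: exact[sp] - counts[sp], reverse=True)
    for sp in by_remainder[:leftover]:
        counts[sp] += 1
    return counts


# Hypothetical relative abundances for three species.
abundance = {"species_A": 6.0, "species_B": 3.0, "species_C": 1.0}
counts = allocate_reads(abundance, 1_750_000)
```

With these example abundances, species_A receives 1,050,000 of the 1.75 million reads, species_B 525,000, and species_C 175,000.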
Sequence assembly and mapping reads back to contigs
For Trinity, we used the following parameters: fastq assembly, 16 CPUs for Inchworm and Butterfly, a maximum heap size of 12 GB, and an insert distance of 270 bp for the paired-end assemblies. Only contigs in excess of 50 bp were reported. For Oases, we used version 2.0.8 and varied the minimum and maximum k-mer parameters with values listed in the text. Insert length was defined as 270 bp for the paired-end assemblies. For Metavelvet, we used version 1.2.01 coupled with Velvet version 1.2.07. Velveth was initially run on the fastq files, using the -short parameter for the single-end reads and the -shortPaired parameter for the paired-end reads. Values for minimum and maximum k-mer parameters are listed in the text. Subsequently, velvetg was run using the -exp_cov auto parameter for both single- and paired-end reads. The -ins_length parameter was set to 270 for the paired-end reads. Finally, meta-velvetg was run setting the -ins_length parameter to 270 for the paired assemblies. IDBA-MT version 1.0 was run on contigs initially generated using IDBA-UD version 1.0.9 on paired-end reads using default parameters with an insert length of 270. To map reads to contigs, we applied BWA with default parameters. To calculate the overlap between reads mapping to different contig sets, we calculated the size of the intersection of the reads mapping to the two different assemblies divided by the size of the smallest set of mapped reads.
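The overlap measure defined above, the number of shared mapped reads divided by the size of the smaller set, can be written as a short function. This is a minimal sketch: the read identifiers are hypothetical, and returning 0.0 when one set is empty is an assumption.

```python
def overlap_coefficient(reads_a, reads_b):
    """Shared mapped reads divided by the size of the smaller set."""
    a, b = set(reads_a), set(reads_b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0  # assumption: no overlap is defined for an empty set
    return len(a & b) / smaller


# Hypothetical IDs of reads that mapped to contigs from two assemblies.
trinity_reads = {"r1", "r2", "r3", "r4"}
oases_reads = {"r2", "r3", "r5"}
overlap = overlap_coefficient(trinity_reads, oases_reads)
```

Normalizing by the smaller set means that an assembly whose mapped reads are a strict subset of another's scores 1, regardless of how many more reads the larger assembly captures.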
We would like to acknowledge support from the following sources: Genome Canada, the Ontario Genomics Institute, and the Canadian Institute for Health Research - ‘Leveraging Meta-Transcriptomics For Functional Interrogation Of Microbiomes’ (JP); Juvenile Diabetes Research Foundation - #17-2011-520 (JD and JP); Frederick Banting and Charles Best Canada Graduate Scholarship Doctoral Award (JM). Computing resources were provided by the SciNet HPC Consortium. SciNet is funded by: the Canada Foundation for Innovation under the auspices of Compute Canada, the Government of Ontario, Ontario Research Fund - Research Excellence, and the University of Toronto. We would like to thank Bill Song for informatics support.
- Markle JG, Frank DN, Mortin-Toth S, Robertson CE, Feazel LM, Rolle-Kampczyk U, von Bergen M, McCoy KD, Macpherson AJ, Danska JS: Sex differences in the gut microbiome drive hormone-dependent regulation of autoimmunity. Science. 2013, 339: 1084-1088. 10.1126/science.1233521.
- Wen L, Ley RE, Volchkov PY, Stranges PB, Avanesyan L, Stonebraker AC, Hu C, Wong FS, Szot GL, Bluestone JA, Gordon JI, Chervonsky AV: Innate immunity and intestinal microbiota in the development of type 1 diabetes. Nature. 2008, 455: 1109-1113. 10.1038/nature07336.
- Li E, Hamm CM, Gulati AS, Sartor RB, Chen H, Wu X, Zhang T, Rohlf FJ, Zhu W, Gu C, Robertson CE, Pace NR, Boedeker EC, Harpaz N, Yuan J, Weinstock GM, Sodergren E, Frank DN: Inflammatory bowel diseases phenotype, C. difficile and NOD2 genotype are associated with shifts in human ileum associated microbial composition. PLoS One. 2012, 7: e26284. 10.1371/journal.pone.0026284.
- Loh G, Blaut M: Role of commensal gut bacteria in inflammatory bowel diseases. Gut Microbes. 2012, 3: 544-555. 10.4161/gmic.22156.
- Machiels K, Joossens M, Sabino J, De Preter V, Arijs I, Eeckhaut V, Ballet V, Claes K, Van Immerseel F, Verbeke K, Ferrante M, Verhaegen J, Rutgeerts P, Vermeire S: A decrease of the butyrate-producing species Roseburia hominis and Faecalibacterium prausnitzii defines dysbiosis in patients with ulcerative colitis. Gut. 2014, 63: 1275-1283. 10.1136/gutjnl-2013-304833.
- Ridaura VK, Faith JJ, Rey FE, Cheng J, Duncan AE, Kau AL, Griffin NW, Lombard V, Henrissat B, Bain JR, Muehlbauer MJ, Ilkayeva O, Semenkovich CF, Funai K, Hayashi DK, Lyle BJ, Martini MC, Ursell LK, Clemente JC, Van Treuren W, Walters WA, Knight R, Newgard CB, Heath AC, Gordon JI: Gut microbiota from twins discordant for obesity modulate metabolism in mice. Science. 2013, 341: 1241214. 10.1126/science.1241214.
- Hsiao EY, McBride SW, Hsien S, Sharon G, Hyde ER, McCue T, Codelli JA, Chow J, Reisman E, Petrosino JF, Patterson PH, Mazmanian SK: Microbiota modulate behavioral and physiological abnormalities associated with neurodevelopmental disorders. Cell. 2013, 155: 1451-1463.
- Human Microbiome Project Consortium: Structure, function and diversity of the healthy human microbiome. Nature. 2012, 486: 207-214. 10.1038/nature11234.
- Morgan XC, Tickle TL, Sokol H, Gevers D, Devaney KL, Ward DV, Reyes JA, Shah SA, LeLeiko N, Snapper SB, Bousvaros A, Korzenik J, Sands BE, Xavier RJ, Huttenhower C: Dysfunction of the intestinal microbiome in inflammatory bowel disease and treatment. Genome Biol. 2012, 13: R79. 10.1186/gb-2012-13-9-r79.
- Vaishampayan PA, Kuehl JV, Froula JL, Morgan JL, Ochman H, Francino MP: Comparative metagenomics and population dynamics of the gut microbiota in mother and infant. Genome Biol Evol. 2010, 2: 53-66. 10.1093/gbe/evp057.
- Booijink CC, Boekhorst J, Zoetendal EG, Smidt H, Kleerebezem M, de Vos WM: Metatranscriptome analysis of the human fecal microbiota reveals subject-specific expression profiles, with genes encoding proteins involved in carbohydrate metabolism being dominantly expressed. Appl Environ Microbiol. 2010, 76: 5533-5540. 10.1128/AEM.00502-10.
- Xiong X, Frank DN, Robertson CE, Hung SS, Markle J, Canty AJ, McCoy KD, Macpherson AJ, Poussier P, Danska JS, Parkinson J: Generation and analysis of a mouse intestinal metatranscriptome through Illumina based RNA-sequencing. PLoS One. 2012, 7: e36009. 10.1371/journal.pone.0036009.
- Mason OU, Hazen TC, Borglin S, Chain PS, Dubinsky EA, Fortney JL, Han J, Holman HY, Hultman J, Lamendella R, Mackelprang R, Malfatti S, Tom LM, Tringe SG, Woyke T, Zhou J, Rubin EM, Jansson JK: Metagenome, metatranscriptome and single-cell sequencing reveal microbial response to Deepwater Horizon oil spill. ISME J. 2012, 6: 1715-1727. 10.1038/ismej.2012.59.
- Weckx S, Allemeersch J, Van der Meulen R, Vrancken G, Huys G, Vandamme P, Van Hummelen P, De Vuyst L: Metatranscriptome analysis for insight into whole-ecosystem gene expression during spontaneous wheat and spelt sourdough fermentations. Appl Environ Microbiol. 2011, 77: 618-626. 10.1128/AEM.02028-10.
- Hollibaugh JT, Gifford S, Sharma S, Bano N, Moran MA: Metatranscriptomic analysis of ammonia-oxidizing organisms in an estuarine bacterioplankton assemblage. ISME J. 2011, 5: 866-878. 10.1038/ismej.2010.172.
- Mitra S, Rupek P, Richter DC, Urich T, Gilbert JA, Meyer F, Wilke A, Huson DH: Functional analysis of metagenomes and metatranscriptomes using SEED and KEGG. BMC Bioinformatics. 2011, 12 Suppl 1: S21.
- Namiki T, Hachiya T, Tanaka H, Sakakibara Y: MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012, 40: e155. 10.1093/nar/gks678.
- Schulz MH, Zerbino DR, Vingron M, Birney E: Oases: robust de novo RNA-seq assembly across the dynamic range of expression levels. Bioinformatics. 2012, 28: 1086-1092. 10.1093/bioinformatics/bts094.
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, Adiconis X, Fan L, Raychowdhury R, Zeng Q, Chen Z, Mauceli E, Hacohen N, Gnirke A, Rhind N, di Palma F, Birren BW, Nusbaum C, Lindblad-Toh K, Friedman N, Regev A: Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotech. 2011, 29: 644-652. 10.1038/nbt.1883.
- Leung HC, Yiu SM, Parkinson J, Chin FY: IDBA-MT: de novo assembler for metatranscriptomic data generated from next-generation sequencing technology. J Comput Biol. 2013, 20: 540-550. 10.1089/cmb.2013.0042.
- Thomas T, Gilbert J, Meyer F: Metagenomics - a guide from sampling to data analysis. Microb Inform Exp. 2012, 2: 3. 10.1186/2042-5783-2-3.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1016/S0022-2836(05)80360-2.
- Peregrin-Alvarez JM, Parkinson J: The global landscape of sequence diversity. Genome Biol. 2007, 8: R238. 10.1186/gb-2007-8-11-r238.
- Griebel T, Zacher B, Ribeca P, Raineri E, Lacroix V, Guigo R, Sammeth M: Modelling and simulating generic RNA-Seq experiments with the flux simulator. Nucleic Acids Res. 2012, 40: 10073-10083. 10.1093/nar/gks666.
- Kumar S, Blaxter ML: Comparing de novo assemblers for 454 transcriptome data. BMC Genomics. 2010, 11: 571. 10.1186/1471-2164-11-571.
- Barrett T, Clark K, Gevorgyan R, Gorelenkov V, Gribov E, Karsch-Mizrachi I, Kimelman M, Pruitt KD, Resenchuk S, Tatusova T, Yaschenko E, Ostell J: BioProject and BioSample databases at NCBI: facilitating capture and organization of metadata. Nucleic Acids Res. 2012, 40: D57-D63. 10.1093/nar/gkr1163.
- Li B, Fillmore N, Bai Y, Collins M, Thomson JA, Stewart R, Dewey C: Evaluation of de novo transcriptome assemblies from RNA-Seq data. BioRxiv. 2014, http://dx.doi.org/10.1101/006338.
- Kent WJ: BLAT–the BLAST-like alignment tool. Genome Res. 2002, 12: 656-664. 10.1101/gr.229202. Article published online before March 2002.
- Leinonen R, Sugawara H, Shumway M: The sequence read archive. Nucleic Acids Res. 2011, 39: D19-D21. 10.1093/nar/gkq1019.
- Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fitzgerald S, Gil L, Giron CG, Gordon L, Hourlier T, Hunt S, Johnson N, Juettemann T, Kahari AK, Keenan S, Kulesha E, Martin FJ, Maurel T, McLaren WM, Murphy DN, Nag R, Overduin B, Pignatelli M, Pritchard B, Pritchard E, Riat HS: Ensembl 2014. Nucleic Acids Res. 2014, 42: D749-D755. 10.1093/nar/gkt1196.
- UniProt Consortium: The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010, 38: D142-D148.
- Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Edgar R, Federhen S, Feolo M, Geer LY, Helmberg W, Kapustin Y, Landsman D, Lipman DJ, Madden TL, Maglott DR, Miller V, Mizrachi I, Ostell J, Pruitt KD, Schuler GD, Sequeira E, Sherry ST, Shumway M, Sirotkin K, Souvorov A, Starchenko G, Tatusova TA: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2009, 37: D5-D15. 10.1093/nar/gkn741.
- Human Microbiome Project Consortium: A framework for human microbiome research. Nature. 2012, 486: 215-221. 10.1038/nature11209.
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008, 18: 821-829. 10.1101/gr.074492.107.
- Peng Y, Leung HC, Yiu SM, Chin FY: IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012, 28: 1420-1428. 10.1093/bioinformatics/bts174.
- Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010, 26: 589-595. 10.1093/bioinformatics/btp698.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.