Metatranscriptomic analysis of diverse microbial communities reveals core metabolic pathways and microbiome-specific functionality
© Jiang et al. 2016
Received: 19 August 2015
Accepted: 19 December 2015
Published: 12 January 2016
Metatranscriptomics is emerging as a powerful technology for the functional characterization of complex microbial communities (microbiomes). Use of unbiased RNA-sequencing can reveal both the taxonomic composition and active biochemical functions of a complex microbial community. However, the lack of established reference genomes, computational tools and pipelines make analysis and interpretation of these datasets challenging. Systematic studies that compare data across microbiomes are needed to demonstrate the ability of such pipelines to deliver biologically meaningful insights on microbiome function.
Here, we apply a standardized analytical pipeline to perform a comparative analysis of metatranscriptomic data from diverse microbial communities derived from mouse large intestine, cow rumen, kimchi culture, deep-sea thermal vent and permafrost. Sequence similarity searches allowed annotation of 19 to 76 % of putative messenger RNA (mRNA) reads, with the highest frequency in the kimchi dataset due to its relatively low complexity and availability of closely related reference genomes. Metatranscriptomic datasets exhibited distinct taxonomic and functional signatures. From a metabolic perspective, we identified a common core of enzymes involved in amino acid, energy and nucleotide metabolism and also identified microbiome-specific pathways such as phosphonate metabolism (deep sea) and glycan degradation pathways (cow rumen). Integrating taxonomic and functional annotations within a novel visualization framework revealed the contribution of different taxa to metabolic pathways, allowing the identification of taxa that contribute unique functions.
The application of a single, standard pipeline confirms that the rich taxonomic and functional diversity observed across microbiomes is not simply an artefact of different analysis pipelines but instead reflects distinct environmental influences. At the same time, our findings show how microbiome complexity and availability of reference genomes can impact comprehensive annotation of metatranscriptomes. Consequently, beyond the application of standardized pipelines, additional caution must be taken when interpreting their output and performing downstream, microbiome-specific, analyses. The pipeline used in these analyses along with a tutorial has been made freely available for download from our project website: http://www.compsysbio.org/microbiome.
KeywordsMetatranscriptomics Bioinformatics Systems biology Next generation sequencing RNA sequencing Microbiome
Next generation sequencing technologies have revolutionized the study of complex microbial communities (microbiomes). In the context of human health, composition of the intestinal microbiome has been linked with type I diabetes, inflammatory bowel disease and obesity [1–3]. Many such studies focus on microbial community composition using marker genes such as 16S ribosomal RNA (rRNA) to survey the relative abundance of individual taxa [4–6]. Since multiple combinations of microbial taxa can confer similar metabolic outputs, efforts have begun to define microbiome function through untargeted RNA sequencing (metatranscriptomics) [7–10]. For example, metatranscriptomic analyses have recently revealed the expression of specialized fermentation genes in the production of kimchi  and methylamine degradation pathways in the rumen of the cow .
Illumina sequencing platforms have emerged as leading technologies for metatranscriptomic analysis. In addition to the volume of sequence reads generated, annotation of these complex data is further challenged due to the relatively short sequence lengths . Overcoming these issues requires identification and removal of sequence reads from library adaptors, ribosomal RNA or other sequencing artefacts, transcript assembly, assignment of reads to known functions and taxa and tools that allow the intuitive visualization of the results. To date, metatranscriptomic studies have tended to use a variety of customized scripts and tools to perform filtering, assembly and sequence similarity searches. For example, a kimchi transcriptome dataset used BLASTN sequence similarity searches to filter rRNA reads, the SEED database  for functional annotation and BWA software  to map reads to reference genomes of six representative lactic acid bacterial strains previously associated with the kimchi microbial community . Results were visualized with heatmaps showing the relative expression of genes involved in carbohydrate metabolism. A bovine metatranscriptome study focused on the rumen , assembled sequence reads using the SHE-RA software  performed taxonomic assignments with BLASTX searches against the Genbank RefSeq protein database [15, 16] and functional annotations using the SEED database. Thus, in the absence of analyses being performed using a single standardized software solution, it has been difficult to compare the results of different studies and identify microbiome-specific taxonomic and functional signatures.
A key question is how availability of high quality reference genomes and the complexity of a microbial community impact sequence annotation and inference of biological insight. The broad functional classification schemes in resources such as KEGG, COG and SEED [17–19] provide limited molecular level characterization. Moreover, the field needs to develop statistical approaches that capture significant gene expression differences across metatranscriptomes. To address these limitations, we developed and applied a single standardized pipeline analysis to compare five microbiomes from diverse habitats: deep-sea, permafrost, cow’s rumen, kimchi and mouse cecal content. Our results demonstrate how integration of taxonomic and functional data within a novel visualization framework can provide insight into the taxonomic contributions to biochemical pathways.
Results and discussion
Annotation of metatranscriptomic datasets reflect depth of available reference genomes
Assembled contigs and unassembled reads, representing putative mRNA sequences, were then parsed through a hierarchical annotation pipeline, with unannotated reads passing to the next annotation step. This analysis included (1) mapping of sequences to a reference set of 4443 prokaryotic genomes using the BWA algorithm that relies on near perfect sequence matches (defined here as no more than two base pair mismatches—see Methods) ; (2) sequence similarity searches against the same set of reference genomes using a less stringent BLAT algorithm ; and (3) sequence similarity searches against the protein non-redundant database  using BLASTX . Of the five datasets, the cow rumen samples produced the lowest frequency (19 %) and the kimchi dataset featured the highest frequency (72 %) of mapped reads (Fig. 1c). This latter result is a consequence of 51 % of putative mRNA reads that were mapped to two reference genomes, Lactobacillus sakei and Weissella koreensis. The high proportion of BLAT mappings compared to BWA results suggests genetic variation from the reference strains. BLAT-based mapping identified 24 % of the deep sea, and 12 % of the permafrost datasets, but mapping of the mouse gut and cow rumen samples reads performed better with the least stringent BLASTX algorithm (Fig. 1c). These findings highlight the lack of representative reference genomes for these microbiomes, such that many sequence reads map to homologs from distant relatives of the actual species present in the samples.
These results are broadly consistent with the original reports of these datasets but also highlight important differences produced by the selected analytical pipelines. For example, the cow rumen study , which relied on BLASTX sequence similarity searches with a score cut-off less than e-5, reported ~400,000 reads of putative mRNA origin compared to 452,708 reported here. However, we do note some significant discrepancies. The original study of the kimchi microbiome  applied the BWA algorithm to map 3.9 million reads to six reference strains; here, using the BWA/BLAT/BLASTX pipeline, we mapped 4.8 million reads to bacterial mRNA transcripts. For the deep-sea microbiome, the original analysis applied a combination of the Velvet and Oases assembly algorithms to construct 78,000 contigs with an average contig size of 243 bp . Subsequent sequence similarity searches using the BWA algorithm identified ~81,000 predicted genes, of which only 18,500 were protein coding. In the current study, we identified 643,000 contigs with an N50 of 110 bp with the Trinity assembly. Further, we identified 243,000 unique transcripts by inclusion of 3.0 million reads not assigned to a contig. These differences reflect the often arbitrary choice of parameters and algorithms, usually in the absence of rigorous benchmarking, that can impact coverage and accuracy, and highlight the need for standardized pipelines.
Pathway enrichment analysis identifies tissue specific gene expression in the mouse gut microbiome
In previous studies of the cow rumen, deep-sea and kimchi microbiomes, gene expression was assessed by direct comparisons of raw or normalized read counts [8–10]. In the absence of standardized statistical models to identify significant changes in gene expression from metatranscriptomic datasets, we evaluated three methods previously employed to detect changes in gene expression: DEseq2 , EdgeR  and ANOVA-like differential expression analysis (ALDEx2—). We compared microbial expression patterns between three cecal wall-associated (cecal wall) and four cecal lumen flush derived (cecal flush) microbiomes from four NOD strain mice of identical age and sex which had been prepared with the same RNA extraction protocol. Of the 20,160 non-mouse transcripts identified in these samples (11,231 and 11,015 for cecal wall and cecal flush, respectively), 2087 were shared between sample types. Only five transcripts displayed significant differences in expression between the two types of microbiome samples (Additional file 2). This reflects the large variation observed across animals and tissue samples as defined by a biological coefficient of variation (BCV) of 1.11, where the BCV is a measure of how the (unknown) true abundance of the gene varies between replicate RNA samples (see Methods).
Pathways enriched in transcripts displaying large (>fivefold) differences in relative expression between mouse cecal wall and cecal flush samples
Fold change in expression
Differentially expressed genes
Matched ECs/total ECs in pathway
Genes up-regulated in cecal wall
Genes up-regulated in cecal flush
Carbon fixation in photosynthetic organisms
One carbon pool by folate
Starch and sucrose metabolism
Alanine, aspartate and glutamate metabolism
Citrate cycle (TCA cycle)
Amino sugar and nucleotide sugar metabolism
Valine, leucine and isoleucine biosynthesis
Drug metabolism—other enzymes
Other glycan degradation
Short read data reveals microbiome-specific taxonomic signatures
Next, we examined the contribution of distinct genera to each microbiome (Fig. 2b). Within these ‘abundant’ genera, the deep-sea dataset displayed the largest number of unique taxa (13) while the kimchi dataset displayed the fewest (2; Leuconostoc and Weissellla). Indeed, the kimchi dataset appears dominated by three main taxa. On the other hand, Lactobacillus was well represented across four of the five datasets; although present in the deep-sea dataset, it does not comprise one of the defined, abundant genera in this dataset. We note that Lactobacillus is one of the twelve most abundant genera in our reference datasets (45 genomes) and assignment of a large proportion of reads to this genus may simply reflect that bias, potentially acting as a surrogate taxon for species not represented within our reference datasets. In any event, despite such biases, our pipeline reveals each habitat to possess a unique taxonomic signature with the presence of specific abundant taxa adapted to individual environmental conditions. For example, Weissella is a genus of lactic acid bacteria, first identified in kimchi at 2002, that are regarded as one of the three main genera that are strongly associated with fermentation of kimchi based on both transcriptome or 16S rRNA study [9, 32, 33]. This analysis also shows the value in using a higher level of taxonomic resolution. For example, from Fig. 2a, both the cow rumen and mouse samples reveal the presence of reads from Bacteroidetes; however, deeper analysis reveals such reads to be dominated by Prevotella in the cow rumen sample compared to Bacteroides and Parabacteroides in the mouse samples.
Finally, we examined the performance of the hierarchical annotation pipeline to assign reads to discrete species for each sample (Fig. 2c). To reduce the influence of species with matches involving only a limited number of genes, only species represented by 100 or more transcripts were included in these analyses with the exception of the permafrost sample, the latter due to the low number of putative mRNA reads. The kimchi sample was associated with the simplest community, with 10 species accounting for ~93 % of total reads of putative mRNA origin. These assignments were remarkably consistent with a previous report , with similar abundances for five of the top six most represented taxa. Emphasizing the findings at the genus level, there was no overlap in the ten most abundant species in the mouse and cow datasets despite the phylum/sub-phylum similarities (Fig. 1a). The mouse microbiome samples were obtained from germ-free animals colonized with altered Schaedler flora (ASF) which were defined, before the advent of high through-put sequence analysis, to contain eight known species [34, 35]: Lactobacillus acidophilus, Lactobacillus murinis, Parabacteroides distasonis, Mucispirillum schaedleri, three members of Clostridium cluster XIV and a poorly characterized Firmicute species. Of these previously defined species, only P. diastonis appears significantly represented in our samples. However, previous studies have suggested that Lactobacillus animalis, identified within the samples, is identical to L. murinis , while reads assigned to the poorly classified ‘Clostridium sp.’ may represent the species associated with Clostridium cluster XIV. The additional species presented in Fig. 2c likely represent close relatives to the remaining unaccounted ASF taxa. Conversely and again consistent with a previous study, amongst the top ten most abundant species represented in the cow dataset were those that have previously been associated with the rumen  including bacteria that degrade cellulose and other carbohydrates (Prevotella spp. and Fibrobacter spp.) and those that utilize fatty acids (Succinivibrionaceae spp. and Treponema spp.) as well as the protozoan Oxytricha trifallax, a relative of Oxytricha granulifera, previously reported to occupy the rumen . Similarly, the deep-sea dataset was represented by species previously associated with the marine environment [10, 31] including Alteromonas macleodi, the ammonia oxidizing archaeon—Candidatus nitrospumilis, methanotrophs (Methylomonas methanica and Methylomicrobium alcaliphilum) and the sulphur oxidizing SUP05 . Across samples, we note a varying proportion (from 7 to 87 % for kimchi and cow rumen datasets, respectively) of reads mapping to ‘Others’. These include reads from species with few transcripts and likely false positives, as well as reads associated with a more diverse community. For example, we note that in the deep-sea dataset, 504 species were represented by 100 or more transcripts, with species represented by 10 or fewer transcripts contributing only 3.9 % of the reads, suggesting a highly diverse microbiome. On the other hand, only 67 species were represented by 100 or more transcripts in the cow rumen dataset, with 45 % of the reads contributed by species with 10 or fewer transcripts, suggesting a higher number of false positive assignments. Beyond resorting to more complex phylogenetic mapping solutions such as the naïve Bayes classifier , more sophisticated approaches to resolving such issues of false positive assignments could include examining BLAST-based sequence similarity matches to taxa beyond the first match reported. For example, one source of false positives is reads that map to highly conserved regions of sequences. Such reads are likely to possess many sequence similarity matches with the same BLAST score cut-offs. Through considering abundant taxa identified through mappings to other reads, it is possible to devise an algorithm that selects the most likely match, from a list of matches sharing the same score. In the next section, we explore these issues further through comparing the performance of 16S- and mRNA-derived reads to assess diversity within and between samples.
Consistency of diversity analyses between 16S rRNA and mRNA datasets
Diversity analysis within mice samples and between five samples
Shannon index (mRNA)
Simpson index (mRNA)
Fisher’s alpha (mRNA)
Shannon index (16S rRNA)
Chao1 index (mRNA)
Chao1 index (16S rRNA)
Mouse cecal wall
Mouse cecal flush
Comparing between sequence types, we find broad consistency between the results for the 16S rRNA and mRNA based analyses, with the exception of the mouse samples. For the latter datasets, while the 16S rRNA gene analyses yielded lower diversity metrics for the mouse datasets (reflecting the limited number of taxa associated with the altered Schaedler flora (ASF) used to inoculate germ free mice), the mRNA-based analyses yielded comparatively higher diversity metrics. This is likely due to the challenge of mapping the putative mRNA reads in these datasets to their correct taxa in the absence of ASF reference genomes used for mapping. Instead, reads appear to have been assigned to multiple closely related taxa. We note for example that this does not arise for the kimchi dataset for which there is good representation of reference genomes. Although the 16S rRNA- and mRNA-based diversity and richness analyses are largely consistent, excluding the permafrost dataset, we find that from 56 % (kimchi) to 81 % (mouse) of genera identified from 16S rRNA reads overlap with reads of mRNA origin (Additional File 4). At the same time, we also note many genera predicted by the mRNA reads compared to the 16S rRNA reads, with the former predicting from 83 % (kimchi) to 478 % (deep sea) additional genera. Such additional predictions likely arises from a combination of the lack of a complete set of reference datasets for both mRNA or 16S rRNA reads, as well as mispredictions from the taxonomic annotation pipeline as noted above. Nevertheless, given the consistency in diversity and richness metrics between sequence types for the cow rumen, kimchi and deep-sea datasets, these results suggest that even short-read data derived from mRNA can reveal significant taxonomic differences that reflect genuine differences in habitat.
In the following sections, we show how this information may be leveraged to identify distinct taxonomic contributions towards biochemical activities within a microbiome.
Functional interrogation of metatranscriptome datasets reveals a conserved core of essential metabolic functions supplemented with habitat-specific pathways
Pathways significantly enriched in ‘core’ microbiome enzymes
Core enzymes in pathway
Total enzymes in pathway
Alanine, aspartate and glutamate metabolism
Valine, leucine and isoleucine biosynthesis
Phenylalanine, tyrosine and tryptophan biosynthesis
Pentose phosphate pathway
Carbon fixation pathways in prokaryotes
One carbon pool by folate
Fatty acid biosynthesis
Citrate cycle (TCA cycle)
Amino sugar and nucleotide sugar metabolism
Drug metabolism—other enzymes
Cysteine and methionine metabolism
Polyketide sugar unit biosynthesis
In addition to the core enzymes, we also identified the unique expression of enzymes providing habitat-specific biochemical functions (Additional files 5 and 6). For example, the deep-sea dataset includes enzymes involved in phosphonate metabolism, a significant component of organic phosphorous in the marine environment . Similarly, the glucosyltransferase, levansucrase (EC: 126.96.36.199), was uniquely associated with the kimchi dataset. Levansucrase is involved in the synthesis of glucose polymers and was previously isolated and characterized from a key member of the kimchi community, Leuconostoc mesenteroides . Unique to the cow rumen dataset were pectate di- and tri-saccharide lyases, reflecting the presence of pectin in animal feed and thought to be responsible for supporting the growth of Trepnonema sp. .
In the next section, we combine the taxonomic and metabolic annotation data to examine the contributions of specific taxa to biochemical activities in the sampled microbiomes.
Integration of taxonomic and functional annotations: I. Metabolic networks
While previous microbiome studies have associated shifts in taxonomic distributions and/or biochemical functions with disease states or other evolving habitats, such as the process of fermentation [9, 44, 45], our understanding of the contribution of specific taxa to these functions is limited. In the previous sections, we demonstrated the capacity of short sequence reads associated with metatranscriptomic datasets to provide both taxonomic and functional insights. In the following sections, we show how the integration of such information can be used to derive a more complete understanding of how different taxa contribute towards the biochemical activities of a microbiome.
Comparisons across samples further reveal that as noted above, many pathways are conserved but the taxa responsible for these pathways as well as their relative expression are not conserved (Fig. 3c, Fig. 4 and Additional files 7, 8, 9, 10 and 11). For example within the TCA cycle, relative to the cow rumen dataset, the other three samples feature high expression of enzymes that together comprise the pyruvate dehydrogenase complex involved in anaerobic fermentation, e.g. dihydrolipoyl acetyltransferase (EC: 188.8.131.52), dihydrolipoyl dehydrogenase (EC: 184.108.40.206) and pyruvate decarboxylase (EC: 220.127.116.11). However, whereas Actinobacteria, Bacteroides and Gammaproteobacteria contribute significant reads to these enzymes in the mouse dataset, these enzymes are represented largely by Gammaproteobacteria in the deep-sea dataset and by Leuconostocaceae and Lactobacillaceae in the kimchi dataset. Furthermore, within a sample, we identify pathway sections that feature distinct taxonomic profiles. For example in the mouse intestinal dataset, Clostridiales contribute significantly to pyruvate carboxylase (EC: 18.104.22.168) as well as members of the TCA cycle. Also, noteworthy is the relatively high expression of phosphoenolpyruvate carboxykinase (EC: 22.214.171.124) in the mouse intestinal and cow rumen datasets. Previously associated with Ruminococcus flavefaciens, a Clostridiales bacterium found in the rumen  and Bacteroides fragilis found in the human gut , this enzyme is believed to be involved in the fermentation of cellulose to succinate in the rumen and catalyses phosphoenolpyruvate to oxaloacetate with the concomitant formation of ATP in human gut, may act as a ‘feeder’ reaction for carbon from the TCA cycle to drive various biosynthetic and oxidative processes such as gluconeogenesis and serine synthesis .
Focusing on glycolysis/gluconeogenesis (Additional file 10), as for the TCA cycle, we found that taxonomic groups that dominate the entire datasets also dominate specific enzyme activities. However again, sections of the pathway can be dominated by specific taxa. For example, aldose 1-epimerase (EC: 126.96.36.199) in the cow rumen and l-lactate dehydrogenase (EC: 188.8.131.52) in kimchi are predominantly expressed by Bacterioidetes and Lactobacillaceae, respectively. Further, even apparently minor taxa appear to provide specific functionality, suggestive of keystone roles within the community. For example, in the mouse intestinal dataset, both alcohol dehydrogenase (EC: 184.108.40.206) and aldose 1-epimerase (EC: 220.127.116.11) are predominantly expressed by Lactobacillaceae despite representing only 1.9 % of putative mRNA reads. As a final example of taxonomic contributions to metabolic functionality, we find that for the mouse intestinal dataset, Bacterioidetes and Gammaproteobacteria tend to dominate aspartate metabolism, while Closteridiales dominate glutamate metabolism (Additional File 11). As for the TCA cycle, while the majority of enzymes are well expressed in the mouse intestinal dataset, for the Kimchi dataset, expressions of genes within these pathways are more heterogeneous. This raises an important caveat, notably that the ability to map reads to the enzymes is dependent on the availability of suitable sequences in the reference databases. Hence, an inability to assign reads to asparagine synthase (EC 18.104.22.168) in the kimchi dataset may be due to the inability of sequence searches to map reads from the orthologous genes in the kimchi microbiome to known examples of this enzyme in the reference database.
Integration of taxonomic and functional annotations provides molecular level insights into the biochemical contributions of individual taxa: II. Protein-protein interaction networks
In the deep-sea microbiome dataset, many transporter components were associated with alphaproteobacteria, although leucine, isoleucine and valine transport components (livF-HJK and M) were broadly represented across phyla. In the mouse dataset, alphaproteobacteria were also the main contributors of transporters including dipeptide ABC transporter (dppBD), glutathione ABC transporter (yliABC), leucine ABC transporter (LivMF), glycerol-3-phosphate ABC transporter (ugpC) and xlycose ABC transporter (xylFH). xylF was largely represented by clostridiales and ‘other bacteria’ in the cow rumen dataset, suggesting that the contribution of alphaproteobacteria in the mouse data does not reflect annotation bias. The mouse intestine samples also display Gammaproteobacteria and Actinobacteria contributions to transporter components. Finally, the lack of Bacteroidetes representation in transporter components may reflect the reduced complement of these genes previously noted for members of this phylum [50, 51].
Many genes involved in cell wall biogenesis and cell division were expressed within all datasets (Fig. 5b). Of these, secA, prlA(secY) and ftsZ were the most highly expressed in each dataset. SecA mediate critical roles in protein translocation, and ftsZ is involved in organizing the initial stages of cell division. Within the mouse datasets, few reads from Bacteroidetes were assigned to ftsZ, suggesting that the ortholog(s) within this taxon display significant divergence from their E. coli counterpart. For example, the conserved C-terminus of E. coli ftsZ is absent in Bacteroidetes . Genes encoding proteins involved in later steps of cell division (e.g. ftsN, ftsB, ftsQ and zipA) were largely restricted to representation by Gammaproteobacteria, suggesting these sequences are highly specialized within this taxon. Genes involved in the synthesis of cell wall components (e.g. mur and mrd) were well represented across the datasets, with the mouse and kimchi datasets featuring clear patterns of taxonomic contributions. For example, within the mouse dataset, murCEG were well represented by Bacteroidetes, while for the kimchi dataset, mrdA and mrdB were largely represented by the Lacteobacillaceae, potentially representing altered cell wall composition in these taxa.
Unlike cell wall biogenesis and cell division, genes involved in flagella assembly, chemotaxis and hydogenases were poorly represented in the four datasets (Additional files 12 and 13). For example, both cow and kimchi datasets lacked significant expression of many flagella and chemotaxis genes reflecting an absence of flagella in many of the major taxa in these microbiomes (e.g. Lactobacillus spp. and Leuconostoc spp. in kimchi). Indeed, for kimchi, expression was largely limited to flgJ, a peptidoglycan hydrolase required for flagella formation and likely reflects significant local sequence similarity with other proteins such as N-acetylmuramoyl-l-alanine amidase from L. sakei which shares a conserved, ~200 residue lysine motif with flgJ. In the mouse, we noted little representation from Bacteriodetes, with most expression dominated by Closteridiales. As noted above, the restriction of certain components to Gammaproteobacteria may reflect their relative sequence diversity and/or specialized functions. Finally, we note that four genes were dominated by representation from the Alphaproteobacteria: motA, mbhA, cheY and flip. Such abundance may at least in part be due to variable copy numbers of these genes in this taxon, for example, cheY is present in six copies in the Rhodobacter sphaeroides genome .
Finally, we also explored the expression and taxonomic representation of genes involved in NADH dehydrogenase and hydrogenase complexes (Additional file 13). As for flagella assembly and chemotaxis, many components were not represented within the four samples. For example, 16 of the 50 genes that comprise these complexes lack expression in the kimchi, cow rumen and deep-sea datasets. Indeed, within the kimchi dataset, only tpiA, dps and iscS are well represented. This is likely related to local sequence conservation between the Fe-binding motif of dps and the cysteine desulfurization and conservative C-terminal of iscS, resulting in misannotations. Curiously, while both the cow and deep-sea datasets feature relatively homogenous patterns of taxonomic representation in their respective NADH dehydrogenase subunits, those in the mouse dataset appear largely incongruent.
In this study, we present a standard bioinformatics pipeline to process, annotate and analyse metatranscriptomic datasets. Applied to five disparate metatranscriptomic datasets (mouse cecum, cow rumen, kimchi, deep sea and permafrost), this pipeline captures both common and microbiome-specific taxonomic and functional signatures. In general, each microbiome is dominated by members of four bacterial phyla (Firmicutes, Proteobacteria, Bacteroidetes, Actinobacteria) and one archaeal phylum; however, each microbiome features distinct differences in the relative representation at higher phylogenetic levels (i.e. families and genera). Diversity analyses reveals that mRNA taxonomic representation is broadly congruent with 16S taxonomic representation, with the proviso that a lack of suitable reference genomes can result in mRNA datasets overestimating diversity. Comparisons of microbiome metabolic capacities revealed a core of 592 enzymes common to the four well-sampled microbiomes (i.e. ignoring permafrost), largely associated with housekeeping functions such as carbohydrate, amino acid and nucleotide metabolism. While the concept of ‘core’ bacterial functions have previously been described for individual taxa (e.g. ), this concept has yet to be explored from a metatranscriptomic viewpoint. Such conserved pathways provide a valuable benchmark to assess the quality and coverage of metatranscriptomic datasets. Furthermore, we identified microbiome-specific enzymes reflecting distinct differences in habitat. We choose to compare mouse cecal flush and cecal wall samples to determine whether gene expression is substantially different in the wall-adherent compared to luminal microbiome. Analyses with three established tools identified only a limited set of differentially expressed genes between the cecal wall and cecal flush samples. However, a gene set enrichment approach applying a fold-change metric identified several pathways of differentially expressed genes at these two locations suggesting that biogeographical differences require additional study in mammalian gut microbiomes. Finally, integration of phylogenetic and functional annotations within a systems context provides a powerful route to identify the relationship between taxonomic representation within a microbiome and their contribution to biochemical activities; while dominant taxa appear broadly represented across biochemical pathways, key contributions may be performed by a more limited set of less abundant taxa.
Metatranscriptomic datasets and initial processing
Thirty million pairs of 76 bp reads derived from the luminal contains of the cecal wall and cecal flush of four non-obese diabetic (NOD). Mice were born and maintained in germ-free environment and subsequently colonized with altered Schaedler flora (ASF), a defined community of eight known bacterial species: L. acidophilus, L. murinus, B. distasonis, M. schaedleri, Eubacterium plexicaudatum, an uncharacterized fusiform bacterium and two uncharacterized clostridium species (12 samples total—SRX134832-40, SRX134842 and SRX134844-45; ).
Thirty-five million 101 bp reads derived from a 29 day fermentation of kimchi (SRX128705; ).
Fourteen million pairs of 100 bp reads derived from the rumen of a Holstein dairy cow fed a fat-supplemented diet (SRX196213; )
One hundred three million pairs of 100 bp reads derived from the Guaymas Basin hydrothermal vent (SRX1347659; ).
One hundred thirty-one million pairs of 150 bp reads derived from permafrost soil (SRX119222).
For each sequence, low quality segments (Phred score <15 ) were trimmed using an in-house script and reads <50 bp discarded. Next adaptor contaminants were filtered using cross-match (http://www.phrap.org) with parameters minmatch = 10 and minscore = 20. In addition, due to the large number of low quality reads in the permafrost sample, we applied an in-house script to remove those containing 10 consecutive N’s and/or X’s. rRNA reads were identified and removed by first applying BWA , with a bitscore cut-off of >50, against a database of rRNA genes collated from the SILVA, Greengenes and NCBI databases [57–59]. Additional reads of rRNA and tRNA origin were identified using the Infernal software  with the Rfam database as a reference . For mouse, cow and kimchi datasets, reads of murine, bovine and plant origin were identified and removed through BLAT searches (bitscore cutoff >50) against the mouse genome and transcriptome (build GRCm38 downloaded from Ensemble ), the cow genome and transcriptome (build 6.1 downloaded from NCBI ) and a set of 25 plant genomes and 274 plant transciptomes obtained from the PlantGDB database , respectively.
Assembly and annotation
To increase efficiency of annotation, putative mRNA reads were assembled by the de novo assembly package Trinity . Reads were mapped back to contigs using the Bowtie alignment tool . Sequence annotation was performed using a tiered set of sequence alignment tools: BWA , BLAT  and BLAST . BWA and BLAT alignments were performed using default parameters against a reference database of 4443 prokaryotic genomes (including 1918 plasmid, 152 archaeal and 2373 bacterial genomes) downloaded from the NCBI (February, 2013). For BWA, this translates to no more than two mismatches over the entire alignment, although we note that previous studies suggest that different parameter settings result in highly similar output . Reads that could not be aligned through BWA and BLAT were subject to BLASTX sequence similarity searches against the protein non-redundant database obtained from the NCBI (February, 2013). Two thresholds were used: (1) for reads shorter than 100 nts, read alignments were considered if sequence identity was ≥85 % over >65 % of the read length; and (2) for reads longer than 100 nts, we applied a more stringent bit score cut-off of 60. Enzyme annotations for genes and proteins matching sequence reads was performed using: (1) DETECT enzyme prediction tool  and (2) BLASTP sequence similarity searches against a set of enzymes curated by UniProt (e-value <1e-10) . Where DETECT and BLASTP annotations conflicted, DETECT predictions were assumed to be more reliable . Transcript expression was normalized using reads per kilobase of transcript per million mapped reads (RPKM ).
Analysis of differential expression
Where 1/μ gi is the squared CV for the Poisson distribution and φ g is the squared CV of the unobserved expression values. We therefore estimated the variation per pairwise replicates using the Kruskal-Wallis test and removed three samples displaying extreme variation (p < 0.001): SRX134837, SRX134838 and SRX134842. In addition to the algorithms applied above, we also applied a pathway enrichment analysis of genes displaying at least a fivefold change of expression (defined by the average expression for the four remaining samples (two cecal flush and two cecal wall)) [27, 28]. Here, we applied the hypergeometric test by computing two-tailed p values for differentially expressed genes for reference pathway sets obtained from the KEGG .
Analysis of microbial composition and diversity
Taxonomic classifications of transcripts were derived from the tiered set of read annotation searches with reference to the NCBI taxonomy database. For the comparative tree based analysis presented in Fig. 2b, we included only those genera represented by greater than 100 reads across all five microbiomes (966 genera total). To normalize genus representation across microbiomes, each genus was divided by the average number of reads assigned to each of the 966 genera and only genera with normalized read values in excess of 10 were visualized. Visualization was performed using MEGAN5  in conjunction with the Interactive Tree of Life (iTOL) software  to modify and annotate the resulting phylogenetic tree.
Where S obs is the total number of species observed in all samples, n 1 is the number of singletons (species captured once) and n 2 is the number of doubletons (species captured twice). Diversity indices based on these values were calculated using EstimateS v 9.1  using 100 bootstrap replicates.
Metabolic networks were constructed as previously described : enzymes (EC numbers) are represented as nodes and substrates connecting two enzymes are represented as edges in the network. Enzyme-substrate relationships were inferred from KEGG . Protein-protein interaction (PPI) networks were constructed by homology mapping of E. coli homologs of identified bacterial transcripts using BLAST sequence similarity searches (E-value <1e-10) and layering expression data onto a previously generated network of PPIs for E. coli . To compare expression across microbiomes, RPKM values of each enzyme/E. coli homolog was normalized by employing the min-max scaling method. Networks were visualized using Cytoscape version 3.2.1  and iPath .
This work was supported by grants to JP from the Canadian Institutes of Health Research, Genome Canada and the Ontario Genomics Institute—‘Leveraging Meta-Transcriptomics For Functional Interrogation Of Microbiomes’; the Natural Sciences and Engineering Research Council (RGPIN-2014-06664) and to JSD from the and the Juvenile Diabetes Research Foundation. YJ is a fellow of CIHR STAGE (Strategic Training for Advanced Genetic Epidemiology)—CIHR Training Grant (GET-101831) in Genetic Epidemiology and Statistical Genetics. Computing resources were provided by the SciNet HPC Consortium. SciNet is funded by the Canada Foundation for Innovation under the auspices of Compute Canada, the Government of Ontario, Ontario Research Fund—Research Excellence, and the University of Toronto. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. We thank Drs. Daniel Frank and Charles Robertson for their valuable discussions during the design and implementation of the pipeline and during the drafting of the manuscript.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Frank DN, St Amand AL, Feldman RA, Boedeker EC, Harpaz N, Pace NR. Molecular-phylogenetic characterization of microbial community imbalances in human inflammatory bowel diseases. Proc Natl Acad Sci U S A. 2007;104:13780–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Giongo A, Gano KA, Crabb DB, Mukherjee N, Novelo LL, Casella G, et al. Toward defining the autoimmune microbiome for type 1 diabetes. Isme j. 2011;5:82–91.PubMedPubMed CentralView ArticleGoogle Scholar
- Ley RE, Backhed F, Turnbaugh P, Lozupone CA, Knight RD, Gordon JI. Obesity alters gut microbial ecology. Proc Natl Acad Sci U S A. 2005;102:11070–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Eckburg PB, Bik EM, Bernstein CN, Purdom E, Dethlefsen L, Sargent M, et al. Diversity of the human intestinal microbial flora. Science. 2005;308:1635–8.PubMedPubMed CentralView ArticleGoogle Scholar
- Wen L, Ley RE, Volchkov PY, Stranges PB, Avanesyan L, Stonebraker AC, et al. Innate immunity and intestinal microbiota in the development of type 1 diabetes. Nature. 2008;455:1109–13.PubMedPubMed CentralView ArticleGoogle Scholar
- Subramanian S, Huq S, Yatsunenko T, Haque R, Mahfuz M, Alam MA, et al. Persistent gut microbiota immaturity in malnourished Bangladeshi children. Nature. 2014;510:417–21.PubMedPubMed CentralGoogle Scholar
- Xiong X, Frank DN, Robertson CE, Hung SS, Markle J, Canty AJ, et al. Generation and analysis of a mouse intestinal metatranscriptome through Illumina based RNA-sequencing. PLoS One. 2012;7, e36009.PubMedPubMed CentralView ArticleGoogle Scholar
- Poulsen M, Schwab C, Jensen BB, Engberg RM, Spang A, Canibe N, et al. Methylotrophic methanogenic Thermoplasmata implicated in reduced methane emissions from bovine rumen. Nat Commun. 2013;4:1428.PubMedView ArticleGoogle Scholar
- Jung JY, Lee SH, Jin HM, Hahn Y, Madsen EL, Jeon CO. Metatranscriptomic analysis of lactic acid bacterial gene expression during kimchi fermentation. Int J Food Microbiol. 2013;163:171–9.PubMedView ArticleGoogle Scholar
- Baker BJ, Sheik CS, Taylor CA, Jain S, Bhasi A, Cavalcoli JD, et al. Community transcriptomic assembly reveals microbes that contribute to deep-sea carbon and nitrogen cycling. Isme j. 2013;7:1962–73.PubMedPubMed CentralView ArticleGoogle Scholar
- Celaj A, Markle J, Danska J, Parkinson J. Comparison of assembly algorithms for improving rate of metatranscriptomic functional annotation. Microbiome. 2014;2:39.PubMedPubMed CentralView ArticleGoogle Scholar
- Mitra S, Rupek P, Richter DC, Urich T, Gilbert JA, Meyer F, Wilke A, Huson DH: Functional analysis of metagenomes and metatranscriptomes using SEED and KEGG. BMC Bioinformatics 2011, 12 Suppl 1:S21.Google Scholar
- Li H, Durbin R. Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010;26:589–95.PubMedPubMed CentralView ArticleGoogle Scholar
- Rodrigue S, Materna AC, Timberlake SC, Blackburn MC, Malmstrom RR, Alm EJ, et al. Unlocking short read sequencing for metagenomics. PLoS One. 2010;5, e11840.PubMedPubMed CentralView ArticleGoogle Scholar
- Pruitt KD, Tatusova T, Maglott DR. NCBI reference sequences (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res. 2007;35:D61–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Tatusova T, Ciufo S, Fedorov B, O’Neill K, Tolstoy I. RefSeq microbial genomes database: new representation and annotation strategy. Nucleic Acids Res. 2014;42:D553–9.PubMedPubMed CentralView ArticleGoogle Scholar
- Kanehisa M, Goto S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28:27–30.PubMedPubMed CentralView ArticleGoogle Scholar
- Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, et al. The COG database: an updated version includes eukaryotes. BMC Bioinformatics. 2003;4:41.PubMedPubMed CentralView ArticleGoogle Scholar
- Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, et al. The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res. 2005;33:5691–702.PubMedPubMed CentralView ArticleGoogle Scholar
- Grabherr MG, Haas BJ, Yassour M, Levin JZ, Thompson DA, Amit I, et al. Full-length transcriptome assembly from RNA-Seq data without a reference genome. Nat Biotech. 2011;29:644–52.View ArticleGoogle Scholar
- Kent WJ. BLAT—the BLAST-like alignment tool. Genome Res. 2002;12:656–64.PubMedPubMed CentralView ArticleGoogle Scholar
- Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2006;34:D173–80.PubMedPubMed CentralView ArticleGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215:403–10.PubMedView ArticleGoogle Scholar
- Love MI, Huber W, Anders S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biol. 2014;15:550.PubMedPubMed CentralView ArticleGoogle Scholar
- Robinson MD, McCarthy DJ, Smyth GK. edgeR: a Bioconductor package for differential expression analysis of digital gene expression data. Bioinformatics. 2010;26:139–40.PubMedPubMed CentralView ArticleGoogle Scholar
- Fernandes AD, Reid JN, Macklaim JM, McMurrough TA, Edgell DR, Gloor GB. Unifying the analysis of high-throughput sequencing datasets: characterizing RNA-seq, 16S rRNA gene sequencing and selective growth experiments by compositional data analysis. Microbiome. 2014;2:15.PubMedPubMed CentralView ArticleGoogle Scholar
- Ishii S, Suzuki S, Norden-Krichmar TM, Tenney A, Chain PS, Scholz MB, et al. A novel metatranscriptomic approach to identify gene expression dynamics during extracellular electron transfer. Nat Commun. 2013;4:1601.PubMedView ArticleGoogle Scholar
- St-Pierre C, Brochu S, Vanegas JR, Dumont-Lagace M, Lemieux S, Perreault C. Transcriptome sequencing of neonatal thymic epithelial cells. Sci Rep. 2013;3:1860.PubMedPubMed CentralView ArticleGoogle Scholar
- Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444:1027–31.PubMedView ArticleGoogle Scholar
- Mahowald MA, Rey FE, Seedorf H, Turnbaugh PJ, Fulton RS, Wollam A, et al. Characterizing a model human gut microbiota composed of members of its two dominant bacterial phyla. Proc Natl Acad Sci U S A. 2009;106:5859–64.PubMedPubMed CentralView ArticleGoogle Scholar
- Lesniewski RA, Jain S, Anantharaman K, Schloss PD, Dick GJ. The metatranscriptome of a deep-sea hydrothermal plume is dominated by water column methanotrophs and lithotrophs. ISME J. 2012;6:2257–68.PubMedPubMed CentralView ArticleGoogle Scholar
- Lee JS, Lee KC, Ahn JS, Mheen TI, Pyun YR, Park YH. Weissella koreensis sp. nov., isolated from kimchi. Int J Syst Evol Microbiol. 2002;52:1257–61.PubMedGoogle Scholar
- Kim M, Chun J. Bacterial community structure in kimchi, a Korean fermented vegetable food, as revealed by 16S rRNA gene analysis. Int J Food Microbiol. 2005;103:91–6.PubMedView ArticleGoogle Scholar
- Dewhirst FE, Chien CC, Paster BJ, Ericson RL, Orcutt RP, Schauer DB, et al. Phylogeny of the defined murine microbiota: altered Schaedler flora. Appl Environ Microbiol. 1999;65:3287–92.PubMedPubMed CentralGoogle Scholar
- Robertson BR, O’Rourke JL, Neilan BA, Vandamme P, On SL, Fox JG, et al. Mucispirillum schaedleri gen. nov., sp. nov., a spiral-shaped bacterium colonizing the mucus layer of the gastrointestinal tract of laboratory rodents. Int J Syst Evol Microbiol. 2005;55:1199–204.PubMedView ArticleGoogle Scholar
- de Menezes AB, Lewis E, O’Donovan M, O’Neill BF, Clipson N, Doyle EM. Microbiome analysis of dairy cows fed pasture or total mixed ration diets. FEMS Microbiol Ecol. 2011;78:256–65.PubMedView ArticleGoogle Scholar
- Wright AD, Dehority BA, Lynn DH. Phylogeny of the rumen ciliates Entodinium, Epidinium and Polyplastron (Litostomatea:Entodiniomorphida) inferred from small subunit ribosomal RNA sequences. J Eukaryot Microbiol. 1997;44:61–7.PubMedView ArticleGoogle Scholar
- Walsh DA, Zaikova E, Howes CG, Song YC, Wright JJ, Tringe SG, et al. Metagenome of a versatile chemolithoautotroph from expanding oceanic dead zones. Science. 2009;326:578–82.PubMedView ArticleGoogle Scholar
- Rosen GL, Reichenberger ER, Rosenfeld AM. NBC: the Naïve Bayes Classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics. 2011;27:127–9.PubMedPubMed CentralView ArticleGoogle Scholar
- Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, et al. From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006;34:D354–7.PubMedPubMed CentralView ArticleGoogle Scholar
- Villarreal-Chiu JF, Quinn JP, McGrath JW. The genes and enzymes of phosphonate metabolism by bacteria, and their distribution in the marine environment. Front Microbiol. 2012;3:19.PubMedPubMed CentralView ArticleGoogle Scholar
- Kang HK, Seo MY, Seo ES, Kim D, Chung SY, Kimura A, et al. Cloning and expression of levansucrase from Leuconostoc mesenteroides B-512 FMC in Escherichia coli. Biochim Biophys Acta. 2005;1727:5–15.PubMedView ArticleGoogle Scholar
- Liu J, Pu Y-Y, Xie Q, Wang J-K, Liu J-X. Pectin induces an in vitro rumen microbial population shift attributed to the pectinolytic Treponema group. Curr Microbiol. 2015;70:67–74.PubMedView ArticleGoogle Scholar
- Markle JG, Frank DN, Mortin-Toth S, Robertson CE, Feazel LM, Rolle-Kampczyk U, et al. Sex differences in the gut microbiome drive hormone-dependent regulation of autoimmunity. Science. 2013;339:1084–8.PubMedView ArticleGoogle Scholar
- Ridaura VK, Faith JJ, Rey FE, Cheng J, Duncan AE, Kau AL, et al. Gut microbiota from twins discordant for obesity modulate metabolism in mice. Science. 2013;341:1241214.PubMedView ArticleGoogle Scholar
- Schocke L, Weimer PJ. Purification and characterization of phosphoenolpyruvate carboxykinase from the anaerobic ruminal bacterium Ruminococcus flavefaciens. Arch Microbiol. 1997;167:289–94.PubMedView ArticleGoogle Scholar
- Macy JM, Ljungdahl LG, Gottschalk G. Pathway of succinate and propionate formation in Bacteroides fragilis. J Bacteriol. 1978;134:84–91.PubMedPubMed CentralGoogle Scholar
- Yang J, Kalhan SC, Hanson RW. What is the metabolic role of phosphoenolpyruvate carboxykinase? J Biol Chem. 2009;284:27025–9.PubMedPubMed CentralView ArticleGoogle Scholar
- Peregrin-Alvarez JM, Xiong X, Su C, Parkinson J. The modular organization of protein interactions in Escherichia coli. PLoS Comput Biol. 2009;5, e1000523.PubMedPubMed CentralView ArticleGoogle Scholar
- Jangir PK, Singh A, Shivaji S, Sharma R. Genome sequence of the alkaliphilic bacterium Nitritalea halalkaliphila type strain LW7, isolated from Lonar Lake, India. J Bacteriol. 2012;194:5688–9.PubMedPubMed CentralView ArticleGoogle Scholar
- Bauer M, Kube M, Teeling H, Richter M, Lombardot T, Allers E, et al. Whole genome analysis of the marine Bacteroidetes‘Gramella forsetii’ reveals adaptations to degradation of polymeric organic matter. Environ Microbiol. 2006;8:2201–13.PubMedView ArticleGoogle Scholar
- Vaughan S, Wickstead B, Gull K, Addinall SG. Molecular evolution of FtsZ protein sequences encoded within the genomes of archaea, bacteria, and eukaryota. J Mol Evol. 2004;58:19–29.PubMedView ArticleGoogle Scholar
- Ferre A, De La Mora J, Ballado T, Camarena L, Dreyfus G. Biochemical study of multiple CheY response regulators of the chemotactic pathway of Rhodobacter sphaeroides. J Bacteriol. 2004;186:5172–7.PubMedPubMed CentralView ArticleGoogle Scholar
- Peregrin-Alvarez JM, Sanford C, Parkinson J. The conservation and evolutionary modularity of metabolism. Genome Biol. 2009;10:R63.PubMedPubMed CentralView ArticleGoogle Scholar
- Leinonen R, Sugawara H, Shumway M. The sequence read archive. Nucleic Acids Res. 2011;39:D19–21.PubMedPubMed CentralView ArticleGoogle Scholar
- Ewing B, Hillier L, Wendl MC, Green P. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 1998;8:175–85.PubMedView ArticleGoogle Scholar
- Sayers EW, Barrett T, Benson DA, Bryant SH, Canese K, Chetvernin V, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2009;37:D5–15.PubMedPubMed CentralView ArticleGoogle Scholar
- McDonald D, Price M, Goodrich J, Nawrocki E, DeSantis T, Probst A, et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J. 2011;6:610–8.PubMedPubMed CentralView ArticleGoogle Scholar
- Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, et al. SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007;35:7188–96.PubMedPubMed CentralView ArticleGoogle Scholar
- Nawrocki EP, Eddy SR. Infernal 1.1: 100-fold faster RNA homology searches. Bioinformatics. 2013;29:2933–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR. Rfam: an RNA family database. Nucleic Acids Res. 2003;31:439–41.PubMedPubMed CentralView ArticleGoogle Scholar
- Flicek P, Amode MR, Barrell D, Beal K, Billis K, Brent S, et al. Ensembl 2014. Nucleic Acids Res. 2014;42:D749–55.PubMedPubMed CentralView ArticleGoogle Scholar
- Pruitt KD, Tatusova T, Brown GR, Maglott DR. NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012;40:D130–5.PubMedPubMed CentralView ArticleGoogle Scholar
- Duvick J, Fu A, Muppirala U, Sabharwal M, Wilkerson MD, Lawrence CJ, et al. PlantGDB: a resource for comparative plant genomics. Nucleic Acids Res. 2008;36:D959–65.PubMedPubMed CentralView ArticleGoogle Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–9.PubMedPubMed CentralView ArticleGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.PubMedPubMed CentralView ArticleGoogle Scholar
- Tam S, Tsao M-S, McPherson JD. Optimization of miRNA-seq data preprocessing. Brief Bioinform. 2015;16:950–63.PubMedPubMed CentralView ArticleGoogle Scholar
- Hung SS, Wasmuth J, Sanford C, Parkinson J. DETECT—a density estimation tool for enzyme classification and its application to Plasmodium falciparum. Bioinformatics. 2010;26:1690–8.PubMedView ArticleGoogle Scholar
- UniProt C. The Universal Protein Resource (UniProt) in 2010. Nucleic Acids Res. 2010;38:D142–8.View ArticleGoogle Scholar
- Mortazavi A, Williams BA, McCue K, Schaeffer L, Wold B. Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat Methods. 2008;5:621–8.PubMedView ArticleGoogle Scholar
- McCarthy DJ, Chen Y, Smyth GK. Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation. Nucleic Acids Res. 2012;40:4288–97.PubMedPubMed CentralView ArticleGoogle Scholar
- Huson DH, Auch AF, Qi J, Schuster SC. MEGAN analysis of metagenomic data. Genome Res. 2007;17:377–86.PubMedPubMed CentralView ArticleGoogle Scholar
- Letunic I, Bork P. Interactive Tree Of Life (iTOL): an online tool for phylogenetic tree display and annotation. Bioinformatics. 2007;23:127–8.PubMedView ArticleGoogle Scholar
- Chao A, Shen T-J. Nonparametric estimation of Shannon’s index of diversity when there are unseen species in sample. Environ Ecol Stat. 2003;10:429–43.View ArticleGoogle Scholar
- Magurran AE. Measuring biological diversity. Hoboken, New Jersey: Wiley-Blackwell; 2004.Google Scholar
- Chao A. Estimating the population size for capture-recapture data with unequal catchability. Biometrics. 1987;43:783–91.PubMedView ArticleGoogle Scholar
- Thureborn P, Lundin D, Plathan J, Poole AM, Sjöberg B-M, Sjöling S. A metagenomics transect into the deepest point of the Baltic Sea reveals clear stratification of microbial functional capacities. PLoS One. 2013;8, e74983.PubMedPubMed CentralView ArticleGoogle Scholar
- Gołębiewski M, Deja-Sikora E, Cichosz M, Tretyn A, Wróbel B. 16S rDNA pyrosequencing analysis of bacterial community in heavy metals polluted soils. Microb Ecol. 2014;67:635–47.PubMedPubMed CentralView ArticleGoogle Scholar
- Davenport CF, Neugebauer J, Beckmann N, Friedrich B, Kameri B, Kokott S, et al. Genometa—a fast and accurate classifier for short metagenomic shotgun reads. PLoS One. 2012;7, e41224.PubMedPubMed CentralView ArticleGoogle Scholar
- Huang Y, Niu B, Gao Y, Fu L, Li W. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics. 2010;26:680–2.PubMedPubMed CentralView ArticleGoogle Scholar
- Colwell RK. EstimateS: Statistical estimation of species richness and shared species from samples. 2005.Google Scholar
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–504.PubMedPubMed CentralView ArticleGoogle Scholar
- Yamada T, Letunic I, Okuda S, Kanehisa M, Bork P. iPath2.0: interactive pathway explorer. Nucleic Acids Res. 2011;39:W412–5.PubMedPubMed CentralView ArticleGoogle Scholar