The Amazon continuum dataset: quantitative metagenomic and metatranscriptomic inventories of the Amazon River plume, June 2010

Background: The Amazon River is by far the world’s largest river in terms of volume and watershed area, generating a fluvial export that accounts for about a fifth of riverine input into the world’s oceans. Marine microbial communities of the Western Tropical North Atlantic Ocean are strongly affected by the terrestrial materials carried by the Amazon plume, including dissolved organic carbon (DOC), particulate organic carbon (POC), and inorganic nutrients, with impacts on primary productivity and carbon sequestration.
Results: We inventoried genes and transcripts at six stations in the Amazon River plume during June 2010. At each station, internal standard-spiked metagenomes, non-selective metatranscriptomes, and poly(A)-selective metatranscriptomes were obtained in duplicate for two discrete size fractions (0.2 to 2.0 μm and 2.0 to 156 μm) using 150 × 150 paired-end Illumina sequencing. Following quality control, the dataset contained 360 million reads of approximately 200 bp average size from Bacteria, Archaea, Eukarya, and viruses. Bacterial metagenomes and metatranscriptomes were dominated by Synechococcus, Prochlorococcus, SAR11, SAR116, and SAR86, with high contributions from SAR324 and Verrucomicrobia at some stations. Diatoms, green picophytoplankton, dinoflagellates, haptophytes, and copepods dominated the eukaryotic genes and transcripts. Gene expression ratios differed by station, size fraction, and microbial group, with transcription levels varying over three orders of magnitude across taxa and environments.
Conclusions: This first comprehensive inventory of microbial genes and transcripts, benchmarked with internal standards for full quantitation, is generating novel insights into biogeochemical processes of the Amazon plume and improving prediction of climate change impacts on the marine biosphere.


Background
The Amazon River runs nearly 6,500 km across the South American continent before emptying into the Western Tropical North Atlantic Ocean; in terms of both volume and watershed area it is the world's largest riverine system [1]. The river carries a significant load of terrestrially derived nutrients to the ocean, with global consequences for marine primary productivity and carbon sequestration [2,3]. Productive phytoplankton blooms harboring cyanobacteria, coastal diatom species, and oceanic diatoms with endosymbiotic diazotrophs take advantage of the riverine nutrient supplements and enhance carbon export from the upper ocean to deeper waters via sinking particles [3,4]. Heterotrophic bacteria also remineralize organic nutrients in the plume, further fueling primary production and increasing the flux of organic material to deep water.
We inventoried the microbial genes and transcripts at six stations in the Amazon River plume aboard the R/V Knorr between 22 May and 25 June 2010 (Figure 1) using Illumina sequencing with 150 × 150 bp overlapping paired-end reads. Metagenomic and metatranscriptomic data have typically been analyzed within a relative framework (that is, % of metagenome and % of metatranscriptome), but this approach is problematic for dynamic communities because a change in the abundance of one type of gene or transcript imposes a change in the percent contribution of the others. By incorporating internal standards, we are able to assess meta-omics datasets within an absolute framework that facilitates comparisons of communities sampled at different times and places in the environment. In the Amazon plume sequence libraries, known copy numbers of internal standards were added at the initiation of sample processing and consisted of genomic DNA from an exotic bacterium for the metagenomes (Thermus thermophilus HB8) and artificial mRNAs and poly(A)-tailed mRNAs for the metatranscriptomes; these standards were identified, counted, and removed from the natural sequences during quality control steps.
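The limitation of the relative framework can be seen in a minimal sketch. The gene names and copy numbers below are made up for illustration and are not data from this study: only one gene changes in absolute abundance between the two hypothetical samples, yet the percent contribution of every gene shifts.

```python
def relative_abundance(counts):
    """Convert absolute counts to percent of the sample total."""
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}

# Hypothetical absolute abundances (copies per liter) in two samples.
# Only gene_A changes; gene_B and gene_C are held constant.
sample_1 = {"gene_A": 100.0, "gene_B": 100.0, "gene_C": 200.0}
sample_2 = {"gene_A": 500.0, "gene_B": 100.0, "gene_C": 200.0}

rel_1 = relative_abundance(sample_1)
rel_2 = relative_abundance(sample_2)

for gene in sample_1:
    print(
        f"{gene}: absolute {sample_1[gene]:.0f} -> {sample_2[gene]:.0f}, "
        f"relative {rel_1[gene]:.1f}% -> {rel_2[gene]:.1f}%"
    )
```

In this toy case gene_B appears to halve in the relative view (25.0% to 12.5%) although its absolute abundance is unchanged, which is the kind of artifact that internal standards are meant to avoid.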
For each station, metagenomes and non-selective metatranscriptomes were each obtained in duplicate for two discrete size fractions (0.2 to 2.0 μm and 2.0 to 156 μm), while poly(A)-selective metatranscriptomes were obtained in duplicate only for the 2.0 to 156 μm size fraction (to increase coverage of the eukaryotic community), resulting in a total of 60 datasets (6 stations × 5 data types × 2 replicates) (Table 1). The data collection consisted of 360 million reads following quality control (removal of poor quality reads, removal of rRNAs from metatranscriptomes, removal of internal standards, and joining of overlapping 150 bp paired ends) and provides an unprecedented view of the metabolic functions of the Bacteria, Archaea, and Eukarya mediating carbon and nutrient cycling in the Amazon River plume.
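The sampling design described above can be enumerated in a short sketch. The station numbers are those mentioned in the text of this article; the dictionary keys and size-fraction labels are illustrative names, not identifiers used in the deposited data.

```python
from itertools import product

# Station numbers as they appear in the text of this article.
stations = [2, 3, 10, 23, 25, 27]

# Five data types: each library type paired with the size fraction(s) it was
# prepared from. Poly(A)-selective libraries exist only for the larger fraction.
data_types = [
    ("metagenome", "0.2-2.0 um"),
    ("metagenome", "2.0-156 um"),
    ("non-selective metatranscriptome", "0.2-2.0 um"),
    ("non-selective metatranscriptome", "2.0-156 um"),
    ("poly(A)-selective metatranscriptome", "2.0-156 um"),
]

replicates = [1, 2]

datasets = [
    {"station": s, "library": lib, "size_fraction": frac, "replicate": r}
    for s, (lib, frac), r in product(stations, data_types, replicates)
]

print(len(datasets))  # 6 stations x 5 data types x 2 replicates = 60
```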

Methods
Detailed sample collection and processing methodology can be found in Additional file 1. Sample sites in the Amazon River plume were chosen to represent a range of salinity, nutrient concentrations, and microbial communities (Additional file 2). Microbial cells were collected by filtration and preserved in RNAlater (Applied Biosystems, Austin, TX, USA). During sample processing, internal standards were added to each sample prior to cell lysis. Samples collected for non-selective metatranscriptomics were processed by extracting total RNA, removing residual DNA, depleting rRNA, linearly amplifying the remaining transcripts, and making double-stranded cDNA for library preparation and sequencing. Poly(A)-selective metatranscriptome samples were processed similarly except that poly(A)-tailed mRNAs were selectively isolated, eliminating the need for rRNA depletion steps. Metagenomic samples were processed by extracting DNA and removing residual proteins and RNA. Following sample processing, cDNA or DNA was sheared and libraries were constructed for paired-end sequencing (150 × 150) using the Genome Analyzer IIx, HiSeq 2000, MiSeq, or HiSeq 2500 platform (Illumina Inc., San Diego, CA, USA).
From 60 samples, we obtained 8.21 × 10⁸ raw sequences containing 1.23 × 10¹¹ nt. Following sequence quality control, 3.59 × 10⁸ reads with a mean length of 195 bp were obtained. Internal standards were quantified and removed, along with any remaining rRNA sequences. Remaining reads were annotated against the RefSeq Protein database or a custom marine database using RAPSearch2 [5], and abundance per liter was calculated based on internal standard recovery [6] (Additional file 2).
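The per-liter calculation can be sketched as follows. This is a simplified, generic formulation of internal standard scaling, not necessarily the exact procedure of reference [6]; the function name, parameter names, and all numbers are hypothetical.

```python
def copies_per_liter(target_reads, standard_reads, standard_copies_added,
                     liters_filtered):
    """Scale a read count to copies per liter using internal standard recovery.

    Assumes that natural sequences and the internal standard are recovered
    with the same efficiency from lysis through sequencing.
    """
    if standard_reads <= 0:
        raise ValueError("no internal standard reads recovered")
    if liters_filtered <= 0:
        raise ValueError("filtered volume must be positive")
    # Number of original molecules represented by each sequenced read.
    copies_per_read = standard_copies_added / standard_reads
    return target_reads * copies_per_read / liters_filtered

# Hypothetical example: 1e9 standard copies added, 500 standard reads
# recovered, 2,000 reads assigned to a gene of interest, 10 L filtered.
abundance = copies_per_liter(
    target_reads=2000,
    standard_reads=500,
    standard_copies_added=1e9,
    liters_filtered=10.0,
)
print(f"{abundance:.2e} copies per liter")
```

With these made-up inputs, each recovered read stands for 2 × 10⁶ original molecules, so 2,000 reads from 10 L correspond to 4 × 10⁸ copies per liter.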
Biological and chemical data measured concurrently with sample collection provide environmental context for the sequence data. These metadata include temperature, salinity, oxygen concentration, irradiance, chlorophyll concentration, nutrient concentrations, and bacterial abundance and production (Additional file 2). Datasets describing the phytoplankton communities and other features of the June 2010 plume ecosystem have been previously published [1,4,7,8].

Quality assurance
The She-ra program [9] was used to join the paired-end Illumina reads using the default parameters and a quality metric score of 0.5. Seqtrim [10] was used to trim the joined reads using the default parameters. rRNA and internal standard sequences were identified in the metatranscriptomes using a Blastn search against a custom database containing representative rRNA sequences and internal standard sequences; sequences with a bit score ≥ 50 were identified as either rRNA or internal standards and removed from the datasets. Internal standards were identified in metagenomes by first performing a Blastn search (bit score cutoff ≥ 50) against the T. thermophilus HB8 genome. Hits were subsequently queried against the RefSeq protein database using Blastx (bit score cutoff ≥ 40) to identify and quantify all T. thermophilus HB8 protein encoding reads, and these reads were removed from the datasets.
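The bit score filter can be illustrated with a minimal sketch. It assumes the search results are available in the standard 12-column tabular BLAST layout, in which the bit score is the last column; the article does not state which output format was used, and the read and reference names below are invented.

```python
import csv
import io

# 0-based index of the bit score in the standard 12-column tabular layout:
# qseqid sseqid pident length mismatch gapopen qstart qend sstart send
# evalue bitscore
BITSCORE_COLUMN = 11

def reads_to_remove(blast_tabular, min_bitscore=50.0):
    """Return the IDs of query reads with at least one hit at or above the cutoff."""
    flagged = set()
    for row in csv.reader(blast_tabular, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[BITSCORE_COLUMN]) >= min_bitscore:
            flagged.add(row[0])
    return flagged

# Invented example hits: read_1 clearly passes, read_2 falls below the
# cutoff, and read_3 sits exactly on the cutoff and is therefore flagged.
example = io.StringIO(
    "read_1\trRNA_ref\t98.0\t150\t3\t0\t1\t150\t10\t159\t1e-70\t260.0\n"
    "read_2\trRNA_ref\t85.0\t40\t6\t0\t1\t40\t300\t339\t0.002\t38.2\n"
    "read_3\tstd_ref\t90.0\t60\t6\t0\t1\t60\t5\t64\t1e-08\t50.0\n"
)

flagged = reads_to_remove(example, min_bitscore=50.0)
print(sorted(flagged))
```

The same function could be reused with `min_bitscore=40.0` for the Blastx step, since only the cutoff differs between the two searches described above.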

Initial findings
Metagenomic reads from surface waters of the six Amazon River plume stations were assigned to bacterial, archaeal, eukaryotic, and viral taxa based on best hits to reference genomes. Among autotrophic bacteria, Synechococcus was the largest contributor to the metagenomes at locations closest to the river mouth (Stations 10 and 3; approximately 1.5 × 10¹² genes L⁻¹) and was replaced by Prochlorococcus at more oceanic locations (Stations 25 and 27) (Table 2). Among heterotrophic bacteria, SAR86 had the largest gene abundance closest to the river mouth (Station 10; approximately 8.6 × 10¹¹ genes L⁻¹). SAR11 clade members (HTCC7211, HIMB5) were also abundant here, and became the dominant contributor of heterotrophic bacterial genes at more oceanic stations (up to 5.7 × 10¹² genes L⁻¹) (Table 2). Genes binning to SAR324 genomes were abundant at three stations (Stations 2, 3, and 23; Table 2), with the Amazon plume sequences aligning with heterotrophic members of this group [11]. Station 2 had a distinctive bacterial community relative to the other plume stations, dominated by genes from Verrucomicrobia related to Coraliomargarita akajimensis DSM 45221 and strain DG1235 and with substantial contributions from SAR116 taxa (IMCC1322, HIMB100). Coraliomargarita akajimensis DSM 45221 was also among the most abundant genome bins at Station 25 (Table 2). Among eukaryotic taxa, diatoms and the green alga Micromonas contributed the greatest number of genes at lower salinities, while haptophytes (binning to Phaeocystis antarctica), dinoflagellates (binning to Alexandrium tamarense CCMP1771), and relatives of the green alga Pyramimonas obovata CCMP722 increased in importance at more saline stations (Table 2). Among Archaea, members of the ammonia-oxidizing genus Nitrosopumilus and related genera contributed the most genes at stations closest to the river mouth, although they were 100-fold less abundant than the most abundant bacterial taxa. There were very few archaeal genes at the outermost stations (Stations 25 and 27), and these binned largely to methanogen sequences. The viral sequences were dominated by cyanobacterial phages (Table 2).
Bacterial, archaeal, and viral reads were annotated against the NCBI RefSeq database. Eukaryotic reads were annotated against a custom database containing marine eukaryotic genomes and transcriptomes from NCBI and 112 of the Marine Microbial Eukaryote Transcriptome Sequencing Project datasets that were public at the time of analysis (http://marinemicroeukaryotes.org).
Patterns of gene and transcript abundance provided insights into transcriptional activity by taxon and habitat (that is, cells that were free-living versus those that were particle-associated) for the dominant bacterial groups. Verrucomicrobia (Order Puniceicoccales) maintained cellular transcript inventories of up to 14 transcripts/gene in particle-associated cells and averaged 2 transcripts/gene overall (Figure 2). In contrast, members of the Flavobacteria class averaged < 0.5 transcripts/gene. Particle-associated cells in each of these major taxa typically had more transcripts per gene copy than did free-living cells (averaging 2.0 versus 0.15 transcripts/gene) (Figure 2). Abundance of transcripts originating from particle-associated versus free-living bacteria varied along the plume, with mRNAs from free-living cells contributing only 30 to 60% of the metatranscriptome in landward stations, but > 90% at outer plume stations. Environmental data (Additional file 2) indicate that Station 10 had the lowest salinity (22.6) and Station 27 the highest (36.0). Station 10 was the most strongly influenced by riverine inputs, particularly of inorganic nitrogen.
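The transcripts-per-gene ratios discussed above follow directly from paired per-liter inventories. The sketch below uses a placeholder taxon name and invented inventory values chosen only to reproduce ratios of the same magnitude as those reported; they are not measurements from this dataset.

```python
def transcripts_per_gene(transcripts_per_liter, genes_per_liter):
    """Ratio of transcript copies to gene copies for one taxon and size fraction."""
    if genes_per_liter <= 0:
        raise ValueError("gene inventory must be positive")
    return transcripts_per_liter / genes_per_liter

# Invented inventories: (transcripts per liter, genes per liter).
inventories = {
    ("taxon_X", "particle-associated"): (2.8e11, 2.0e10),
    ("taxon_X", "free-living"): (3.0e10, 2.0e11),
}

ratios = {
    key: transcripts_per_gene(transcripts, genes)
    for key, (transcripts, genes) in inventories.items()
}

for (taxon, fraction), ratio in ratios.items():
    print(f"{taxon} ({fraction}): {ratio:.2f} transcripts/gene")
```

Because both inventories are expressed per liter, the ratio is independent of the volume filtered and can be compared across stations and size fractions.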

Future directions
The Amazon River plume is immense in scale and sensitive to anthropogenic forcing. This multi-omics dataset is the first of four high-throughput metagenomic and metatranscriptomic sequence collections being produced for the Amazon River Continuum as part of the ANACONDAS and ROCA projects (http://amazoncontinuum.org). These projects aim to improve predictive capabilities for climate change impacts on the marine biosphere, focusing on the Amazon ecosystem, and to better our understanding of feedbacks on the carbon cycle. Processes in the river and ocean are tightly linked from physical, biological, and biogeochemical perspectives. Thus, the complete data collection will include two datasets from the Amazon plume (June 2010 and July 2013) and two from the Amazon River (Óbidos to Macapá and Belém; June 2011 and July 2013). These high-coverage, size-discrete, and replicated datasets are all benchmarked with internal genomic and mRNA standards for comparative quantitative metagenomics and metatranscriptomics. Insights from these meta-omics datasets are enhancing predictive capabilities regarding the interplay between marine microbial communities, biogeochemical cycling, and carbon sequestration in the ocean.

Figure 2. Inventories of genes and transcripts for eight bacterial taxa in surface waters of the Amazon plume. Symbols represent the mean of duplicate analyses at six stations, color-coded by taxon and size fraction (particle-associated or free-living). Lines indicate a 1:1 ratio of transcripts:genes (black) or 10:1 and 1:10 ratios (gray). The purple line indicates the ratio of transcripts:genes for exponentially growing laboratory cultures of Escherichia coli [12,13]. Dominant bacterial groups are as follows: Oscillatoriales = Trichodesmium; Prochlorales = Prochlorococcus; Chroococcales = Synechococcus; Nostocales = Richelia; Puniceicoccales = Verrucomicrobia.