Comparative genomics of planktonic Flavobacteriaceae from the Gulf of Maine using metagenomic data
© Tully et al.; licensee BioMed Central Ltd. 2014
Received: 16 April 2014
Accepted: 20 August 2014
Published: 5 September 2014
The Gulf of Maine is an important biological province of the Northwest Atlantic with high productivity year round. From an environmental Sanger-based metagenome, sampled in summer and winter, we were able to assemble and explore the partial environmental genomes of uncultured members of the class Flavobacteria. Each of the environmental genomes represents organisms that compose less than 1% of the total microbial metagenome.
Four partial environmental genomes were assembled with varying degrees of estimated completeness (37%–84% complete) and were analyzed from a perspective of gathering information regarding niche partitioning between co-occurring organisms. Comparative genomics revealed potentially important niche partitioning genomic variations, including iron transporters and genes associated with cell attachment and polymer degradation. Analysis of large syntenic regions helped reveal potentially ecologically relevant variations for Flavobacteriaceae in the Gulf of Maine, such as arginine biosynthesis, and identify a putative genomic island incorporating novel exogenous genes from the environment.
Biogeographic analysis revealed flavobacteria species with distinct abundance patterns suggesting the presence of local blooms relative to the other species, as well as seasonally selected organisms. The analysis of genomic content for the Gulf of Maine Flavobacteria supports the hypothesis of a particle-associated lifestyle and specifically highlights a number of putative coding sequences that may play a role in the remineralization of particulate organic matter. And lastly, analysis of the underlying sequences for each assembled genome revealed seasonal and nonseasonal variants of specific genes implicating a dynamic interaction between individuals within the species.
Keywords: Metagenomics; Microbial ecology; Comparative genomics
The Flavobacteriaceae have been indicated as major constituents of microbial communities attached to detritus and planktonic organisms[1, 2]. Through this interaction, they are presumed to be important players in the “microbial loop”, a trophic model by which dissolved organic matter is consumed by microorganisms that are in turn consumed by other organisms, breaking down large organic molecules (e.g., proteins, chitin, and other polysaccharides) and making them accessible to other microbes. The microbial loop in the surface ocean is estimated to degrade and recycle about 50% of surface primary production. Within the microbial loop, individual species typically play specialized roles in polymer degradation; frequently, the enzymes they express target only specific compounds, which limits the range of substrates each species can use. Therefore, a change in the substrates within an environment influences microbial composition. Particle-associated microorganisms thrive in limited, ephemeral niches attached to marine particles, with the plankton possibly acting as a source from which individuals are drawn to compose future assemblages.
Metagenomics has previously been used to reconstruct microbial genomic potential and to predict microbial structure from diverse environments[8–12]. Many metagenomic studies involved the reconstruction of single organisms (or species-level populations), linking metabolic functions with phylogeny[8, 13, 14]. Expanding our understanding of the genomic content of species from the environment is crucial in determining the putative roles these organisms play within the ecosystem. However, extrapolating from 16S rRNA gene sequence similarity to putative functional roles is confounded by large genomic variations between related organisms, with as much as 60% dissimilarity in gene content[15, 16]. These variations in genomic content play a key role in determining the ecological niche that a given species can occupy and are likely the underpinning of microbial speciation events. To date, a limited number of environmental datasets[18–20] have been used to assess in situ population-level variations of microorganisms, combining comparative genomic analysis of related species that occupy the same spatial and temporal environments with analysis of the selection pressures acting on individual genes.
The Gulf of Maine (GoMA) is an oceanic province of high biological significance, with a high degree of primary productivity and historically important commercial fisheries (supplemental figures can be found in Additional file1: Figure S1). From an environmental metagenome from the GoMA, we were able to reconstruct four partial environmental genomes (phylogenetic bins) and explore the functional capacity of Flavobacteria present in the microbial community. As with other genomes from metagenomic source material, these partial genomes represent a composite of all the organisms with similar genotypes within the community. We are able to highlight genomic variations that could potentially allow each of these Flavobacteria to exploit specific microhabitats within the environment, resulting in non-overlapping niches. One possible mechanism by which organisms can alleviate competitive exclusion at the species or subspecies level is diversification of genomic content, which decreases competition for limited resources. These sorts of comparative genomic techniques have been predominantly performed on cultured isolates[23, 24], and while such analyses have been performed on environmental microbial metagenomic datasets, this kind of comparative genomic analysis is still underrepresented in metagenomic studies compared to studies analyzing total gene content.
Results and discussion
Sequencing, community structure, assembly, and binning assessment
Over 2.8 million Sanger sequences from samples collected in January (winter) and August (summer) of 2006 were analyzed (Additional file1: Figure S1; supplemental tables can be found in Additional file2: Table S1). In total, 2,238 16S rRNA gene fragments of at least 250 bp in length were identified, of which 2,213 could be classified using the RDP classifier (see supplemental materials and methods in Additional file3). The Bacteroidetes represented the third largest group in both summer and winter (15.0% and 15.6% of sequences, respectively), following the Alpha- and Gammaproteobacteria. From within the Bacteroidetes, 75 sequences were identified as members of the class Flavobacteria (3.4%). When viewed at the phylum and class level, most microbial groups showed similar abundances in both summer and winter (Additional file1: Figure S2). However, this relationship was not maintained as the taxonomic resolution increased. Many of the sub-phylum and sub-class populations showed some seasonal variation (see below).
Table: Summary of the identified phylogenetic bins used in this study. Columns: total size (bp); number of scaffolds; percentage (%) complete; percentage (%) duplication; approximate genome size (bp).
Spatial and temporal differentiation
Table: Data pertaining to the seasonality and spatial heterogeneity of sequences from each bin. Columns: percent of reads in the assemblies by season (footnote a); percent abundance of reads by season; percent abundance of reads by site (footnote b).
All four bins are present in both summer and winter. FlavA is significantly more abundant in the summer than FlavH or FlavI (t test, p < 0.002), and conversely, FlavI is significantly more abundant than FlavA or FlavH in the winter (t test, p < 0.002). FlavA and FlavG are composed of sequences primarily derived from the summer; FlavA is almost exclusively present at this time, while FlavG retains more of a presence during the winter. FlavI is made up of primarily winter sequences (abundances in the summer, FlavA > FlavG > FlavH > FlavI). Each of the four environmental genomes was a rare member of the total community, derived from organisms that collectively compose about 1% of the total microbial metagenome.
The degree to which Flavobacteriaceae share overlapping ecological niches, and the extent to which species specialize in substrate degradation or distinct microhabitat conditions, is unknown. Using the GoMA metagenomic data, we explored the possibility that the organisms represented in each bin may temporally overlap but are spatially separated. Previous work has shown that bacterioplankton communities can change over short distances (10 km) and show the largest difference in community structure at the 6-month temporal scale. Our data support both homogeneous mixing across a large spatial scale and spatial separation, depending on the population. In August, FlavH had near-identical abundance at two different sites (GoMA12 and GoMA14; 0.213% and 0.215%, respectively; p = 0.0234, 10,000 permutation t test using DAAG package in R), separated by 221 km. FlavA was far more abundant at site GoMA12 than at site GoMA13 (1.218% and 0.331%, respectively; p < 0.002, 10,000 permutation t test using DAAG package). Generally, for each season, the most abundant organism for that season is spatially heterogeneous, while the less abundant organisms are more homogeneous (Table 2). The heterogeneous nature of the high-abundance organisms may represent localized blooms related to the current conditions and available substrates, while the low-abundance organisms represent a background population adapted for different conditions and substrates. The data presented here cannot be used to directly explore this hypothesis, though they can be used to look at how organisms competing for limited resources may selectively occupy distinct niches by utilizing a different suite of genomic adaptations.
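The site-to-site comparisons above rely on a permutation test with 10,000 permutations. As a rough illustration of the idea only, the following Python sketch implements a generic two-sided permutation test on the difference in group means. It is not the DAAG routine used in the study, and the function name `permutation_test` and any abundance values passed to it are hypothetical.

```python
import random
from statistics import mean


def permutation_test(group_a, group_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in group means.

    Illustrative sketch only; the study used the DAAG package in R.
    """
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    extreme = 0
    for _ in range(n_permutations):
        # Randomly reassign the pooled values to the two groups.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            extreme += 1
    # The add-one correction keeps the estimate away from an impossible p = 0.
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical percent-abundance values for one bin at two sites.
site_1 = [1.20, 1.25, 1.18, 1.22, 1.30, 1.15]
site_2 = [0.33, 0.30, 0.35, 0.31, 0.36, 0.29]
p_value = permutation_test(site_1, site_2)
```

A small p value indicates that a difference as large as the observed one rarely arises when site labels are shuffled, which is the sense in which a bin is called spatially heterogeneous here.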
Phylogenetic bin functionality
Functionality is difficult to determine due to the incomplete nature of the phylogenetic bins. Parsing the bins for genes found in other Flavobacteriaceae illustrates the patchy nature of the metagenomic coverage; however, it does reveal some commonality of genes and functions amongst all four bins. For example, all bins possess genes involved in gliding motility. The bins possess either phosphoenolpyruvate (PEP) carboxylase (FlavH), malic enzyme (FlavG and FlavH), or both (FlavA), key genes that perform anaplerotic carbon fixation, an energy-intensive process that may be important for replenishing cellular intermediates in conjunction with proteorhodopsin activity (Additional file2: Table S4). FlavA does not include an identified proteorhodopsin gene, though the gene may simply lie in a sequencing gap. The other three bins have green-light-adapted proteorhodopsin (Additional file1: Figure S3B). FlavG, FlavH, and FlavI lack genes related to cobalamin (vitamin B12) metabolism, while FlavA possesses an outer membrane receptor and ATP:cob(I)alamin adenosyltransferase (PduO), which converts vitamin B12 into the coenzyme form. Genes involved with the processing of thiamin (vitamin B1) can be found in FlavA (thiamin-monophosphate kinase and thiamin biosynthesis lipoprotein, ApbE) and FlavG (three copies of ApbE), while none were found in FlavH and FlavI. FlavA, FlavH, and FlavI share superoxide dismutases that have a copper-zinc (Cu-Zn) coenzymatic active site, while FlavG lacks an identified superoxide dismutase. FlavA also possesses a second dismutase with a manganese coenzymatic site.
Further, the genomic content within the phylogenetic bins was searched for putative genes that would potentially confer unique ecophysiological phenotypes within the GoMA Flavobacteria. Only putative coding DNA sequences (CDSs) that lacked sequence similarity with CDSs in the available, fully sequenced Flavobacteriaceae (Additional file2: Table S5) and the GenBank nonredundant (NR) database were explored. Putative CDSs without functional annotation were not included in this analysis. While (conserved) hypothetical genes are likely to provide novel functional mechanisms, it is beyond the scope of this analysis to attempt to determine such functions. Additionally, putative proteins with divergent amino acid sequences were not assessed as it was difficult to determine what appropriate cutoff of sequence dissimilarity should be applied as slight variations in amino acid sequence can alter specificity and substrates, and conversely, high variation can have limited impacts on function.
Putative CDSs without sequence similarity to the other Flavobacteriaceae and the GenBank NR were identified for each phylogenetic bin, ranging from 16 in FlavI (2.1% of predicted CDSs) to 46 in FlavA (2.1% of predicted CDSs) (Additional file2: Table S2). Each identified gene was assessed for relevance for this analysis and putative CDSs with necessary cellular functions (e.g., SSU ribosomal protein S16P) were removed. Due to the incomplete nature of these bins, the following identified genes cannot be attributed to a single organism; however, collectively, these putative genes represent genomic content unique to Flavobacteria in the GoMA.
Novel genes with putative ecophysiological relevance
When the CDSs of FlavA were compared to the other Flavobacteria and the GenBank NR, a number of novel genes were identified that suggest that attachment to particles may be important. Three CDSs identified have possible roles in cell adhesion: a YapH protein (a large protein family that includes adhesins), a fat protein possibly involved in cell-cell attachment (UniProtKB ID. Q7UY44), a protein rich in Ca2+-binding sites, and a CDS with multiple VCBS domains (PF13517). Further, two CDSs were identified as containing TSP (thrombospondin) type-3 repeats, calcium-binding proteins commonly found on the outer membrane.
FlavG had three novel CDSs of interest with no similarity to the other Flavobacteria or the GenBank NR: a novel chitinase, an outer membrane lipoprotein that is part of the resistance-nodulation-cell division (RND) efflux system, and a calcium-binding protein of the repeats-in-toxin (RTX) family. All three of these genes could potentially encode cell surface proteins that interact with or degrade particulate organic matter. Chitinases must act on molecules much larger than can be transported into the cell. Though RTX proteins have a wide range of functions, one of the more common functions is to act as a surface layer protein, possibly as a peptidase or lipase. These genes suggest interaction with and degradation of particulate organic matter.
Both FlavH and FlavI are smaller bins and have a limited number of novel CDSs. For FlavH, one of the genes without a similar sequence is a lysine exporter (LysE) that has been shown to specifically remove excess lysine from the cellular environment. FlavI genes without sequence similarity in the GenBank NR include a Fe2+ siderophore transporter, suggesting diversity in iron transport systems, a gene of unknown function related to a cartilage oligomeric matrix protein, found in humans and some bacteria, and a glycoprotein that contains extracellular and calcium-binding domains.
Of particular interest were both peptidases and glycoside hydrolases (GHs). These gene groups have a large diversity of form and function with regard to degrading peptides and large carbohydrates, respectively, and can play key roles in differentiating metabolic potential for organisms and microbial communities.
Table: The number of peptidases and glycoside hydrolases identified within each bin. Columns: cell exterior peptidases (footnote a); total glycoside hydrolases; unique glycoside hydrolases (footnote b).
All of the bins, except for FlavI, have at least one peptidase without similarity to the Flavobacteria genomes or the GenBank NR database (Table 3). These five peptidases were either assigned an outer membrane-associated or extracellular destination (60%) or, if assigned an unknown destination, had a lipoprotein signal, suggesting that all of the unique peptidases interact with proteins in the environment. All of these peptidases belong to the clan MA (CL0126) in the Pfam database. This clan consists of 52 families of peptidases, all of which share a two Zn-dependent co-catalytic site within the motif HEXXH. One of the peptidases from FlavA could be assigned more specifically to family Peptidase M28 (PF04389), which preferentially releases basic amino acids from the N-terminal end of peptides.
Unique glycoside hydrolases
The number of identified GHs for the phylogenetic bins was similar to the numbers found in other marine flavobacterial genomes (18–58 GHs) (Table 3). Each bin, except FlavH, had at least one unique GH when compared to the other Flavobacteria genomes and the GenBank NR. Three of the unique GHs were derived from FlavA. The first unique GH contained a domain assigned to GH Family 88 (PF07470), annotated as a rhamnogalacturonide degradation protein, indicating it may degrade pectin polysaccharides. The second was assigned to the GH Family 18 (PF00704), which has a number of possible functions attributed to it, such as chitinases, acetylglucosaminidases, and xylanase inhibitors. The third was annotated as a short-chain dehydrogenase/reductase and was assigned as a GH with a bacterial neuraminidase repeat (BNR). FlavI also contained a unique CDS with putative BNR function. Neuraminidases specifically degrade neuraminic acids, commonly found in glycoproteins on cell surfaces. Activity of the putative neuraminidases would require the Flavobacteria populations to be attached to particles with glycoproteins present, further supporting the putative role of Flavobacteria attached to particles, either cellular or detrital in nature. FlavG contained a single unique GH that was annotated as an α-L-fucosidase, which cleaves deoxy hexose sugars on cell surfaces.
These genes represent those unique to the GoMA Flavobacteria in relation to the other sequenced Flavobacteriaceae and all other microbial genomes in the GenBank NR. Many of the identified putative functions and/or protein domains are indicative of interacting with the extracellular environment. These predicted gene functions support a hypothesis of marine Flavobacteria playing a role in the degradation of organic molecules and being associated with marine aggregates, such as detritus and planktonic protozoa.
Homologous region 1 (HR1) is 15 genes and ~21 kbp in length (Figure 2). Several of the genes appear to have an operon-like nature and pertain to amino acid biosynthesis and other essential cellular functions. Interspersed between the syntenic regions of HR1 are a number of across-bin variations. The largest variation is the inclusion of genes related to arginine biosynthesis in FlavA, FlavH, and FlavI. Arginine biosynthesis genes are orthologous and syntenic in both FlavA and FlavH; however, both of these bins contain an insertion (relative to FlavI) annotated as pyrroline-5-carboxylate reductase. When arginine biosynthesis genes are compared to homologs in the genomes of the Flavobacteriaceae used to identify unique Flavobacteria genes in the GoMA (Additional file2: Table S5), it can be seen that the whole pathway is syntenic for most of the organisms, with the exception of Kordia algicida OT-1, which, like FlavI, lacks pyrroline-5-carboxylate reductase. However, unlike the GoMA phylogenetic bins, the surrounding genomic architecture varies quite markedly, including in Flavobacteria sp. MS024-2A. This suggests that the arginine biosynthesis operon is conserved in members of the Flavobacteriaceae, though its relative location in each genome can vary, and that the lack of the operon in FlavG may be the result of the rearrangement of the entire operon.
Homologous region 2 (HR2) is far more variable in size, with 17 genes anchoring the syntenic region, and a number of gene differences in FlavA, FlavG, and FlavH that increase the length of the region from 31 kb in FlavI to 64 kb in FlavA (Figure 3). The conserved portion of HR2 indicates a role in cell division. The variable region appears to be amenable to gene gain/loss, and many of the genes present in each bin confer potential functions that would be beneficial to the proposed particle-associated lifestyle of marine Flavobacteriaceae, such as glycoside hydrolases and TonB-receptors. Interestingly, the genes within the variable region have limited overlap in putative function between the bins. This may indicate that this type of variable region is prone to gene gain/loss, potentially as a mechanism for incorporating horizontally transferred functions into the genome, and like other genomic islands, this region is adjacent to a tRNA, specifically tRNA-Arg. The horizontally transferred genes may confer metabolic specialization and diversification, similar to those genomic islands seen in Prochlorococcus.
In addition to evaluating differences in genomic content, we analyzed gene variations within the underlying population structure, using the individual reads that assembled into each putative CDS. The large assemblies and binning protocol allow for analysis of in vivo variations in putative CDS of an environmental population. Additionally, Sanger sequencing has lower error rates than 454 and Illumina sequencing platforms (San Diego, CA, USA), allowing for higher confidence in the accuracy of individual base pairs. Further, Sanger sequencing is not subject to amplification bias (though cloning biases do apply), allowing for a more accurate interpretation of the abundance of different gene variants. Due to the disparity between sequence depth (~2 Gbp) and the total genomic content of the sample (~2 Tbp, assuming 5 × 10⁶ cells · ml⁻¹ and 2 Mbp · genome⁻¹), it is possible to treat each sequence read within a scaffold as being derived from an individual microbial cell. As such, analysis of individual reads reveals a more accurate abundance of identified SNPs. Analysis of SNP variations has rarely been performed on metagenomic datasets because of the low coverage of less abundant organisms in complex environmental metagenomes and the lack of coherent phylogenetic groups on which to focus a concerted effort within a single population. As such, these analyses usually rely on genomes derived from cultured and sequenced closely related strains. As these problems are addressed, such analyses will be even more relevant for understanding variation in the environment, as the gene variations present within microbial populations may confer an unknown degree of niche differentiation. Many of the identified variant differences from this study are synonymous, but the nonsynonymous changes may have significant effects on protein function and activity.
Table: Breakdown of variant data for each bin. Columns: number of CDS with >7× coverage; number of CDS without SNPs; number of CDS with multiple variants; number of winter-only variants; number of summer-only variants; number of variants composed of both seasons.
The presence of multiple closely related individuals within each phylogenetic bin allows for a comparison of selection pressures using dN/dS values. dN/dS is calculated as the ratio of the number of nonsynonymous substitutions per nonsynonymous site (dN) to the number of synonymous substitutions per synonymous site (dS). Deviations of this ratio from 1 can be used as an indicator of selective pressure acting on a protein-coding gene. Further, dS can be used to approximate the time since divergence, if a constant mutation rate across all genes is assumed. This procedure is not commonly applied to environmentally derived genomes and datasets[18–20] but has the potential to improve our understanding of how selective pressures are manifested in the environment.
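To make the dN/dS definition concrete, the following Python sketch estimates dN, dS, and their ratio for two aligned, gap-free coding sequences. It uses a simplified counting approach in the spirit of Nei and Gojobori with a Jukes-Cantor correction, which is a stand-in for, not a reproduction of, the maximum likelihood estimates obtained with codeml in this study. The function names are hypothetical, changes to stop codons are simply counted as nonsynonymous, and mutational paths through stop codons are not excluded.

```python
from itertools import permutations, product
from math import log

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard codon assignments (shared by translation table 11).
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def synonymous_sites(codon):
    """Expected number of synonymous sites (0 to 3) in one codon."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODON_TABLE[mutant] == CODON_TABLE[codon]:
                    total += 1 / 3
    return total


def codon_differences(c1, c2):
    """Synonymous and nonsynonymous differences, averaged over all paths."""
    positions = [i for i in range(3) if c1[i] != c2[i]]
    if not positions:
        return 0.0, 0.0
    syn = non = 0.0
    paths = list(permutations(positions))
    for path in paths:
        current = c1
        for pos in path:
            nxt = current[:pos] + c2[pos] + current[pos + 1:]
            if CODON_TABLE[current] == CODON_TABLE[nxt]:
                syn += 1
            else:
                non += 1
            current = nxt
    return syn / len(paths), non / len(paths)


def jukes_cantor(p):
    """Correct a proportion of differences for multiple substitutions."""
    if p >= 0.75:
        raise ValueError("sequences too divergent for this correction")
    return -0.75 * log(1 - 4 * p / 3) if p > 0 else 0.0


def dn_ds(seq1, seq2):
    """Return (dN, dS, dN/dS); the ratio is None when dS is zero."""
    if len(seq1) != len(seq2) or len(seq1) % 3:
        raise ValueError("expected aligned in-frame sequences of equal length")
    syn_sites = syn_diffs = non_diffs = 0.0
    for i in range(0, len(seq1), 3):
        c1, c2 = seq1[i:i + 3], seq2[i:i + 3]
        syn_sites += (synonymous_sites(c1) + synonymous_sites(c2)) / 2
        s, n = codon_differences(c1, c2)
        syn_diffs += s
        non_diffs += n
    non_sites = len(seq1) - syn_sites
    if syn_sites == 0 or non_sites == 0:
        raise ValueError("no synonymous or nonsynonymous sites to compare")
    d_s = jukes_cantor(syn_diffs / syn_sites)
    d_n = jukes_cantor(non_diffs / non_sites)
    return d_n, d_s, (d_n / d_s if d_s > 0 else None)
```

Under this convention, a ratio well below 1 is usually read as purifying selection and a ratio above 1 as positive selection; with only a handful of differences per gene, as is typical for closely related variants, such estimates are noisy and should be interpreted cautiously.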
The analysis of the flavobacterial component of the GoMA metagenome has been used to understand how community and population diversity are separated both spatially and temporally in the planktonic environment. We were able to identify Flavobacteria populations that were abundant in the different seasons and to illustrate that the abundant populations exhibited a heterogeneous distribution between the sites compared to the homogeneous distribution of the less abundant organisms. Using comparative genomics, it is possible to identify potentially key differences in the gene content of these flavobacterial populations such as unique iron transporters (FlavI) and specialization in cell attachment (FlavA) and polymer exploitation (FlavG). Further, we were able to see similar gene content over large segments of syntenic regions for all four bins, including observations regarding the variable nature and gene content of the arginine biosynthesis operon and the identification of a genomic island that may play a role in the acquisition of new functions. We were able to start examining the selection pressures on specific genes and could see evidence of recombination between populations. The utility of metagenomics as a tool for microbial ecology will only increase as methods for analyzing large-scale genomic data are applied to both hypothesis-generating and hypothesis-testing scenarios, such as testing the functional variation of similar microbial populations; examining the presence, abundance, and extent of gene variations in the underlying populations; and exploring the impacts such variations may have on community function.
As previously described, three 200 L water samples were collected from 1.5-m depth from the GoMA in January and August of 2006 (Additional file2: Table S1), size-fractionated by sequential filtering, immediately frozen in liquid N2, and stored at -80°C. Sequencing of the metagenome has been described previously and follows established protocols. In brief, DNA was extracted from organisms collected in the 0.1–0.8 μm filter range, inserted into medium range insert vectors (3–10 Kbp), and sequenced via paired-end Sanger sequencing. A total of 2,827,702 reads were generated for six samples containing over 2,235 Mbp of sequence data (mean read length 1,008 bp). All sequences were subjected to quality screening using phred and an initial co-assembly of all the sequence data with the Celera assembler at the J. Craig Venter Institute following established protocols.
Single-copy, non-16S small subunit rRNA phylogenetic markers, as identified in Flavobacterium psychrophilum JIP02, were searched against all GoMA scaffolds using BLASTN (all values used in BLAST searches can be found in Additional file2: Table S7). All scaffolds were then grouped using an oligonucleotide frequency calculator. All scaffolds >3,000 bp in length had tri-, tetra-, penta-, and hexanucleotide frequencies determined. Scaffolds were clustered using a hierarchical clustering method. Scaffolds with phylogenetic markers were used to identify all clusters of scaffolds with a Pearson's correlation ≥0.90. Such clusters were considered for further analysis. Clusters with cumulative length at least as long as previous Flavobacteriales genomes (>1.9 Mbp) (n = 3) were individually re-assembled using the Celera assembler (see supplemental materials and methods for parameters in Additional file3) and, collectively, all scaffolds generated from these assemblies >5,000 bp were grouped using the oligonucleotide frequency calculator. Only clusters of scaffolds containing >10,000 reads were identified for further assessment. The reads comprising each cluster were re-assembled with the Celera assembler separately. If the estimated percent duplication of single-copy genes (see below) was high, the reads comprising each cluster were further sub-divided until the duplication estimate was <10%. The subdivided clusters were assembled separately using the Celera assembler. All assembled clusters remaining were determined to represent phylogenetic bins (i.e., partial environmental genomes). Putative phylogeny was determined by identifying the 16S SSU rRNA gene. If a bin had a 16S rRNA gene without similarity to the Flavobacteriaceae, it was excluded from further analysis.
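The binning step above depends on comparing oligonucleotide frequency profiles between scaffolds. The following Python sketch shows one way such a comparison could be computed for tetranucleotides, with a Pearson correlation between two profiles. The specific frequency calculator and hierarchical clustering software used in the study are not reproduced here; `kmer_frequencies` and `pearson` are hypothetical helper names, and reverse-complement handling and clustering are omitted.

```python
from itertools import product
from math import sqrt


def kmer_frequencies(sequence, k=4):
    """Relative frequency of every DNA k-mer, in a fixed lexicographic order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    sequence = sequence.upper()
    for i in range(len(sequence) - k + 1):
        kmer = sequence[i:i + k]
        if kmer in counts:  # skips windows containing N or other symbols
            counts[kmer] += 1
    total = sum(counts.values())
    return [counts[kmer] / total if total else 0.0 for kmer in kmers]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def same_cluster(scaffold_a, scaffold_b, threshold=0.90):
    """True when two scaffolds' tetranucleotide profiles correlate >= threshold."""
    r = pearson(kmer_frequencies(scaffold_a), kmer_frequencies(scaffold_b))
    return r >= threshold
```

The 0.90 default mirrors the correlation cutoff stated in the text; in practice, profiles are only informative for long scaffolds, which is consistent with the >3,000 bp length filter described above.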
Annotation and bin refinement
Phylogenetic bins were annotated using the RAST annotation server. The accuracy of the scaffolds within each bin was further refined using a process similar to. In brief, each putative CDS was assigned a phylogenetic identity based on a top hit using BLASTP to query the NCBI GenBank NR database. If the percent of CDS on a scaffold with a top hit to the phylum Bacteroidetes was <60%, the scaffold was removed from the bin. Duplicated scaffolds were identified using a BLASTP search of all putative CDS within a bin against the CDS of Flavobacteriales sp. ALC-1. If the smaller of the two scaffolds had ≥85% similar genes to a longer scaffold, the shorter scaffold was removed from the bin. All duplicated scaffolds were confirmed by progressiveMauve alignments.
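The scaffold-level taxonomic filter described above can be sketched in a few lines of Python. The example assumes that the phylum of each CDS's top BLASTP hit has already been extracted into a dictionary keyed by scaffold name; that input structure and the function name `filter_scaffolds` are hypothetical, while the 60% cutoff is the one stated in the text.

```python
def filter_scaffolds(top_hit_phyla, target="Bacteroidetes", min_fraction=0.60):
    """Keep scaffolds whose CDS top hits are mostly assigned to `target`.

    `top_hit_phyla` maps a scaffold name to a list holding the phylum of the
    top hit for each CDS on that scaffold (hypothetical input format).
    Returns a dict of retained scaffolds and their target fraction.
    """
    kept = {}
    for scaffold, phyla in top_hit_phyla.items():
        if not phyla:
            continue  # no annotated CDS, so nothing to support retention
        fraction = sum(phylum == target for phylum in phyla) / len(phyla)
        if fraction >= min_fraction:
            kept[scaffold] = fraction
    return kept


# Hypothetical example: one scaffold passes the filter, one is removed.
example = {
    "scaffold_1": ["Bacteroidetes"] * 4 + ["Proteobacteria"],
    "scaffold_2": ["Proteobacteria"] * 3 + ["Bacteroidetes"],
}
retained = filter_scaffolds(example)
```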
Estimation of bin completeness and duplication
Identification of catalytic domains
Catalytic domains for GHs and carbohydrate-binding modules (CBMs) identified from and peptidases were identified by comparing the MEROPS database of catalytic peptidase units against the PFAM V26.0 to identify the best PFAM model for each peptidase unit. These lists were used to generate libraries from PFAM V26.0. Each library was used to search the putative CDS from each bin using HMMER. The peptidases of each bin were compared via BLASTP against each other, the pangenome, and the GenBank NR. The predicted destinations for the peptidases were determined using PSORTb (v3.0.2), signal peptides determined by SignalP (v4.1), and lipoprotein signals determined by LipoP (v1.0).
Population analysis of variable sites
Using Geneious V5.6.2 (Biomatters; http://www.geneious.com), the underlying sequences for each scaffold were trimmed and mapped to the respective scaffolds (see supplemental materials and methods for settings in Additional file3). SNPs were identified and counted for sites with ≥4× read coverage, where at least two of the reads contained a mutation at that site. Alignments were performed using CLUSTAL W and checked manually for bases with low quality scores (cutoff = 30). Low quality base pairs without supporting reads were corrected to reflect the consensus base at these sites. Putative CDS of interest were identified with coverage >7×. Using the convention of 8× coverage as the cutoff used to demarcate draft level genomes using Sanger sequences, a 7× cutoff was selected to increase the number of analyzed CDS, while preventing the analysis of CDS without suitable coverage for the analysis. Each CDS was manually checked for different variants based on SNP patterns. For CDS with more than two variants, pairwise calculations were made using the variant with the most supporting reads as the reference. If two variants had the same number of supporting reads, the reference was determined to be the sequence with the highest nucleotide similarity to the consensus. Amino acid alignments were performed using CLUSTAL W. PAL2NAL (V14) was used to align the corresponding nucleotide sequences. Codon table 11 (bacterial, archaeal, and plant plastid) was used for translations. The “Remove gaps, inframe stop codons” setting was turned on. dS, dN, and dN/dS values were computed in PAL2NAL using the codeml program within PAML. The dS values are a sufficient indicator of time since divergence, with lower values suggesting less time since divergence.
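The SNP criterion above (at least 4× coverage and at least two reads carrying a mutation) can be illustrated with a short Python sketch. The study performed this step in Geneious with manual checking; the function below is only an approximation that operates on reads already aligned to the scaffold as equal-offset strings, where any character other than A, C, G, or T marks an uncovered position. It also assumes one interpretation of the rule, namely that the same non-consensus base must be seen in at least two reads. The name `call_snps` and the input format are hypothetical.

```python
from collections import Counter


def call_snps(aligned_reads, min_coverage=4, min_variant_reads=2):
    """Return (position, consensus_base, variant_counts) for candidate SNPs."""
    snps = []
    length = max(len(read) for read in aligned_reads)
    for pos in range(length):
        # Collect the base each read shows at this column, if it covers it.
        bases = [read[pos] for read in aligned_reads
                 if pos < len(read) and read[pos] in "ACGT"]
        if len(bases) < min_coverage:
            continue
        counts = Counter(bases)
        consensus = counts.most_common(1)[0][0]
        # Keep non-consensus bases supported by enough independent reads.
        variants = {base: n for base, n in counts.items()
                    if base != consensus and n >= min_variant_reads}
        if variants:
            snps.append((pos, consensus, variants))
    return snps


# Hypothetical pileup of five reads: position 2 has two reads with C.
reads = ["ACGTA", "ACGTA", "ACCTA", "ACCTA", "ACGTT"]
candidates = call_snps(reads)
```

In this toy pileup, the single T at the last position is ignored because it is supported by only one read, which is the kind of singleton difference the two-read requirement is meant to exclude.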
Bin functionality variation
Putative CDS from a single bin were compared via BLASTP to the other bins, the pangenome, and the GenBank NR to identify unique genes. By concentrating on the CDSs without similarity in other Flavobacteriaceae or the GenBank NR protein database, it is possible to highlight genes that may have an impact on ecophysiology. The list of putative genes of interest was reduced by removing CDS with duplicate annotations and was then manually curated for genes with potential ecophysiological impacts. progressiveMauve was used to compare the scaffolds of the four phylogenetic bins and identify large regions of synteny and homology.
Phylogenetic and proteorhodopsin marker trees
Putative proteorhodopsin nucleotide sequences were identified for each bin and aligned with CLUSTAL W to sequences obtained from GenBank. The list of phylogenetic markers presented in Santos and Ochman was used to identify putative marker sequences in each bin. Marker genes present in at least three of the four bins were selected (DNA primase (DnaG), ATP-dependent DNA helicase (RecG), DNA polymerase III, alpha subunit (DnaE), translation elongation factor (EF-G) (FusA), and DNA-directed RNA polymerase subunit beta’ (RpoC)) and aligned with CLUSTAL W to sequences obtained from IMG Genomes. It should be noted that RpoC is not as robust a phylogenetic marker as the corresponding subunit beta (RpoB). Alignments were trimmed, concatenated in the case of the DnaG-FusA, DnaG-DnaE-RecG, and DnaG-RpoB trees, and a maximum likelihood tree was constructed using PHYML with 1,000 bootstraps (see supplemental materials and methods in Additional file 3).
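Marker selection and alignment concatenation of the kind described above can be sketched in Python. This is a simplified illustration, not the procedure used in the study: the function names and dictionary layout are hypothetical, and padding a taxon that lacks a marker with gap characters is an assumption about how missing data might be handled.

```python
def select_markers(presence, min_bins=3):
    """Return markers found in at least min_bins bins.

    presence maps a marker name to the set of bins that contain it.
    """
    return sorted(m for m, bins in presence.items() if len(bins) >= min_bins)


def concatenate(alignments, marker_order):
    """Concatenate per-marker alignments into one sequence per taxon.

    alignments maps marker -> {taxon: aligned sequence}. Taxa missing a
    marker are padded with gaps (an assumption made for this sketch).
    """
    taxa = sorted({t for m in marker_order for t in alignments[m]})
    result = {t: "" for t in taxa}
    for marker in marker_order:
        aln = alignments[marker]
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError(f"{marker}: sequences are not aligned to one length")
        width = lengths.pop()
        for taxon in taxa:
            result[taxon] += aln.get(taxon, "-" * width)
    return result
```

The concatenated sequences could then be written to a file in whatever format the tree-building program expects.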
Raw reads can be found in the NCBI GenBank Trace Archive (TA) ID No. 2307942905–2310786347. Scaffolds of putative Flavobacteria will be deposited after review and acceptance.
GoMA: Gulf of Maine
CDS: Coding DNA sequence
CSCG: Conserved single-copy gene
BNR: Bacterial neuraminidase repeat
This work was supported by the National Science Foundation Microbial Sequencing Grant 0412119. The authors gratefully acknowledge NOAA ecosystem process division scientists Jon Hare and Jerry Prezario for ship time on NOAA Fisheries R/V Delaware II (Cruise No. DE 06–02) and R/V Albatross IV (Cruise No. AL 06–07). We thank Dr. Shannon Williamson for collecting the samples. We thank Robert Friedman and Yu-Hui Rogers for technical and scientific support in the sequencing efforts. We thank Matt Lewis and Dr. Aaron Halpern, who processed the GoMA samples. We would also like to thank Dr. Jason Sylvan for reviewing the manuscript before submission and offering constructive suggestions.
- Grossart H-P, Levold F, Allgaier M, Simon M, Brinkhoff T: Marine diatom species harbour distinct bacterial communities. Environ Microbiol. 2005, 7: 860-873. 10.1111/j.1462-2920.2005.00759.x.
- Kirchman DL: The ecology of Cytophaga-Flavobacteria in aquatic environments. FEMS Microbiol Ecol. 2002, 39: 91-100.
- Cottrell MT, Kirchman DL: Natural assemblages of marine proteobacteria and members of the Cytophaga-Flavobacter cluster consuming low- and high-molecular-weight dissolved organic matter. Appl Environ Microbiol. 2000, 66: 1692-1697. 10.1128/AEM.66.4.1692-1697.2000.
- Fuhrman JA, Azam F: Thymidine incorporation as a measure of heterotrophic bacterioplankton production in marine surface waters: evaluation and field results. Mar Biol. 1982, 66: 109-120. 10.1007/BF00397184.
- Martinez J, Smith DC, Steward GF, Azam F: Variability in ectohydrolytic enzyme activities of pelagic marine bacteria and its significance for substrate processing in the sea. Aquat Microb Ecol. 1996, 10: 223-230.
- Teeling H, Fuchs BM, Becher D, Klockow C, Gardebrecht A, Bennke CM, Kassabgy M, Huang S, Mann AJ, Waldmann J, Weber M, Klindworth A, Otto A, Lange J, Bernhardt J, Reinsch C, Hecker M, Peplies J, Bockelmann FD, Callies U, Gerdts G, Wichels A, Wiltshire KH, Glöckner FO, Schweder T, Amann R: Substrate-controlled succession of marine bacterioplankton populations induced by a phytoplankton bloom. Science. 2012, 336: 608-611. 10.1126/science.1218344.
- Heidelberg JF, Heidelberg KB, Colwell RR: Bacteria of the gamma-subclass Proteobacteria associated with zooplankton in Chesapeake Bay. Appl Environ Microbiol. 2002, 68: 5498-5507. 10.1128/AEM.68.11.5498-5507.2002.
- Baker BJ, Tyson GW, Webb RI, Flanagan J, Hugenholtz P, Allen EE, Banfield JF: Lineages of acidophilic archaea revealed by community genomic analysis. Science. 2006, 314: 1933-1935. 10.1126/science.1132690.
- DeLong EF, Preston CM, Mincer T, Rich V, Hallam SJ, Frigaard N-U, Martinez A, Sullivan MB, Edwards R, Brito BR, Chisholm SW, Karl DM: Community genomics among stratified microbial assemblages in the ocean’s interior. Science. 2006, 311: 496-503. 10.1126/science.1120250.
- Engel P, Martinson VG, Moran NA: Functional diversity within the simple gut microbiota of the honey bee. Proc Natl Acad Sci. 2012, 109: 11002-11007. 10.1073/pnas.1202970109.
- Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu D, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers Y-H, Smith HO: Environmental genome shotgun sequencing of the Sargasso Sea. Science. 2004, 304: 66-74. 10.1126/science.1093857.
- Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004, 428: 37-43. 10.1038/nature02340.
- Tripp HJ, Bench SR, Turk KA, Foster RA, Desany BA, Niazi F, Affourtit JP, Zehr JP: Metabolic streamlining in an open-ocean nitrogen-fixing cyanobacterium. Nature. 2010, 464: 90-94. 10.1038/nature08786.
- Tully BJ, Nelson WC, Heidelberg JF: Metagenomic analysis of a complex marine planktonic thaumarchaeal community from the Gulf of Maine. Environ Microbiol. 2011, 14: 254-267.
- Coleman ML, Sullivan MB, Martiny AC, Steglich C, Barry K, DeLong EF, Chisholm SW: Genomic islands and the ecology and evolution of Prochlorococcus. Science. 2006, 311: 1768-1770. 10.1126/science.1122050.
- Tenaillon O, Skurnik D, Picard B, Denamur E: The population genetics of commensal Escherichia coli. Nat Rev Micro. 2010, 8: 207-217. 10.1038/nrmicro2298.
- Coleman ML, Chisholm SW: Ecosystem-specific selection pressures revealed through comparative population genomics. Proc Natl Acad Sci. 2010, 107: 18634-18639. 10.1073/pnas.1009480107.
- Konstantinidis KT, Braff J, Karl DM, DeLong EF: Comparative metagenomic analysis of a microbial community residing at a depth of 4,000 meters at Station ALOHA in the North Pacific Subtropical Gyre. Appl Environ Microbiol. 2009, 75: 5345-5355. 10.1128/AEM.00473-09.
- Simmons SL, DiBartolo G, Denef VJ, Goltsman DSA, Thelen MP, Banfield JF: Population genomic analysis of strain variation in Leptospirillum group II bacteria involved in acid mine drainage formation. Plos Biol. 2008, 6: e177. 10.1371/journal.pbio.0060177.
- Denef VJ, Banfield JF: In situ evolutionary rate measurements show ecological success of recently emerged bacterial hybrids. Science. 2012, 336: 462-466. 10.1126/science.1218389.
- Townsend DW, Thomas AC, Mayer LM, Thomas MA: Oceanography of the northwest Atlantic continental shelf (1, W). The Sea: The Global Coastal Ocean. Edited by: Robinson AR, Brink KH. 2006, Cambridge: Harvard University Press, 119-168.
- Podell S, Ugalde JA, Narasingarao P, Banfield JF, Heidelberg KB, Allen EE: Assembly-driven community genomics of a hypersaline microbial ecosystem. PLoS ONE. 2013, 8: e61692. 10.1371/journal.pone.0061692.
- Kettler GC, Martiny AC, Huang K, Zucker J, Coleman ML, Rodrigue S, Chen F, Lapidus A, Ferriera S, Johnson J, Steglich C, Church GM, Richardson P, Chisholm SW: Patterns and implications of gene gain and loss in the evolution of Prochlorococcus. PLoS Genet. 2007, 3: e231. 10.1371/journal.pgen.0030231.
- Tuanyok A, Leadem BR, Auerbach RK, Beckstrom-Sternberg SM, Beckstrom-Sternberg JS, Mayo M, Wuthiekanun V, Brettin TS, Nierman WC, Peacock SJ, Currie BJ, Wagner DM, Keim P: Genomic islands from five strains of Burkholderia pseudomallei. BMC Genomics. 2008, 9: 566. 10.1186/1471-2164-9-566.
- Teeling H, Meyerdierks A, Bauer M, Amann R, Glöckner FO: Application of tetranucleotide frequencies for the assignment of genomic fragments. Environ Microbiol. 2004, 6: 938-947. 10.1111/j.1462-2920.2004.00624.x.
- Woyke T, Xie G, Copeland A, González JM, Han C, Kiss H, Saw JH, Senin P, Yang C, Chatterji S, Cheng J-F, Eisen JA, Sieracki ME, Stepanauskas R: Assembling the marine metagenome, one cell at a time. PLoS ONE. 2009, 4: e5299. 10.1371/journal.pone.0005299.
- Iglewicz B, Hoaglin D: How to detect and handle outliers. ASQC Basic References in Quality Control: Statistical Techniques. Volume 16. Edited by: Mykytka EF. 1993, Ottawa: ASQC Quality Press, 1-87.
- Hewson I, Steele JA, Capone DG, Fuhrman JA: Temporal and spatial scales of variation in bacterioplankton assemblages of oligotrophic surface waters. Mar Ecol Prog Ser. 2006, 311: 67-77.
- Fuhrman JA, Hewson I, Schwalbach MS, Steele JA, Brown MV, Naeem S: Annually reoccurring bacterial communities are predictable from ocean conditions. Proc Natl Acad Sci USA. 2006, 103: 13104-13109. 10.1073/pnas.0602399103.
- González JM, Pinhassi J, Fernández-Gómez B, Coll-Lladó M, González-Velázquez M, Puigbò P, Jaenicke S, Gómez-Consarnau L, Fernández-Guerra A, Goesmann A, Pedrós-Alió C: Genomics of the proteorhodopsin-containing marine flavobacterium Dokdonia sp. strain MED134. Appl Environ Microbiol. 2011, 77: 8676-8686. 10.1128/AEM.06152-11.
- Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li K, Kravitz S, Heidelberg JF, Utterback T, Rogers Y-H, Falcón LI, Souza V, Bonilla-Rosso G, Eguiarte LE, Karl DM, Sathyendranath S: The Sorcerer II global ocean sampling expedition: Northwest Atlantic through eastern tropical Pacific. Plos Biol. 2007, 5: e77. 10.1371/journal.pbio.0050077.
- Johnson CLV, Pechonick E, Park SD, Havemann GD, Leal NA, Bobik TA: Functional genomic, biochemical, and genetic characterization of the Salmonella pduO gene, an ATP:Cob(I)alamin adenosyltransferase gene. J Bacteriol. 2001, 183: 1577-1584. 10.1128/JB.183.5.1577-1584.2001.
- Ng PC, Henikoff S: Predicting the effects of amino acid substitutions on protein function. Annu Rev Genom Human Genet. 2006, 7: 61-80. 10.1146/annurev.genom.7.080505.115630.
- Kvansakul M, Adams JC, Hohenester E: Structure of a thrombospondin C-terminal fragment reveals a novel calcium core in the type 3 repeats. EMBO J. 2004, 23: 1223-1233. 10.1038/sj.emboj.7600166.
- Cohen-Kupiec R, Chet I: The molecular biology of chitin digestion. Curr Opin Biotechnol. 1998, 9: 270-277. 10.1016/S0958-1669(98)80058-X.
- Linhartová I, Bumba L, Mašín J, Basler M, Osička R, Kamanová J, Procházková K, Adkins I, Hejnová-Holubová J, Sadílková L, Morová J, Sebo P: RTX proteins: a highly diverse family secreted by a common mechanism. FEMS Microbiol Rev. 2010, 34: 6-
- Vrljic M, Sahm H, Eggeling L: A new type of transporter with a new type of cellular function: L-lysine export from Corynebacterium glutamicum. Mol Microbiol. 1996, 22: 815-826. 10.1046/j.1365-2958.1996.01527.x.
- Paulsson M, Heinegard D: Purification and structural characterization of a cartilage matrix protein. Biochem J. 1981, 197: 367-375.
- Chevrier B, DOrchymont H, Schalk C, Tarnus C, Moras D: The structure of the Aeromonas proteolytica aminopeptidase complexed with a hydroxamate inhibitor - involvement in catalysis of Glu151 and two zinc ions of the co-catalytic unit. Eur J Biochem. 1996, 237: 393-398. 10.1111/j.1432-1033.1996.0393k.x.
- Cantarel BL, Coutinho PM, Rancurel C, Bernard T, Lombard V, Henrissat B: The Carbohydrate-Active EnZymes database (CAZy): an expert resource for glycogenomics. Nucleic Acids Res. 2009, 37: D233-D238. 10.1093/nar/gkn663.
- Gaskell A, Crennell S, Taylor G: The three domains of a bacterial sialidase: a beta-propeller, an immunoglobulin module and a galactose-binding jelly-roll. Structure. 1995, 3: 1197-1205. 10.1016/S0969-2126(01)00255-6.
- Rocha EPC, Smith JM, Hurst LD, Holden MTG, Cooper JE, Smith NH, Feil EJ: Comparisons of dN/dS are time dependent for closely related bacterial genomes. J Theor Biol. 2006, 239: 226-235. 10.1016/j.jtbi.2005.08.037.
- Ewing B, Green P: Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 1998, 8: 186-194.
- Santos SR, Ochman H: Identification and phylogenetic sorting of bacterial lineages with universally conserved genes and proteins. Environ Microbiol. 2004, 6: 754-759. 10.1111/j.1462-2920.2004.00617.x.
- Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O, Vonstein V, Wilke A, Zagnitko O: The RAST Server: Rapid Annotations using Subsystems Technology. BMC Genomics. 2008, 9: 75. 10.1186/1471-2164-9-75.
- Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Rapp BA, Wheeler DL: GenBank. Nucleic Acids Res. 2000, 28: 15-18. 10.1093/nar/28.1.15.
- Darling AE, Mau B, Perna NT: progressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS ONE. 2010, 5: e11147. 10.1371/journal.pone.0011147.
- Finn RD, Clements J, Eddy SR: HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011, 39 (suppl): W29-W37. 10.1093/nar/gkr367.
- Haft DH, Selengut JD, White O: The TIGRFAMs database of protein families. Nucleic Acids Res. 2003, 31: 371-373. 10.1093/nar/gkg128.
- Rawlings ND, Barrett AJ, Bateman A: MEROPS: the database of proteolytic enzymes, their substrates and inhibitors. Nucleic Acids Res. 2011, 40: D343-D350.
- Yu NY, Wagner JR, Laird MR, Melli G, Rey S, Lo R, Dao P, Sahinalp SC, Ester M, Foster LJ, Brinkman FSL: PSORTb 3.0: improved protein subcellular localization prediction with refined localization subcategories and predictive capabilities for all prokaryotes. Bioinformatics. 2010, 26: 1608-1615. 10.1093/bioinformatics/btq249.
- Petersen TN, Brunak S, Heijne Von G, Nielsen H: SignalP 4.0: discriminating signal peptides from transmembrane regions. Nat Meth. 2011, 8: 785-786. 10.1038/nmeth.1701.
- Juncker AS, Willenbrock H, Heijne Von G, Brunak S, Nielsen H, Krogh A: Prediction of lipoprotein signal peptides in Gram-negative bacteria. Protein Sci. 2003, 12: 1652-1662. 10.1110/ps.0303703.
- Thompson JD, Higgins DG, Gibson TJ: Clustal-W - improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994, 22: 4673-4680. 10.1093/nar/22.22.4673.
- Suyama M, Torrents D, Bork P: PAL2NAL: robust conversion of protein sequence alignments into the corresponding codon alignments. Nucleic Acids Res. 2006, 34: W609-W612. 10.1093/nar/gkl315.
- Yang Z: PAML 4: phylogenetic analysis by maximum likelihood. Mol Biol Evol. 2007, 24: 1586-1591. 10.1093/molbev/msm088.
- Markowitz VM, Korzeniewski F, Palaniappan K, Szeto E, Werner G, Padki A, Zhao X, Dubchak I, Hugenholtz P, Anderson I, Lykidis A, Mavromatis K, Ivanova N, Kyrpides NC: The integrated microbial genomes (IMG) system. Nucleic Acids Res. 2006, 34 (Database issue): D344-D348.
- Guindon S, Lethiec F, Duroux P, Gascuel O: PHYML Online—a web server for fast maximum likelihood-based phylogenetic inference. Nucleic Acids Res. 2005, 33: W557-W559. 10.1093/nar/gki352.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.