Groundwater samples were collected from the Ohio Department of Natural Resources (ODNR) Observation Well Network used to monitor ground water level fluctuations in three different counties (Fig. 1a). These three wells are located within separate buried valley aquifers, valleys that have been back filled with glacial sands and gravels and some till. The observation wells were sampled on a quarterly basis over a 2-year period from July 2014 to July 2016. Additionally, one private drinking water well, located in a sand and gravel aquifer within a thick till sequence in western Licking County (Fig. 1a), was sampled once in June 2016.
Wells were purged of more than 250 L of water to ensure that aquifer-derived water was being sampled (dedicated pumps were placed at the screened interval for the ODNR wells). Approximately 38 L of post-purge groundwater was pumped sequentially through a 0.2-μm then 0.1-μm Supor PES Membrane Filter (Pall Corporation, NY, USA). The filters were then immediately flash frozen and kept on dry ice until they were returned to the Ohio State University where they were stored at −80 °C until DNA extraction. During sampling, oxidation-reduction potential (ORP), temperature, and pH were measured using a handheld Myron Ultrameter II (Myron L Company, CA, USA). General observational data for each of these groundwater wells is provided in Additional file 1: Table S1.
DNA extraction, sequencing, and processing
DNA was extracted from roughly a quarter of each Supor PES membrane filter by using the PowerSoil DNA Isolation Kit (Mo Bio Laboratories, Inc., Carlsbad, CA, USA). Final DNA concentrations were determined by using a Qubit fluorometer (Invitrogen, Carlsbad, CA, USA).
To generate 16S rRNA gene data, the V4 region of 16S rRNA genes was amplified and sequenced by using the universal bacterial/archaeal primer set 515F/806R on an Illumina MiSeq instrument at Argonne National Laboratory. The resulting reads were processed through the QIIME pipeline (V1.7.0) and clustered into operational taxonomic unit (OTU) classifications at 97% similarities. Taxonomies were assigned using the SILVA database as well as a CPR 16S rRNA gene database curated by Brown et al. .
Metagenomic data for nine samples (0.2-μm filters from July 2014, October 2014, and April 2016 for Greene and Athens, October 2014 for Licking, July 2016 for the Licking Private Well and then a 0.1-μm filter from October 2014 for Greene) was collected by shotgun sequencing on an Illumina HiSeq 2500 at the Genomics Shared Resource at the Ohio State University. Raw reads were trimmed and filtered based upon read quality using sickle pe with default parameters . Resulting reads were subsequently assembled into larger contigs and then scaffolds using idba_ud with default parameters .
The data generated during 16S rRNA gene sequencings can be obtained from NCBI using accession number SRX2896383. The 71 genomes and annotated protein sequences are publicly hosted at http://chimera.asc.ohio-state.edu/Danczak_Genomes_and_Protein_sequences.html.
16S rRNA gene sequencing analysis
In order to obtain results specific to this radiation, non-CPR taxonomies were first removed from the OTU table. A stacked bar chart was generated in R v3.3.2  using ggplot2 to visualize differential relative abundances (ggplot, ggplot2 package v2.2.1) . Beta diversity was calculated using Bray-Curtis dissimilarity  and subsequently plotted using non-metric multidimensional scaling (NMDS) to observe CPR community differences (metaMDS, vegan package v2.4-2) . Species scores were then plotted as loading arrows in order to understand which members were responsible for the observed separation.
Metagenomic binning and annotation
In order to determine average coverage across a scaffold, trimmed reads were mapped to assembled scaffolds using bowtie2 (bowtie2 --fast) . Using the read-mapping information, assembled scaffolds ≥ 2500 bp were binned using MetaBAT (metabat --superspecific) . Genome completeness in each bin was determined by analyzing the presence of 31 conserved bacterial genes using AMPHORA2 . Given that these conserved genes are considered single copy, genomes with multiple copies were considered to be contaminated (or “misbinned”). Bins with a genome completion > 50% and a misbin < 10% were used in downstream analyses. The open reading frames (ORFs) for the nine entire metagenomes and resulting bins were predicted using MetaProdigal  and subsequently annotated by comparing predicted ORFs to the KEGG, UniRef90, and InterProScan databases [27,28,29,30,31] using USEARCH to scan for single and reverse best hit (RBH) results . The KEGG Automatic Annotation Server (KAAS)  was also used to visualize pathway completeness.
The presence of different glycoside hydrolase (GH) families was determined in both the above bins and 2155 previously studied genomes obtained from ggkbase [4, 6] by using the dbCAN hidden Markov model (HMM) (hmmsearch --cpu 4 --tblout result.hres --noali -E 0.00001 -o result_hmm.txt dbCAN-fam-HMMs.hmm input.faa) [34,35,36]. Using the hmmsearch output, GH family gene counts were obtained for each genome by selecting the HMM results with the lowest e-values. Differences in GH profiles for both bins and genomes were visualized in R using both a presence/absence heatmap (ggplot, ggplot2 package v2.2.1)  and a redundancy analysis (rda, vegan package v2.4-2)  with bins from this study collapsed into their putative phylogenies. Given that putative nirK function in Parcubacteria is newly proposed, a MUSCLE alignment  of nirK sequences from this study, from previously obtained Parcubacteria, and from other bacteria was generated using default parameters to observe the presence/absence of conserved residues . The nirK sequences not from this study were obtained from NCBI and represent diverse taxonomies capable of nitrite reduction. These alignments were then trimmed in Geneious 9.1.5  to remove unaligned ends and portions with > 95% gaps. A RAxML tree with 100 bootstraps was then generated to determine evolutionary relatedness and functional potential (raxmlHPC-PTHREADS –f a –m PROTCAT –x 12345 –p 12345 –N 100 –T 20 –s input –n output)  and visualized in R using ggtree (ggtree, ggtree package v1.6.9) .
Metagenomic abundance calculations
The rough microbial diversity within the nine metagenomic samples was determined using an approach modified from Anantharaman et al. . First, the ribosomal protein S3 (rps3 or rpsC) was pulled from each sample using an HMM derived from AMPHORA2 . Ribosomal protein S3 sequences for each metagenome were subsequently clustered together at 99% using USEARCH . Reads were then mapped to the resulting rps3 clusters using Bowtie2, and coverage was subsequently determined as a proxy for abundance. Taxonomic assignments for each rps3 cluster were obtained by comparing them against a database of rps3 proteins derived from Hug et al.  using blastp with an e-value cutoff of 10−15.
Phylogenetic analysis for bins
Given the inconsistent presence of various phylogenetic marker genes in different bins, numerous genes were analyzed to approximate phylogeny. The 16S rRNA gene was extracted from bins using SSU-Align with default parameters  and then aligned to two separate 16S rRNA gene databases [4, 6] using SINA on the SILVA website [43, 44]. The sequence identity to all genes within the SILVA NR99 database was also determined for potential phyla proposition . Unaligned ends and regions with 95% or greater gaps were then trimmed using Geneious 9.1.5  with a RAxML tree (100 bootstraps) subsequently generated (raxmlHPC-PTHREADS –f a –m GTRGAMMA –x 12345 –p 12345 –N 100 –T 20 –s input –n output)  and visualized using the ggtree package in R (ggtree, ggtree package v1.6.9) .
Phylogeny was further identified by building rps3 and gyrA trees following the protocols outlined above for nirK (i.e., pulled sequences using HMMs, aligned in MUSCLE and generated a tree using RAxML). For high-resolution phylogeny of 45 bins, a concatenated ribosomal tree was generated from 16 ribosomal proteins (rpL2, 3, 4, 5, 6, 14, 15, 16, 18, 22, 24, rps3, 8, 10, 17, 19). First, each ribosomal protein sequence was pulled from the bins using the appropriate HMM either derived from AMPHORA2 or TIGRFAM release 15.0 in hmmsearch [25, 34, 45]. Alignments were then performed independently for each sequence using MUSCLE with default parameters . Completed alignments were concatenated using Geneious 9.1.5 and trimmed to remove unaligned ends and portions with 95% gaps or greater . RAxML was then used in order to generate a phylogenetic tree with 100 bootstraps with the same command given above, again visualized in R with ggtree (ggtree, ggtree package v1.6.9) [40, 41].
New phyla were proposed based upon three separate parameters. Firstly, if the 16S rRNA gene sequence was less than 75% similar to previously described sequences and at least two new genomes formed a monophyletic clade within the concatenated ribosomal tree, a new phylum was proposed [6, 46]. However, given high phylogenetic variability within the CPR, if a very obvious monophyletic clade was observed on the concatenated tree, up to roughly 80% similarity was tolerated. Following naming conventions for the CPR, proposed names for the phyla were based upon researchers who have made significant contributions to our understanding of the diversity and physiology of CPR microorganisms .
All alignments generated during the phylogenetic analysis (i.e., rps3, gyrA, 16S rRNA, and concatenated ribosomal proteins) can be found in the supplemental information (Additional files 2, 3, 4, and 5).