Microbial phylogenetic profiling with the Pacific Biosciences sequencing platform
© Fichot and Norman; licensee BioMed Central Ltd. 2013
Received: 29 November 2012
Accepted: 21 January 2013
Published: 4 March 2013
High-throughput sequencing of 16S rRNA gene amplicons has revolutionized the capacity and depth of microbial community profiling. Several sequencing platforms are available, but most phylogenetic studies are performed on the 454-pyrosequencing platform because its longer reads can give finer phylogenetic resolution. The Pacific Biosciences (PacBio) sequencing platform is significantly less expensive per run, does not rely on amplification for library generation, and generates reads that are, on average, four times longer than those from 454 (C2 chemistry), but the resulting high error rates appear to preclude its use in phylogenetic profiling. Recently, however, the PacBio platform was used to characterize four electrosynthetic microbiomes to the genus-level for less than USD 1,000 through the use of PacBio’s circular consensus sequence technology. Here, we describe in greater detail: 1) the output from successful 16S rRNA gene amplicon profiling with PacBio, 2) how the analysis was contingent upon several alterations to standard bioinformatic quality control workflows, and 3) the advantages and disadvantages of using the PacBio platform for community profiling.
KeywordsPacBio Pacific Biosciences Amplicon 16S rRNA Circular consensus sequence (ccs) Microbial diversity
Bioinformatic workflows used to preprocess raw sequences before phylogenetic analysis have quality control features designed to reduce sequencing or PCR errors in the dataset. For example, workflows remove reads that contain: an ambiguous base call, an average quality score below a threshold, multiple mismatches to a primer/barcode sequence, fewer than a specified number of bases, or chimeras. As this workflow was not wholly sufficient for use with PacBio-generated data output, the necessary alterations are discussed further in the text, all of which can be executed with open-source software such as mothur or QIIME.
Unlike sequences originating from 454, the user cannot control which template strand (+/−) is sequenced in PacBio. Therefore, strand orientation must be recognized and unified if alignments or OTU-based analyses are to be performed.
Retrieving quality scores
The overall quality score (Phred + 33) can be retrieved from the fastq output with the aid of scripts that translate the ASCII code into a Phred score. In addition to scripts available in mothur and QIIME, freely available Perl scripts, such as fq_all2std.pl, can be used with the std2qual command to retrieve Phred scores and sequences from PacBio fastq output files.
In 454 sequences, a base call receiving a quality score of 0 is assigned the ambiguous base designation ‘N’. However, in PacBio sequences, a base call receiving a quality score of 0 is assigned an actual nucleotide. Therefore, culling reads based on ambiguous base calls does not remove the intended reads. Instead, scripts need to search the quality score file in order to remove these reads.
Stretches of low quality
Quality decreases toward the end of 454-generated reads, but because of the nature of generating a ccs read from a long read with randomly distributed errors, PacBio ccs reads are not only the full amplicon length, but the read quality does not positionally decrease (Figure 2). Instead, regions of lower quality in PacBio sequences appeared to be homopolymer associated, but not all homopolymers had low quality (Figure 2). Removing sequences based on the read’s average quality score was not the most rigorous or appropriate way to remove ‘bad’ ccs sequences because the abundant, heretofore unseen high quality scores (≤93) of PacBio reads mask regions of lower quality. Using a rolling window approach to reduce errors (removing reads when the average quality score over a window of specified bases drops below a threshold) was used in Marshall et al. with a window spanning twice the size of the average homopolymer. As seen in Figure 2, large window sizes include regions of extraordinarily high quality scores, thus masking low quality regions. Using small windows would remove sequences with legitimate homopolymers (Figure 2, see positions 364 to 388).
Depending on the diversity in the biological system being analyzed and the researcher’s resources/requirements (funds, phylogenetic resolution, coverage, number of samples, etc.), one sequencing platform may be more appropriate than another for phylogenetic profiling[1, 12–15]. Illumina provides higher coverage than 454 or PacBio, but PacBio and 454 are advantageous when a higher phylogenetic resolution is needed (longer sequence reads). PacBio is also advantageous for labs with less resource because it could enable less expensive routes to data exploration without sacrificing phylogenetic resolution. In addition, PacBio’s relatively low cost per run may benefit studies that require only a few samples to be sequenced, where the cost per sample on other platforms can be prohibitive. Currently, several US academic institutions offer PacBio sequencing services and charge approximately USD 350 to USD 440 for library preparation and USD 200 to USD 400 for each cell used in sequencing. For the same price, neither Illumina’s GAIIx/HiSeq2000 nor 454 GS FLX is available, but several benchtop instruments provide sequencing services for run costs equivalent to PacBio.
In terms of systemic error, each platform also has advantages and disadvantages. Illumina and 454 have low error rates, but the errors are positional (they increase distally, with guanine-cytosine (GC) content, or with homopolymers)[8, 9, 15]. In contrast, PacBio has high error rates, but through the use of ccs reads and because errors are randomly distributed, the error rates are greatly reduced.
While no study has documented a head-to-head or in silico comparison of community amplicons sequenced with PacBio ccs and another platform such as Illumina or 454, there is evidence that PacBio may not add extensive platform-based bias to community profiles. In sequencing three microbial genomes containing either 19%, 50% or 69% GC content with PacBio, Carneiro et al. found that the read coverage was relatively unaffected by GC content, with Quail et al. finding similar results. Conversely, Ion Torrent, Illumina, and 454 all have a noticeable GC bias[15–17], but Aird et al. attribute this to bias introduced in PCR amplification during the library preparation step. Unlike Illumina and 454, the library preparation for PacBio does not include an amplification step, which avoids this as a potential sequencer-based error source. On the other hand, one known bias of the PacBio platform is the preferential loading of shorter sequences into zero-mode waveguides (ZMWs, essentially ‘wells’), thus biasing the resulting community toward members having shorter sequences; but if amplicons are used, this bias is minimized. A comparison between platforms to determine PacBio-specific bias in community profiling is a necessary next step.
Overall, the PacBio sequencing platform was sufficient for phylogenetic profiling of electrosynthetic microbiomes to the genus level with taxonomic binning. The low read quality typical of PacBio was overcome by using circular consensus sequences (ccs). In addition, quality control workflows were adjusted for PacBio-specific issues, the most notable of which was the formation of ‘PacBio chimeras,’ features that are a potential artifact of PacBio library preparation but are not detected with UCHIME. Just as with every sequencing platform, future advances by PacBio in technology and chemistry will enable longer (hence more accurate and numerous) reads, while further understanding of PacBio biases will enable more accurate data for phylogenetic profiling.
EF is research specialist and RN is a molecular microbial ecologist and assistant professor in the Environmental Health Sciences Department of the University of South Carolina.
American standard code for information interchange
circular consensus sequences
operational taxonomic unit
single-molecule real time
United States dollar
The authors would like to thank WJ Jones at EnGenCore, LLC, for helpful discussions. They would also like to thank CW Marshall, DE Ross, and HD May who coauthored the manuscript upon which this commentary is based. Funding was provided by the US Department of Energy, Advanced Research Project Agency, Energy award #DE-AR0000089 to RN.
- Loman NJ, Constantinidou C, Chan JZM, Halachev M, Sergeant M, Penn CW, Robinson ER, Pallen MJ: High-throughput bacterial genome sequencing: an embarrassment of choice, a world of opportunity. Nat Rev Microbiol. 2012, 10 (9): 599-606. 10.1038/nrmicro2850.View ArticlePubMedGoogle Scholar
- Werner JJ, Zhou D, Caporaso JG, Knight R, Angenent LT: Comparison of Illumina paired-end and single-direction sequencing for microbial 16S rRNA gene amplicon surveys. ISME J. 2012, 6 (7): 1273-1276. 10.1038/ismej.2011.186.PubMed CentralView ArticlePubMedGoogle Scholar
- Koren S, Schatz MC, Walenz BP, Martin J, Howard JT, Ganapathy G, Wang Z, Rasko DA, McCombie WR, Jarvis ED: Hybrid error correction and de novo assembly of single-molecule sequencing reads. Nat Biotechnol. 2012, 30 (7): 692-700.Google Scholar
- Marshall CW, Ross DE, Fichot EB, Norman RS, May HD: Electrosynthesis of commodity chemicals by an autotrophic microbial community. Appl Environ Microbiol. 2012, 78 (23): 8412-8420. 10.1128/AEM.02401-12.PubMed CentralView ArticlePubMedGoogle Scholar
- Jumpstart Consortium Human Microbiome Project Data Generation Working Group: Evaluation of 16S rDNA-based community profiling for human microbiome research. PLoS One. 2012, 7 (6): e39315-10.1371/journal.pone.0039315.PubMed CentralView ArticleGoogle Scholar
- Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ: Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009, 75 (23): 7537-7541. 10.1128/AEM.01541-09.PubMed CentralView ArticlePubMedGoogle Scholar
- Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Pena AG, Goodrich JK, Gordon JI: QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010, 7 (5): 335-336. 10.1038/nmeth.f.303.PubMed CentralView ArticlePubMedGoogle Scholar
- Schloss PD, Gevers D, Westcott SL: Reducing the effects of PCR amplification and sequencing artifacts on 16S rRNA-based studies. PLoS One. 2011, 6 (12): e27310-10.1371/journal.pone.0027310.PubMed CentralView ArticlePubMedGoogle Scholar
- Carneiro M, Russ C, Ross M, Gabriel S, Nusbaum C, DePristo M: Pacific Biosciences sequencing technology for genotyping and variation discovery in human data. BMC Genomics. 2012, 13 (1): 375-10.1186/1471-2164-13-375.PubMed CentralView ArticlePubMedGoogle Scholar
- von Wintzingerode F, Gobel UB, Stackebrandt E: Determination of microbial diversity in environmental samples: pitfalls of PCR-based rRNA analysis. FEMS Microbiol Rev. 1997, 21 (3): 213-229. 10.1111/j.1574-6976.1997.tb00351.x.View ArticlePubMedGoogle Scholar
- Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R: UCHIME improves sensitivity and speed of chimera detection. Bioinformatics. 2011, 27 (16): 2194-2200. 10.1093/bioinformatics/btr381.PubMed CentralView ArticlePubMedGoogle Scholar
- Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, Pallen MJ: Performance comparison of benchtop high-throughput sequencing platforms. Nat Biotechnol. 2012, 30: 434-439. 10.1038/nbt.2198.View ArticlePubMedGoogle Scholar
- Liu L, Li Y, Li S, Hu N, He Y, Pong R, Lin D, Lu L, Law M: Comparison of next-generation sequencing systems. J Biomed Biotechnol. 2012, 2012: 11-Google Scholar
- Glenn TC: Field guide to next-generation DNA sequencers. Mol Ecol Resour. 2011, 11 (5): 759-769. 10.1111/j.1755-0998.2011.03024.x.View ArticlePubMedGoogle Scholar
- Quail MA, Smith M, Coupland P, Otto TD, Harris SR, Connor TR, Bertoni A, Swerdlow HP, Gu Y: A tale of three next generation sequencing platforms: comparison of Ion Torrent, Pacific Biosciences and Illumina MiSeq sequencers. BMC Genomics. 2012, 13: 341-10.1186/1471-2164-13-341.PubMed CentralView ArticlePubMedGoogle Scholar
- Dohm JC, Lottaz C, Borodina T, Himmelbauer H: Substantial biases in ultra-short read data sets from high-throughput DNA sequencing. Nucleic Acids Res. 2008, 36 (16): e105-10.1093/nar/gkn425.PubMed CentralView ArticlePubMedGoogle Scholar
- Pinto AJ, Raskin L: PCR biases distort bacterial and archaeal community structure in pyrosequencing datasets. PLoS One. 2012, 7 (8): e43093-10.1371/journal.pone.0043093.PubMed CentralView ArticlePubMedGoogle Scholar
- Aird D, Ross MG, Chen W-S, Danielsson M, Fennell T, Russ C, Jaffe DB, Nusbaum C, Gnirke A: Analyzing and minimizing PCR amplification bias in Illumina sequencing libraries. Genome Biol. 2011, 12 (2): R18-10.1186/gb-2011-12-2-r18.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.