Deriving accurate microbiota profiles from human samples with low bacterial content through post-sequencing processing of Illumina MiSeq data
- Jake Jervis-Bardy1, 2, 3,
- Lex E X Leong3,
- Shashikanth Marri2,
- Renee J Smith3, 4,
- Jocelyn M Choo3,
- Heidi C Smith-Vaughan1,
- Elizabeth Nosworthy1,
- Peter S Morris1,
- Stephen O’Leary5,
- Geraint B Rogers†2, 3 and
- Robyn L Marsh†1
© Jervis-Bardy et al.; licensee BioMed Central. 2015
Received: 22 December 2014
Accepted: 3 April 2015
Published: 5 May 2015
The rapid expansion of 16S rRNA gene sequencing in challenging clinical contexts has resulted in a growing body of literature of variable quality. To a large extent, this is due to a failure to address spurious signal that is characteristic of samples with low levels of bacteria and high levels of non-bacterial DNA. We have developed a workflow based on the paired-end read Illumina MiSeq-based approach, which enables significant improvement in data quality, post-sequencing. We demonstrate the efficacy of this methodology through its application to paediatric upper-respiratory samples from several anatomical sites.
A workflow for processing sequence data was developed based on commonly available tools. Data generated from different sample types showed a marked variation in levels of non-bacterial signal and ‘contaminant’ bacterial reads. Significant differences in the ability of reference databases to accurately assign identity to operational taxonomic units (OTU) were observed. Three OTU-picking strategies were trialled as follows: de novo, open-reference and closed-reference, with open-reference performing substantially better. Relative abundance of OTUs identified as potential reagent contamination showed a strong inverse correlation with amplicon concentration allowing their objective removal. The removal of the spurious signal showed the greatest improvement in sample types typically containing low levels of bacteria and high levels of human DNA. A substantial impact of pre-filtering data and spurious signal removal was demonstrated by principal coordinate and co-occurrence analysis. For example, analysis of taxon co-occurrence in adenoid swab and middle ear fluid samples indicated that failure to remove the spurious signal resulted in the inclusion of six out of eleven bacterial genera that accounted for 80% of similarity between the sample types.
The application of the presented workflow to a set of challenging clinical samples demonstrates its utility in removing the spurious signal from the dataset, allowing clinical insight to be derived from what would otherwise be highly misleading output. While other approaches could potentially achieve similar improvements, the methodology employed here represents an accessible means to exclude the signal from contamination and other artefacts.
Keywords: 16S rRNA, Respiratory, MiSeq, Contamination, Pair-end reads, QIIME, Otitis media
The development of high-throughput, low-cost sequencing has greatly expanded the ability of researchers to investigate complex bacterial systems associated with the human body. In particular, 16S rRNA gene amplicon sequencing has been used widely, most commonly in the characterisation of samples from ‘high biomass’ sites such as the gastrointestinal tract. Samples from such contexts are comparable in richness and complexity to some of the environmental microbial systems for which high-throughput sequencing technology was pioneered, allowing the technology to be applied with relatively minor modifications. However, amplicon-sequencing approaches are also being applied increasingly to anatomical [1,2] and environmental sites that contain very low levels of bacteria, such as the distal airways in the absence of an infection [1,2]. Here, a number of factors can have a major impact on the data generated. These include a reduction in PCR amplification efficiency due to high levels of human nucleic acids and low levels of bacterial 16S rRNA gene copies that result in increased sampling bias. Of particular concern is the contribution of low levels of signal from non-bacterial DNA and bacterial DNA present as reagent contamination, which would not substantially affect sequencing data in samples containing high concentrations of bacterial template. Such a spurious signal can substantially distort community profiles from samples with low bacterial load [4-6]. When 16S rRNA gene sequencing is applied to such contexts without due consideration of these factors, it can give rise to conclusions that are potentially misleading [4,7]. The impact of failing to perform necessary data processing steps to render data accurate and clinically informative can be particularly problematic in studies that rely on commercial-sequencing providers.
The absence of these steps, which are not standard for commercial-sequencing firms, has resulted in an increasing body of literature whose quality is highly variable. Nevertheless, some well-conducted studies have attempted to address these issues, providing clinically robust conclusions.
Our aim was to develop a methodology that allows non-specialist researchers to derive accurate and clinically informative data from Illumina MiSeq-based paired-end 16S rRNA gene profiles generated from challenging respiratory contexts. Rather than distinguishing between a genuine signal and a spurious signal based on subjective ‘balance of probability’ assessments, our approach is based on defined parameters that can be applied objectively and uniformly. Further, it was our intention that, wherever possible, this methodology would be based on commonly available software, with a minimal requirement for specialist bioinformatic expertise. By applying our methodology to a collection of nasopharyngeal (NP) swabs, adenoid biopsies, adenoid swabs and middle ear fluid (MEF) samples from indigenous Australian children with otitis media with effusion (OME, the presence of middle ear fluid behind an intact tympanic membrane without signs or symptoms of infection), we illustrate the beneficial impact of this workflow on bacterial community data from a challenging clinical context.
Results and discussion
The total number of sequences successfully assembled from paired-end reads across the sample set was 2,094,672. Following quality filtering, truncation and chimera removal, a total number of 1,706,072 sequences advanced to operational taxonomic unit (OTU) picking and taxonomy assignment.
Reference database selection for OTU picking and taxonomy assignment
Selection of OTU picking and taxonomic assignment strategy
We investigated three OTU picking approaches available in Quantitative Insights Into Microbial Ecology (QIIME) as follows: de novo, open-reference and closed-reference OTU picking (Figure 1, Step 2). Classical de novo OTU clustering and taxonomic assignment resulted in 108,099 individual OTUs clustered at 97% similarity, with the majority of these OTUs ‘unclassified’ according to the representative set of sequences. Using the sequence classification tool Kraken v0.10.5, 61.27% of all reads in the dataset were found to be non-bacterial sequences aligned to the human genome GRCh38 (Additional file 1: Figure S2). The de novo strategy was unable to eliminate these reads as a non-bacterial signal and instead reported them as unclassified. Using open-reference OTU picking, the percentage of unclassified reads decreased to 2.5%, with only 2,096 OTUs identified. The open-reference OTU picking method successfully eliminated the non-bacterial human signal associated with the de novo strategy while retaining unclassified bacterial reads. For example, retained unclassified bacterial reads such as Alloiococcus were >60% similar (the pre-filter cutoff) and <95% similar (the taxonomic assignment cutoff) to the SILVA database. Closed-reference OTU picking performed poorly, as a number of high relative abundance OTUs were unclassified and eliminated from the final-output OTU table. In closed-reference OTU picking, all reads <97% similar to the SILVA database were discarded, meaning taxa such as Alloiococcus were unclassified and eliminated from the dataset, despite the open-reference method returning a cumulative relative abundance of 42.7% for Alloiococcus in the MEF samples. Accordingly, the open-reference method for OTU picking was employed for all further analyses.
Impact of pre-filtering on adenoid sample types
Identification and removal of contaminants based on OTU distribution relative to biomass
The relative abundance of reagent contaminants was significantly higher in the adenoid biopsies (median = 46%, IQR 37%) compared to the adenoid swabs (median = 18%, IQR 21%, P = 0.032, Mann–Whitney U test) (Figure 2B), consistent with our observation that the biopsies had reduced bacterial amplification efficiency. In addition, MEF samples also had a high relative abundance of reagent contaminants (median = 26%, IQR 24%). The high relative abundance of reagent contaminants in the MEF samples was also in the context of low biomass, with only eight out of twenty-two MEF samples successfully amplifying above the limit of detection in the total bacterial load qPCR (bacterial load in these samples ranged from 1.7 × 10⁴ to 9.6 × 10⁴ copies ml⁻¹ of MEF). By comparison, NP swabs had a contamination median relative abundance of only 0.2% (IQR 1.1%). Not surprisingly, this was in the context of the highest observed bacterial loads, with ten out of eleven swabs amplifying successfully (bacterial load in these swabs ranged from 2.7 × 10⁴ to 5.9 × 10⁶ copies ml⁻¹ of swab) and a mean of 22,431 non-contaminant bacterial sequences (SD 9,276). Nasal cavity and NP swabs are the mainstay of upper-respiratory microbiome studies and were less subject to the effect of non-bacterial DNA and reagent contaminants that we have observed in the other sample types (MEF and adenoid biopsy samples). The NP swabs therefore provided us with a baseline to assess how other lower biomass sample types behave using identical laboratory and bioinformatic methods.
It is important that the removal of contaminant taxa is performed at the OTU level. A number of genera known to be common colonisers of the upper-respiratory tract (Stenotrophomonas and Pseudomonas) have been identified as common reagent contaminants. Analysis of the distribution of these taxa at a genus level, where multiple OTUs are included in one correlation plot, could be misleading if some OTUs within that genus are spurious and others are not. Further, if researchers are concerned that there may be a mixture of genuine and contaminating reads within a single OTU clustered at 97% similarity, further analysis within that OTU could be performed. Differentiation might, for example, be achieved by clustering OTUs at a similarity of 100% and plotting each of the resulting OTUs against amplicon concentration. Ultimately, such removal of contaminants based on the relationship between OTU distribution and amplicon concentration may be fully automatable.
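The OTU-level screen described above can be illustrated with a small sketch. It computes Spearman's rank correlation between each OTU's relative abundance and the amplicon concentration of the corresponding libraries, and flags OTUs showing a strong inverse relationship. The function names, the example values and the `rho_cutoff` of -0.7 are assumptions made for illustration only; the study assessed significance with Spearman's correlation and does not state a numeric cutoff, and no P value is calculated here.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho, computed as the Pearson correlation of the ranks."""
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant series carries no rank information
    return cov / (var_x * var_y) ** 0.5


def flag_candidate_contaminants(otu_abundance, amplicon_conc, rho_cutoff=-0.7):
    """Return {otu: rho} for OTUs whose abundance falls as concentration rises.

    rho_cutoff is a hypothetical threshold chosen for this example.
    """
    flagged = {}
    for otu, abundance in otu_abundance.items():
        rho = spearman_rho(abundance, amplicon_conc)
        if rho <= rho_cutoff:
            flagged[otu] = rho
    return flagged


# Hypothetical data: one value per library, in the same sample order.
amplicon_conc = [0.5, 1.0, 2.0, 4.0, 8.0]
otu_abundance = {
    "OTU_A": [40.0, 30.0, 12.0, 5.0, 1.0],    # falls as concentration rises
    "OTU_B": [10.0, 35.0, 20.0, 42.0, 30.0],  # no inverse relationship
}
flagged = flag_candidate_contaminants(otu_abundance, amplicon_conc)
```

Because each OTU is tested separately, a genus containing both genuine and spurious OTUs is not averaged into a single, potentially misleading correlation.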
Impact of potential reagent contamination on measures of microbiota similarity
Currently, there is considerable interest in understanding the relationships between complex airway microbiota, host physiology and the development and progression of disease. In many cases, the low cost of high-throughput amplicon sequencing provides an opportunity for researchers and clinicians to perform such studies. However, while these technologies are commonly accessible, their application alone is not sufficient to provide informative data, particularly where airway samples contain low levels of microbes and/or high levels of human DNA. Careful bioinformatic processing is required to minimise the substantial impact of the spurious signal and to avoid basing clinical interpretation on potentially misleading output.
We describe a workflow for the removal of the spurious signal that can be applied using commonly available tools and without the need for highly specialised bioinformatics expertise. This approach could be easily applied to non-human studies and is not necessarily specific to 16S rRNA studies, as metagenomic approaches encounter similar issues. To assess the efficacy of this approach, we applied it to the processing of microbiota data from human respiratory samples using largely standard protocols for DNA extraction, library preparation and sequencing. Specifically, we analysed sample types that contained different levels of bacterial and non-bacterial DNA. While our results suggest that for some sample types, such as NP swabs, reasonably high quality data can be obtained without the need for stringent processing, the overwhelming artefact signal in the low bacterial biomass samples (MEF, adenoid swabs and adenoid biopsies) means that comparisons of microbial communities across all sample types are not possible in its absence. Further, parallel analysis of bacterial community distributions, with and without the removal of this signal, clearly indicated that failure to apply this methodology would have resulted in data that were misleading, pointing as it did to greater communication between anatomical sites than was indeed the case.
It is important to note that the level of the spurious signal within the sequencing data will be influenced by a wide range of factors and may differ substantially with sampling strategies, anatomical sites and even between replicate samples. Further, the fact that reagent contamination can be almost impossible to exclude entirely and the importance of using PCR primers that are able to amplify sequences from as great a proportion of bacterial species as possible, mean that the likelihood of preventing non-specific amplification is very low. As such, the development of an error-free method for generating 16S rRNA sequence data from clinical samples would seem highly unlikely. Methods that are able to remove spurious signal post-sequencing are therefore important protocol adjuncts. While we do not suggest that the methodology described here is definitive or that other approaches could not achieve similar results, we feel that our workflow provides a means for bench-top biologists with minimal bioinformatics experience to process data from challenging clinical contexts.
Samples were collected as part of the National Health and Medical Research Council-funded (Grant 1007641) randomised controlled trial of surgical interventions for OME (Australia and New Zealand Clinical Trials Registration 12611001073998). Samples were collected at baseline from children undergoing surgery at the Alice Springs Hospital during May and June 2014. Ethical approval was obtained in the Northern Territory through the Central Australian Human Research Ethics Committee (HOMER 12 to 16) and Menzies School of Health Research Ethics Committee (2011 to 1686).
Surgical procedures and sample collection
The clinical samples included 11 nasopharyngeal (NP) swabs, 22 middle ear fluid samples (MEFs) in saline (left and right ear of each child) and 11 adenoid biopsies. Swabs of the exterior of the adenoid biopsies were taken prior to DNA extraction. Full details of the surgical procedures and sample collection protocols are provided in Additional file 1.
DNA extraction and estimation of total bacterial load
Total DNA was extracted from all clinical samples and two DNA extraction reagent negative controls. Full details of the DNA extraction protocols are provided in Additional file 1. Total bacterial load was determined as described previously and was used to assess template concentrations for 16S rRNA amplicon sequencing. Full details of the qPCR protocol are provided in Additional file 1. A multi-template control (MTC) consisting of thirteen species in known relative abundance was used in the assessment of OTU picking and taxonomic assignment protocols. Full details of this MTC are given in Additional file 1: Table S1.
16S rRNA gene amplicon library preparation and sequencing
Amplicons were generated using fusion degenerate primers 27F (5’-TCGTCGGCAGCGTCAGATGTGTATAAGAGACAGAGRGTTTGATCMTGGCTCAG-3’) and 519R (5’-GTCTCGTGGGCTCGGAGATGTGTATAAGAGACAGGTNTTACNGCGGCKGCTG-3’) with ligated overhang Illumina adapter consensus sequences in italic text. Full details of the library preparation and sequencing protocol are provided in Additional file 1. In brief, the initial PCR reactions were performed on a Veriti 96-well Thermal Cycler (Life Technologies, Australia). The PCR reactions were performed with the following programme: initial enzyme activation at 95°C for 3 min, followed by 25 cycles consisting of denaturation at 95°C for 30 sec, annealing at 55°C for 30 sec and extension at 72°C for 30 sec. After 25 cycles, the reaction was completed with a final extension of 7 min at 72°C.
The Illumina Nextera XT Index kit (Illumina Inc., San Diego, CA, USA) with dual 8-base indices was used to allow for multiplexing. Two unique indices located on either end of the amplicon were chosen based on the Nextera dual-indexing strategy. To incorporate the indices into the 16S amplicons, PCR reactions were performed on a Veriti 96-well Thermal Cycler (Life Technologies, Australia). Cycling conditions consisted of one cycle of 95°C for 3 min, followed by eight cycles of 95°C for 30 sec, 55°C for 30 sec and 72°C for 30 sec, followed by a final extension cycle of 72°C for 5 min.
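The degenerate positions in the primers (R, M, N and K) follow the standard IUPAC nucleotide codes, so each primer represents a pool of oligonucleotides. The sketch below expands the 16S-specific portion of each primer into its concrete variants. The split between overhang adapter and 16S-specific sequence is an assumption based on reading the sequences above (the adapter is taken to end at the first ...GAGACAG), since the italic formatting that marks the adapter is not preserved here; the function and variable names are illustrative.

```python
from itertools import product

# Standard IUPAC nucleotide ambiguity codes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_degenerate(primer):
    """Return every unambiguous sequence encoded by a degenerate primer."""
    return ["".join(bases) for bases in product(*(IUPAC[b] for b in primer))]


# Assumed 16S-specific portions (the text after the adapter in each primer).
PRIMER_27F = "AGRGTTTGATCMTGGCTCAG"
PRIMER_519R = "GTNTTACNGCGGCKGCTG"

variants_27f = expand_degenerate(PRIMER_27F)    # R and M: 2 x 2 variants
variants_519r = expand_degenerate(PRIMER_519R)  # N, N and K: 4 x 4 x 2 variants
```

Greater degeneracy broadens the range of template sequences a primer pool can match, which is relevant to the later point that broad-range primers make non-specific amplification difficult to prevent.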
Prior to library pooling, the barcoded libraries were quantified using the Qubit dsDNA HS Assay Kit (Life Technologies, Carlsbad, CA, USA). Results from this quantification step (amplicon concentration) were used in downstream processing to eliminate contamination (Figure 1, Step 3). The libraries were sequenced by 2 × 300 bp paired-end sequencing on the MiSeq platform using MiSeq v3 Reagent Kit (Illumina) at the Flinders Genomics Facility, Adelaide, Australia. All sequence data generated have been submitted to the Sequence Read Archive.
An overview of the bioinformatic workflow used is shown in Figure 1. FastQC v.11.2 (Babraham Bioinformatics, Babraham Institute, UK) was used to analyse the average quality scores of each sample before and after pairing reads. The Paired-End reAd mergeR (PEAR) v.0.9.5 was used to pair the forward and reverse reads of sequences in each sample and discard all sequences less than 450 bp and/or with a Phred score <33. Kraken v0.10.5 was used to classify sequences against pre-built databases of viral and bacterial sequences and the human genome (GRCh38). The pre-built databases (MiniKraken) were downloaded from the Kraken website (https://ccb.jhu.edu/software/kraken/ accessed 03032015), and query sequences were classified using Kraken’s default parameters.
Samples were then demultiplexed using QIIME v.1.8.0, with individual sequences assigned to their original samples. The demultiplex step contained further quality filtering steps as follows: truncation following three consecutive low quality base calls, removal of reads with <75% high quality base calls and removal of sequences with an unclear base call (N). Chimeras were filtered with a reference-based approach using UCHIME v.4.2 and a representative set of chimera-checked sequences (Greengenes v.13.8).
For OTU picking, we used QIIME as opposed to other popular 16S data analysis pipelines such as mothur. We found the clustering mechanism employed by mothur version 1.34.3 unsuitable for processing paired-end read sequence data. In brief, the mothur MiSeq standard operating procedure relies on an OTU-clustering mechanism (nearest, furthest or average neighbour clustering) that generates a distance matrix, optimised when large data sets are condensed into a small number of identical sequences using the unique.seqs command. MiSeq sequencing generates paired-end reads with the potential for errors in the paired region of the sequences, as previously described. Consequently, when applied to large Illumina data sets, the unique.seqs command may be unable to condense data to an appropriate size for the distance matrix, resulting in excessive wall time during clustering. QIIME provides various OTU-clustering approaches (including those employed by mothur), some of which do not require generation of a large distance matrix (for example, UCLUST).
Traditional de novo OTU picking, closed-reference and open-reference OTU picking were performed in QIIME. In de novo OTU picking, all reads were clustered based upon 97% similarity to each other, irrespective of similarity to known 16S rRNA sequences. Taxonomy of de novo OTUs was assigned at 95% similarity to a representative set (rep-set) of 16S rRNA sequences in the SILVA database (release 111, July 2013). In closed-reference OTU picking, all reads were clustered based upon 97% similarity to a reference sequence in the rep-set, with all unassigned sequences discarded.
In open-reference OTU picking, all sequences were initially pre-filtered to discard sequences not meeting a threshold of 60% similarity to the rep-set. A closed-reference OTU-picking step was then performed, where all reads were clustered based upon 97% similarity to the rep-set. Reads failing to meet the 97% similarity threshold were then clustered de novo (described above). OTU maps created for the closed-reference and de novo steps were then merged to create a combined OTU map. A representative set of sequences was created from the combined OTU map and taxonomy was assigned as described above. In the de novo, closed- and open-reference approaches, UCLUST v.1.2.22 was used to cluster OTUs at 97% similarity. Analysis of a multi-template control was used to assess suitability of 16S rRNA reference databases for the taxonomic assignment. Taxonomic assignment for all three OTU picking methods was performed at 95% similarity to the rep-set using UCLUST.
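The three read-level filters listed above can be sketched for a single read as follows. This is an illustration of the stated rules rather than QIIME's own implementation. The `min_phred` value that defines a "low quality" base call is a hypothetical parameter (the threshold is not given in the text), and the 75% rule is interpreted here as the truncated read retaining at least 75% of its original length, which is also an assumption.

```python
def filter_read(seq, quals, min_phred=20, max_bad_run=3, min_fraction=0.75):
    """Apply truncation, length-fraction and ambiguous-base filters to one read.

    seq   -- nucleotide string
    quals -- list of integer Phred scores, one per base
    Returns the (possibly truncated) (seq, quals) pair, or None if rejected.
    """
    original_length = len(seq)
    if original_length == 0:
        return None

    # Truncate at the start of the first run of `max_bad_run` low-quality calls.
    cut = original_length
    bad_run = 0
    for i, q in enumerate(quals):
        if q < min_phred:
            bad_run += 1
            if bad_run == max_bad_run:
                cut = i - max_bad_run + 1
                break
        else:
            bad_run = 0
    seq, quals = seq[:cut], quals[:cut]

    # Reject reads that lost too much of their length to truncation.
    if len(seq) < min_fraction * original_length:
        return None

    # Reject reads that still contain an ambiguous base call.
    if "N" in seq.upper():
        return None

    return seq, quals


# Hypothetical 12-base reads.
kept = filter_read("ACGTACGTACGT", [30] * 9 + [2, 2, 2])
too_short = filter_read("ACGTACGTACGT", [30] * 8 + [2, 2, 2, 30])
ambiguous = filter_read("ACGNACGT", [30] * 8)
```

In the first example the read is truncated to nine bases and kept, in the second truncation leaves only eight of twelve bases so the read is rejected, and the third is rejected because of the N.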
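The way the three similarity thresholds interact can be summarised in a short sketch. It assumes that the best percent identity of a read (or of an OTU representative sequence) to the reference rep-set has already been computed by a search tool; the function names and the example identities are hypothetical, and the sketch covers only the routing logic, not the clustering itself.

```python
PREFILTER_PCT = 60.0   # open-reference pre-filter threshold
CLUSTER_PCT = 97.0     # OTU clustering threshold
TAXONOMY_PCT = 95.0    # taxonomic assignment threshold


def open_reference_fate(best_identity):
    """Route a read through the open-reference steps described above."""
    if best_identity < PREFILTER_PCT:
        return "discarded"        # fails the pre-filter (e.g. non-bacterial signal)
    if best_identity >= CLUSTER_PCT:
        return "reference_otu"    # clustered against the rep-set
    return "de_novo_otu"          # retained and clustered de novo


def closed_reference_fate(best_identity):
    """Closed-reference picking discards anything below the clustering threshold."""
    return "reference_otu" if best_identity >= CLUSTER_PCT else "discarded"


def taxonomy_label(rep_identity, reference_taxon):
    """Assign the reference taxon only when the representative sequence is close enough."""
    return reference_taxon if rep_identity >= TAXONOMY_PCT else "Unclassified"


# Hypothetical identities: a human-derived read, a divergent bacterial read
# and a read closely matching a reference sequence.
examples = {"human_like": 45.0, "divergent_bacterium": 90.0, "close_match": 99.0}
fates = {name: (open_reference_fate(pct), closed_reference_fate(pct))
         for name, pct in examples.items()}
```

Under these rules a read at 90% identity is kept by open-reference picking as an unclassified de novo OTU but lost in closed-reference picking, which mirrors the behaviour reported for Alloiococcus.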
OTUs removed from sequencing data prior to biostatistical analysis
New.CleanUp.ReferenceOTU46637
New.CleanUp.ReferenceOTU1526
New.ReferenceOTU87, New.CleanUp.ReferenceOTU40460, New.CleanUp.ReferenceOTU30994, AY46848
New.ReferenceOTU91, New.CleanUp.ReferenceOTU22231
New.CleanUp.ReferenceOTU10780
New.CleanUp.ReferenceOTU91
JN187532, JF970596, GU272272, FJ347714, EF515711
New.CleanUp.ReferenceOTU3920
New.CleanUp.ReferenceOTU20538
OTUs that were not classified below family-level by taxonomic assignment based on the rep-set were further classified to obtain a genus- and species-level identification. This was achieved by aligning representative sequences of each selected OTU to the 16S ribosomal RNA sequence database and National Center for Biotechnology Information (NCBI) database using the Basic Local Alignment Search Tool (BLAST). Presumptive identification was made if an aligned sequence returned an identification coverage score of ≥97%. Where representative sequences aligned to multiple species with an identical coverage score ≥97%, a higher-level taxonomic identifier was assigned.
Rarefaction curves were generated in QIIME for all contaminant-filtered and non-filtered samples. Appropriate subsample depth was established by visual inspection of rarefaction curves to ensure adequate sample depth while retaining low read samples. It was confirmed that reducing the sequence number in this way did not result in a significant reduction in profile diversity, as determined using the Simpson’s Index of Diversity (1-D) (Additional file 1: Figure S1). Accordingly, all samples were subsampled to 400 reads. Subsampling eliminated 34% of all samples (22 out of 64 samples) including 9 out of 11 adenoid biopsy samples from the contaminant-filtered data (due to low sequencing depth). Consequently, adenoid biopsy data were not used in the calculation of diversity estimates for the comparison of filtered and non-filtered sequences.
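A minimal sketch of the subsampling step and the diversity check follows. It assumes subsampling without replacement to a fixed depth, with samples below that depth dropped, and uses the form of Simpson's Index of Diversity 1 - D where D is the sum of squared OTU proportions; the text does not state which formulation of D was used, so this is an assumption. The function names and counts are illustrative.

```python
import random
from collections import Counter


def rarefy(counts, depth, rng):
    """Subsample `depth` reads without replacement.

    counts -- {otu: read count} for one sample
    Returns a new {otu: count} dict, or None if the sample has too few reads.
    """
    pool = [otu for otu, n in counts.items() for _ in range(n)]
    if len(pool) < depth:
        return None
    return dict(Counter(rng.sample(pool, depth)))


def simpson_diversity(counts):
    """Simpson's Index of Diversity, 1 - D, with D = sum of squared proportions."""
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


rng = random.Random(0)  # fixed seed so the example is repeatable
deep_sample = {"OTU_1": 600, "OTU_2": 300, "OTU_3": 100}
shallow_sample = {"OTU_1": 100, "OTU_2": 50}

rarefied = rarefy(deep_sample, 400, rng)       # retained at 400 reads
dropped = rarefy(shallow_sample, 400, rng)     # None: fewer than 400 reads
before = simpson_diversity(deep_sample)
after = simpson_diversity(rarefied)
```

Comparing `before` and `after` for each retained sample is one way to check that subsampling has not materially reduced profile diversity.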
Diversity estimates were performed before and after OTU filtering to compare its effect across sample types. Bray-Curtis (BC) similarity matrices were created using QIIME for principal coordinates analysis (PCoA) and PRIMER v.6 (PRIMER-E Ltd, Plymouth, UK) for SIMilarity of PERcentages (SIMPER) and Analysis of similarity (ANOSIM) analyses. SIMPER was used to determine the contribution made by specific OTUs to the observed similarity between sample types before and after contaminant filtering. ANOSIM was also performed in PRIMER to test whether there was a statistically significant difference between the MEF and the combined upper-airway samples (NP and adenoid swabs) before and after contaminant filtering.
Mann–Whitney U tests or t-tests were used to test variation in these measures between the adenoid biopsies and swabs, depending on data distribution. The relative abundance of OTUs was plotted relative to amplicon concentration using GraphPad Prism v.6.04 (GraphPad Software Inc. California, USA) with the significance tested by Spearman’s correlation. Cytoscape v2.8.2 was used to create a co-occurrence model. A worksheet with the presence or absence of each OTU observed at >1.5% relative abundance was generated, showing which sample types contained identical OTUs. This spreadsheet was then uploaded to Cytoscape to generate Figure 7.
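For reference, Bray-Curtis similarity between two samples can be computed as twice the summed shared abundance divided by the total abundance of both samples. The sketch below returns a value between 0 and 1 (tools may report it as a percentage or as the complementary dissimilarity); the function name and the counts are illustrative, and any data transformation applied before the calculation in the study is not reproduced here.

```python
def bray_curtis_similarity(a, b):
    """Bray-Curtis similarity between two samples given as {otu: count} dicts."""
    otus = set(a) | set(b)
    shared = sum(min(a.get(o, 0), b.get(o, 0)) for o in otus)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        raise ValueError("both samples are empty")
    return 2.0 * shared / total


# Hypothetical rarefied counts (400 reads per sample).
sample_a = {"OTU_1": 300, "OTU_2": 100}
sample_b = {"OTU_1": 100, "OTU_2": 100, "OTU_3": 200}
similarity = bray_curtis_similarity(sample_a, sample_b)
```

A contaminant OTU present in both samples contributes to `shared`, which is how an unfiltered spurious signal can inflate the apparent similarity between sample types.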
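The presence/absence worksheet described above can be sketched as follows. The input is assumed to be a relative abundance (in percent) for each OTU in each sample type; how abundances were summarised per sample type is not stated in the text, so that input format, the function names and the example values are all assumptions. The output edge list is a generic two-column layout that a network tool could import, not a Cytoscape-specific format.

```python
def presence_table(rel_abundance, threshold=1.5):
    """Map each sample type to the set of OTUs above `threshold` percent."""
    return {
        sample_type: {otu for otu, pct in otus.items() if pct > threshold}
        for sample_type, otus in rel_abundance.items()
    }


def edge_list(presence):
    """Return sorted (sample type, OTU) pairs, one per observed presence."""
    return sorted(
        (sample_type, otu)
        for sample_type, otus in presence.items()
        for otu in otus
    )


def shared_otus(presence, type_a, type_b):
    """OTUs present in both sample types at the chosen threshold."""
    return presence[type_a] & presence[type_b]


# Hypothetical relative abundances (%) per sample type.
rel_abundance = {
    "MEF": {"OTU_1": 40.0, "OTU_2": 1.0, "OTU_3": 5.0},
    "Adenoid swab": {"OTU_1": 10.0, "OTU_2": 3.0, "OTU_3": 0.5},
}
presence = presence_table(rel_abundance)
edges = edge_list(presence)
common = shared_otus(presence, "MEF", "Adenoid swab")
```

Running the same steps on contaminant-filtered and unfiltered tables shows which shared OTUs depend on the spurious signal.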
Written informed consent was obtained from the patient’s guardian/parent/next of kin for the publication of this report and any accompanying images.
ANOSIM: analysis of similarity
BLAST: Basic Local Alignment Search Tool
NCBI: National Center for Biotechnology Information
OME: otitis media with effusion
OTU: operational taxonomic unit
PCoA: principal coordinate analysis
PEAR: Paired-End reAd mergeR
QIIME: Quantitative Insights Into Microbial Ecology
qPCR: quantitative polymerase chain reaction
SIMPER: similarity of percentages
This study was supported by the National Health and Medical Research Council (Grant 1007641).
- Goleva E, Jackson LP, Harris JK, Robertson CE, Sutherland ER, Hall CF, et al. The effects of airway microbiome on corticosteroid responsiveness in asthma. Am J Respir Crit Care Med. 2013;188:1193–201.
- Garzoni C, Brugger SD, Qi W, Wasmer S, Cusini A, Dumont P, et al. Microbial communities in the respiratory tract of patients with interstitial lung disease. Thorax. 2013;68:1150–6.
- Vaishampayan P, Probst AJ, La Duc MT, Bargoma E, Benardini JN, Andersen GL, et al. New perspectives on viable microbial communities in low-biomass cleanroom environments. ISME J. 2013;7:312–24.
- Salter SJ, Cox MJ, Turek EM, Calus ST, Cookson WO, Moffatt MF, et al. Reagent and laboratory contamination can critically impact sequence-based microbiome analyses. BMC Biol. 2014;12:87.
- Biesbroek G, Sanders EA, Roeselers G, Wang X, Caspers MP, Trzciński K, et al. Deep sequencing analyses of low density microbial communities: working at the boundary of accurate microbiota detection. PLoS One. 2012;7:e32942.
- Willner D, Daly J, Whiley D, Grimwood K, Wainwright CE, Hugenholtz P. Comparison of DNA extraction methods for microbial community profiling with an application to pediatric bronchoalveolar lavage samples. PLoS One. 2012;7:e34605.
- Lazarevic V, Gaïa N, Emonet S, Girard M, Renzi G, Despres L, et al. Challenges in the culture-independent analysis of oral and respiratory samples from intubated patients. Front Cell Infect Microbiol. 2014;4:65.
- Segal LN, Alekseyenko AV, Clemente JC, Kulkarni R, Wu B, Chen H, et al. Enrichment of lung microbiome with supraglottic taxa is associated with increased pulmonary inflammation. Microbiome. 2013;1:19.
- McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, et al. An improved Greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J. 2012;6:610–8.
- Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, et al. The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucl Acids Res. 2013;41:D590–6.
- Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15:R46.
- Schmieder R, Edwards R. Fast identification and removal of sequence contamination from genomic and metagenomic datasets. PLoS One. 2011;6:e17288.
- Leach AJ, Boswell JB, Asche V, Nienhuys TG, Mathews JD. Bacterial colonisation of the nasopharynx predicts very early onset and persistence of otitis media in Australian aboriginal infants. Pediatr Infect Dis J. 1994;13:983–9.
- Weiss S, Amir A, Hyde ER, Metcalf JL, Song SJ, Knight R. Tracking down the sources of experimental contamination in microbiome studies. Genome Biol. 2014;15:564.
- Zhou Q, Su X, Ning K. Assessment of quality control approaches for metagenomic data analysis. Sci Rep. 2014;4:6957.
- Marsh RL, Binks MJ, Beissbarth J, Christensen P, Morris PS, Leach AJ, et al. Quantitative PCR of ear discharge from indigenous Australian children with acute otitis media with perforation supports a role for Alloiococcus otitidis as a secondary pathogen. BMC Ear, Nose and Throat Disord. 2012;12:11.
- The National Center for Biotechnology Information Sequence Read Archive. http://www.ncbi.nlm.nih.gov/sra/ (accessed 011014)
- Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: a fast and accurate Illumina Paired-End reAd mergeR. Bioinformatics. 2014;30:614–20.
- Edgar RC, Haas BJ, Clemente JC, Quince C, Knight R. UCHIME improves sensitivity and speed of chimera detection. Bioinformatics. 2011;27:2194–200.
- Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, et al. QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010;7:335–6.
- Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, et al. Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009;75:7537–41.
- Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26:2460–1.
- Navas-Molina JA, Peralta-Sánchez JM, González A, McMurdie PJ, Vázquez-Baeza Y, Xu Z, et al. Advancing our understanding of the human microbiome using QIIME. Methods Enzymol. 2013;531:371–444.
- Rideout JR, He Y, Navas-Molina JA, Walters WA, Ursell LK, Gibbons SM, et al. Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ. 2014;2:e545.
- The National Center for Biotechnology Information Basic Local Alignment Search Tool. http://blast.ncbi.nlm.nih.gov/Blast.cgi (accessed 011014)
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13:2498–504.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.