LotuS: an efficient and user-friendly OTU processing pipeline
© Hildebrand et al.; licensee BioMed Central Ltd. 2014
Received: 1 April 2014
Accepted: 23 August 2014
Published: 30 September 2014
16S ribosomal DNA (rDNA) amplicon sequencing is frequently used to analyse the structure of bacterial communities from oceans to the human microbiota. However, computational power is still a major bottleneck in the analysis of continuously enlarging metagenomic data sets. Analysis is further complicated by the technical complexity of current bioinformatics tools.
Here we present the less operational taxonomic units scripts (LotuS), a fast and user-friendly open-source tool to calculate denoised, chimera-checked, operational taxonomic units (OTUs). These are the basis to generate taxonomic abundance tables and phylogenetic trees from multiplexed, next-generation sequencing data (454, illumina MiSeq and HiSeq). LotuS is outstanding in its execution speed, as it can process 16S rDNA data up to two orders of magnitude faster than other existing pipelines. This is partly due to an included stand-alone fast simultaneous demultiplexer and quality filter C++ program, simple demultiplexer (sdm), which comes packaged with LotuS. Additionally, we sequenced two MiSeq runs with the intent to validate future pipelines by sequencing 40 technical replicates; these are made available in this work.
We show that LotuS analyses microbial 16S data with comparable or even better results than existing pipelines, requiring a fraction of the execution time and providing state-of-the-art denoising and phylogenetic reconstruction. LotuS is available through the following URL: http://psbweb05.psb.ugent.be/lotus.
KeywordsOTU 16S rDNA gene Pipeline Metagenomics Demultiplexing
Next generation sequencing platforms are reducing the cost of collecting metagenomic data from large environmental and clinical microbial ecosystems. With 16S rDNA amplicon sequencing becoming a mainstream approach in these research areas, there is a need to optimize computer resources to handle this data.
Although online services exist to process 16S rDNA data such as the Ribosomal Database Project (RDP) pipeline  or the PyroTagger pipeline , a single HiSeq run can yield up to 6 × 109 sequences1, which challenges uploading capabilities. Large initiatives, like the Human Microbiome Project (HMP), have used the 16S pipelines Quantitative Insights Into Microbial Ecology (QIIME)  and mothur , applications that can be installed and run locally. mothur follows the philosophy of incorporating all tools in one software package, while QIIME (partly) relies on 3rd party software. Both pipelines present a complete package with tools to interpret the community composition, but more work-intensive components like denoising and sequence clustering are designed for cluster environments.
Sequences are filtered with a novel C++ program bundled with the pipeline, sdm, that has evolved over the course of this project from a simple demultiplexer to a general purpose sequence file formatting, quality filtering/adjustment and OTU sequence (“Seed”) picking tool, optimized for speed and a high recovery rate of sequences. Further benefits of this program include the simultaneous sorting of input sequences into “high-” and “mid-” quality sequences, dependent on the overall quality and length. These two sequence bins will be important in the further LotuS workflow: the high quality sequences are used in the sensitive OTU clustering process, reducing the number of spurious OTU’s caused by sequencing errors and not biological diversity. Both, high- and mid-quality sequences are used to count the occurrence of OTU’s in single samples, with the intent that mid-quality sequences, which are mapping to an established OTU, will not confound overall diversity measures but add to the count of given biological entities present in a sample.
Default sdm options for 454 and MiSeq sequences are provided with LotuS; these can be modified to filter input sequences after average quality, accumulated error over the sequence, quality in a freely definable window and remove 5′ low-quality bases filtered for these criteria. Furthermore, sequences are filtered for min/max length, ambiguous nucleotides, max barcode and primer errors, polynucleotide runs, and trimmed for adapter sequences, if present. sdm/LotuS accepts fasta + quality, fastq and gzipped versions of these as input. Furthermore, sdm can be used on non-demultiplexed sequences to do a quality filtering of sequences, e.g. prior to assembly [9, 10].
Filtered sequences are clustered into OTUs with UPARSE. UPARSE is implemented as described by , with the exception that the “-cluster_OTUs” command is executed with the additional parameters -uparse_maxhot 62 -uparse_maxdrop 12 that increase the number of potential best hits that are explicitly aligned. This makes the overall clustering slightly slower, but in our experience, it results into more consistent OTUs on MiSeq datasets in case studies, thereby reducing the total number of OTUs. Also, we noticed that default UPARSE can propose a minor fraction of OTUs that are overlapping with other OTUs (mapped within the desired OTU sequence similarity, e.g. 97%, to existing OTUs). OTU abundance is estimated by mapping mid- and high-quality filtered input sequences onto the newly created OTUs.
The OTU sequence, here termed OTU seed sequence, should fulfil the following criteria to represent the OTU for sequence matching, taxonomic annotation and tree reconstruction: it should be as long as possible, be close to the median of all amplicons clustered to the OTU (representing the centre of the OTU) and contain the least amount of sequencing errors. One common practice is to use a consensus sequence of all amplicons clustered to an OTU, also used by UPARSE. However, a consensus sequence could be the average of two or more strains that constitute the OTU and the UPARSE denoising algorithm is critically dependent on sequences pruned to a reduced length (typically 250bp for 454 sequencing). To resolve this, sdm searches within all high quality input sequences for a sequence matching the above criteria, in a process we call “Seed extension”. In brief, all input sequences are aligned with the consensus OTU sequence using usearch . From these, sdm selects iteratively the hit being closest to the OTU median, having the highest overall mean accumulated error and the longest overall sequence length, with low-quality 5′ nucleotides being removed. In the case of paired MiSeq or HiSeq sequences, the highest-quality pair is selected and these are then merged using flash .
Taxonomic annotation of OTUs
By default, the OTU taxonomy is derived from RDP naïve bayes classifier annotations . The minimum acceptance confidence for RDP is by default set to 0.8 but can be modified via “-rdp_thr”. Alternatively, extended OTU seed sequences are aligned against a reference 16S rDNA database with BLAST + ; currently, we support greengenes  and SILVA  16S rDNA databases, but this is easily extendable to other ribosomal databases. A lowest common ancestor algorithm is used to assign a taxonomy modified to only consider hits within 1.5% sequence identity to the overall best hit to the current OTU seed, with a limit of 200 total hits. This threshold is chosen to allow for sequencing errors to still include relevant hits, while limiting the space of possible hits within a close enough range to the current best hit; it was validated on 200 bp simulated reads (see Additional file 1) and is a good trade-off between precision and specificity of taxonomic assignments (Additional file 2). At each taxonomic level, the taxonomic consistency of each hit is evaluated. By default, a taxonomic assignment is accepted if >90% of references are the same taxa, modifiable with the LotuS parameter “-LCA_frac”. If a reference has no taxonomic information for a given level (and taxonomic assignments were consistent thus far), it is discarded. In an extreme example, this can lead to species-level taxonomic assignments even if only one reference is assigned to species level but all other references have no taxonomic information.
Furthermore, the sequence similarity of the best hit is used to delimit the taxonomic depth of an assignment. Even if the best hit has a known species name, if the identity of the OTU to the reference is 96%, the taxonomy will only be used up to genus level, as a best reference at a 96% threshold could indicate that this OTU is not represented by a reference species in the database. The default parameters are to limit species at 97%, genus at 95%, family at 93%, class at 91%, order at 88% and phylum at 78% sequence similarity, though users can change these parameters as the established 16S rRNA gene taxonomy does strictly speaking not follow consistent cutoffs for different taxonomic levels [15, 16].
Multiple alignment and phylogenetic reconstruction
The calculation of a phylogenetic tree of the extended OTU seed sequences is an optional step. The phylogenetic tree can be used to fulfil requirements for calculating diversity indices such as Faith’s phylogenetic diversity  or between sample UniFrac distances . For this step, the sequences are aligned with Clustal Ω  with default parameters for nucleotide alignments. From the aligned sequences, a phylogenetic tree is reconstructed using the gamma model of sequence evolution (options “-nt -gamma -no2nd -fastest -spr 4”) in FastTree2 , as recommended by its author (http://www.microbesonline.org/fasttree/) and is saved in Phylip format.
LotuS saves the output in the specified output folder in a structure that can be directly integrated into specialized analytical packages of statistical software packages. OTU abundance matrix, the phylogeny for each OTU, a phylogenetic tree and a .biom formatted OTU matrix are stored in this folder. Three subfolders contain a) the run logs and processing reports, b) copies of configuration files and c) higher level taxa abundance matrices.
Results and discussion
We used a simulation of 1000 greengenes  16S sequences, truncated and randomly mutated (see Additional file 1), to validate the LCA algorithm and research the influence of read length on taxonomic classification. This showed that longer reads are assigned with a higher confidence in RDP (Additional file 3a) and the fraction of 16S reads that remains unclassified using our LCA algorithm significantly decreases with longer read length, when using a reference database from which the 1,000 simulated reads were removed (Additional file 3b). With reads ≥250 bp, our simulation converged to 100% precision and specificity (Additional file 4a,b). To simulate a rare situation where no close relatives are present in the reference database, we used the LCA algorithm only on database hits that had <97% identity to the target read. Here the fraction of taxonomic assignments is in general lower and no species-level assignments were made, as expected given default parameters (Additional file 3c). Precision and specificity of assignments were lowered, and here, the longer read lengths were especially important (Additional file 4c,d).
Read quality is decreasing with increasing read length in 454 and illumina sequencing , and UPARSE improves OTU clustering by using only the high-quality 5′ DNA. The here proposed OTU seed extension is important for taxonomic classification and tree building, especially for badly characterized species. It takes advantage of the improved, fast OTU clustering, while using long, high-quality reads for taxonomic annotations and multiple sequence alignments.
We tested the validity and performance of LotuS on cecal gut samples from five different mice strains . It consists of two 454 GS FLX runs and a total of 393,070 reads, the expected read length is 400-500 nucleotides. We used five methods (described in Additional file 1) to calculate OTU abundance matrices: LotuS using RDP taxonomy (LR), LotuS using BLAST taxonomy (LB), QIIME de novo OTU creation (QDN), QIIME with reference-based OTUs (QRE) and mothur (MOT). LotuS was run in 454 mode:
lotus.pl -i [path to fasta/qual] -o [output dir] -m [mapping file] -s [sdm option file]
Richness comparisons between pipelines
2 × 454
2 × 454
2 × MiSeq
2 × MiSeq
Compositional similarity of technical replicates
The novel LotuS pipeline is able to handle small to very large 16S datasets on a personal computer and effortlessly integrate multiple sequencing runs. Computational efficiency is very high due to a selection of state-of-the-art proprietary software like UPARSE for denoising and sequence clustering and sdm for demultiplexing and sequence filtering. Comparison to other pipelines suggests a high similarity in higher taxonomic composition to existing tools, but on OTU level, the de-novo-called OTU shows an increased richness compared to closed-reference OTU calling and less richness than non-denoised de novo OTU calling, as expected. This pipeline has the advantage of state-of-the-art, flowgram-independent denoising and long, high-quality OTU sequences from the OTU seed extension step used for phylogenetic tree construction and taxonomic annotation.
Availability and requirements
Project name: LotuS, sdm.
Project home page: http://psbweb05.psb.ugent.be/lotus
Operating system(s): Linux, Mac
Programming language: Perl, C++
Other requirements: proprietary software, downloaded by autoinstaller, UPARSE
License: GNU GPL
Any restrictions to use by non-academics: licence needed.
A correction to this article has been published: http://www.microbiomejournal.com/content/2/1/37
- sdm :
- LotuS :
less OTU scripts
- OTU :
operational taxonomic unit
- rDNA :
We thank Robert Edgar for essential help during LotuS pipeline conception, as well as Matthew Hayward, Sara Viera-Silva and Samuel Chaffron for constructive comments during LotuS development.
This study received funding through Fund for Scientific Research - Flanders (FWO), the Brussels Institute for Research and Innovation, the EU Framework 7 programme (MetaCardis project), the CancerBiome project (European Research Council project reference 268985), VIB, the REGA institute, KU Leuven and EMBL.
- Cole JR, Wang Q, Cardenas E, Fish J, Chai B, Farris RJ, Kulam-Syed-Mohideen AS, McGarrell DM, Marsh T, Garrity GM, Tiedje JM: The Ribosomal Database Project: improved alignments and new tools for rRNA analysis. Nucleic Acids Res. 2009, 37 (Database issue): D141-D145.View ArticlePubMedPubMed CentralGoogle Scholar
- Kunin V, Hugenholtz P: PyroTagger: a fast, accurate pipeline for analysis of rRNA amplicon pyrosequence data. Open J. 2010, 1-8. Article 1Google Scholar
- Caporaso JG, Kuczynski J, Stombaugh J, Bittinger K, Bushman FD, Costello EK, Fierer N, Peña AG, Goodrich JK, Gordon JI, Huttley GA, Kelley ST, Knights D, Koenig JE, Ley RE, Lozupone CA, McDonald D, Muegge BD, Pirrung M, Reeder J, Sevinsky JR, Turnbaugh PJ, Walters WA, Widmann J, Yatsunenko T, Zaneveld J, Knight R, Pena AG: QIIME allows analysis of high-throughput community sequencing data. Nat Methods. 2010, 7: 335-336. 10.1038/nmeth.f.303.View ArticlePubMedPubMed CentralGoogle Scholar
- Schloss PD, Westcott SL, Ryabin T, Hall JR, Hartmann M, Hollister EB, Lesniewski RA, Oakley BB, Parks DH, Robinson CJ, Sahl JW, Stres B, Thallinger GG, Van Horn DJ, Weber CF: Introducing mothur: open-source, platform-independent, community-supported software for describing and comparing microbial communities. Appl Environ Microbiol. 2009, 75: 7537-7541. 10.1128/AEM.01541-09.View ArticlePubMedPubMed CentralGoogle Scholar
- Edgar RC: UPARSE: highly accurate OTU sequences from microbial amplicon reads. Nat Methods. 2013, 10: 996-998. 10.1038/nmeth.2604.View ArticlePubMedGoogle Scholar
- Wang Q, Garrity GM, Tiedje JM, Cole JR: Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Env Microbiol. 2007, 73: 5261-5267. 10.1128/AEM.00062-07.View ArticleGoogle Scholar
- McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, Andersen GL, Knight R, Hugenholtz P: An improved greengenes taxonomy with explicit ranks for ecological and evolutionary analyses of bacteria and archaea. ISME J. 2012, 6: 610-618. 10.1038/ismej.2011.139.View ArticlePubMedPubMed CentralGoogle Scholar
- Quast C, Pruesse E, Yilmaz P, Gerken J, Schweer T, Yarza P, Peplies J, Glöckner FO: The SILVA ribosomal RNA gene database project: improved data processing and web-based tools. Nucleic Acids Res. 2013, 41 (Database issue): D590-D596.View ArticlePubMedPubMed CentralGoogle Scholar
- Ye L, Matthijs S, Bodilis J, Hildebrand F, Raes J, Cornelis P: Analysis of the draft genome of Pseudomonas fluorescens ATCC17400 indicates a capacity to take up iron from a wide range of sources, including different exogenous pyoverdines. Biometals. 2014, 27: 633-644. 10.1007/s10534-014-9734-7.View ArticlePubMedGoogle Scholar
- Dingemans J, Ye L, Hildebrand F, Tontodonati F, Craggs M, Bilocq F, De Vos D, Crabbé A, Van Houdt R, Malfroot A, Cornelis P: The deletion of TonB-dependent receptor genes is part of the genome reduction process that occurs during adaptation of Pseudomonas aeruginosa to the cystic fibrosis lung. Pathog Dis. 2014, 71 (1): 26-38. 10.1111/2049-632X.12170.View ArticlePubMedGoogle Scholar
- Edgar RC: Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010, 26: 2460-2461. 10.1093/bioinformatics/btq461.View ArticlePubMedGoogle Scholar
- Magoč T, Salzberg SL: FLASH: fast length adjustment of short reads to improve genome assemblies. Bioinformatics. 2011, 27: 2957-2963. 10.1093/bioinformatics/btr507.View ArticlePubMedPubMed CentralGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215: 403-410. 10.1016/S0022-2836(05)80360-2.View ArticlePubMedGoogle Scholar
- Pruesse E, Quast C, Knittel K, Fuchs BM, Ludwig W, Peplies J, Glöckner FO: SILVA: a comprehensive online resource for quality checked and aligned ribosomal RNA sequence data compatible with ARB. Nucleic Acids Res. 2007, 35: 7188-7196. 10.1093/nar/gkm864.View ArticlePubMedPubMed CentralGoogle Scholar
- Luo C, Rodriguez-R LM, Konstantinidis KT: MyTaxa: an advanced taxonomic classifier for genomic and metagenomic sequences. Nucleic Acids Res. 2014, 14: 1-12.Google Scholar
- Schloss PD, Westcott SL: Assessing and improving methods used in operational taxonomic unit-based approaches for 16S rRNA gene sequence analysis. Appl Environ Microbiol. 2011, 77: 3219-3226. 10.1128/AEM.02810-10.View ArticlePubMedPubMed CentralGoogle Scholar
- Faith DP: Conservation evaluation and phylogenetic diversity. Biol Conserv. 1992, 61 (1): 1-10. 10.1016/0006-3207(92)91201-3.View ArticleGoogle Scholar
- Lozupone C, Knight R: UniFrac: a new phylogenetic method for comparing microbial communities. Appl Environ Microbiol. 2005, 71: 8228-8235. 10.1128/AEM.71.12.8228-8235.2005.View ArticlePubMedPubMed CentralGoogle Scholar
- Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, Lopez R, McWilliam H, Remmert M, Söding J, Thompson JD, Higgins DG: Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011, 7: 539-View ArticlePubMedPubMed CentralGoogle Scholar
- Price MN, Dehal PS, Arkin AP: FastTree 2—approximately maximum-likelihood trees for large alignments. PLoS One. 2010, 5: e9490-10.1371/journal.pone.0009490.View ArticlePubMedPubMed CentralGoogle Scholar
- Hildebrand F, Nguyen ATL, Brinkman B, Yunta RG, Cauwe B, Vandenabeele P, Liston A, Raes J: Inflammation-associated enterotypes, host genotype, cage and inter-individual effects drive gut microbiota variation in common laboratory mice. Genome Biol. 2013, 14: R4-10.1186/gb-2013-14-1-r4.View ArticlePubMedPubMed CentralGoogle Scholar
- Mantel N: The detection of disease clustering and a generalized regression approach. Cancer Res. 1967, 27: 209-220.PubMedGoogle Scholar
- Haas BJ, Gevers D, Earl AM, Feldgarden M, Ward DV, Giannoukos G, Ciulla D, Tabbaa D, Highlander SK, Sodergren E, Methé B, DeSantis TZ, Petrosino JF, Knight R, Birren BW: Chimeric 16S rRNA sequence formation and detection in Sanger and 454-pyrosequenced PCR amplicons. Genome Res. 2011, 21: 494-504. 10.1101/gr.112730.110.View ArticlePubMedPubMed CentralGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.