ghost-tree workflow
ghost-tree takes three inputs: (1) the Foundation Alignment (for example, the Silva 18S alignment), in which sequences are annotated with taxonomy; (2) the Extension Sequence Collection (for example, unaligned ITS sequences from the UNITE database); and (3) a Taxonomy Map, which contains taxonomic annotations of the sequences in (2). ghost-tree filters the Foundation Alignment using scikit-bio (scikit-bio.org) to remove highly gapped and high-entropy positions. Next, FastTree [10] is used, with the Jukes and Cantor model of DNA evolution [11], to build a phylogenetic tree from the filtered alignment. This is the Foundation Tree. In parallel, the Extension Sequence Collection can be clustered with SUMACLUST, an open source OTU clustering software package (https://git.metabarcoding.org/obitools/sumaclust/wikis/home/), resulting in an operational taxonomic unit (OTU) map that groups sequences into OTUs by percent identity. For each Extension Sequence OTU, a consensus taxonomy is determined from the Taxonomy Map, and OTUs with the same consensus taxonomy are combined into a single OTU. The sequences in each OTU are then aligned using MUSCLE with diagonal optimization of the first iteration and a maximum of two iterations, settings that are suitable for closely related sequences [12]. FastTree is then applied to each alignment to build an Extension Tree. This process of “alignment and tree building” is applied to all OTUs in the Extension Sequence Collection. Each OTU’s consensus taxonomy is associated with the root of its Extension Tree. The taxon at the root of each Extension Tree is then used to graft that tree onto the tip in the Foundation Tree with the same taxonomy, resulting in the ghost tree (see the workflow illustration in Fig. 1). We applied ghost-tree to build a phylogenetic tree from Silva (Ver. SSU 119.1) 18S sequences (our foundation) [13] and UNITE (Ver. 12_11_otus) ITS sequences (our extensions) [14]. This tree is available in the ghost-tree GitHub repository.
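The consensus-taxonomy and grafting steps described above can be sketched in a few lines of Python. This is a minimal illustration and not the ghost-tree implementation, which uses scikit-bio tree objects. The Node class, the function names, the "longest shared rank prefix" consensus rule, and the example lineages and sequence IDs are all assumptions made for the sketch.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """Minimal rooted-tree node (hypothetical stand-in for a scikit-bio tree)."""
    name: Optional[str] = None
    length: float = 0.0
    children: List["Node"] = field(default_factory=list)

    def tips(self):
        """Yield the tips (leaf nodes) below this node, left to right."""
        if not self.children:
            yield self
        for child in self.children:
            yield from child.tips()


def consensus_taxonomy(lineages):
    """Return the longest rank prefix shared by all lineages.

    This consensus rule is an assumption made for illustration.
    """
    shared = []
    for ranks in zip(*lineages):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return shared


def graft(foundation, extension_trees):
    """Attach each Extension Tree at the foundation tip with the same taxonomy.

    The matching tip keeps its own branch length and gains the children of
    the Extension Tree root, so the extension sequences become the new tips.
    """
    for tip in list(foundation.tips()):
        ext = extension_trees.get(tip.name)
        if ext is not None:
            tip.children = ext.children
    return foundation


# Illustrative data only: two ITS sequences in one OTU and their lineages.
lineages = [
    ["Fungi", "Ascomycota", "Saccharomycetes", "Candida"],
    ["Fungi", "Ascomycota", "Saccharomycetes", "Pichia"],
]
label = consensus_taxonomy(lineages)[-1]

foundation = Node(children=[Node("Saccharomycetes", 0.3),
                            Node("Agaricomycetes", 0.4)])
extension = Node(label, children=[Node("ITS_seq_1", 0.01),
                                  Node("ITS_seq_2", 0.02)])

ghost = graft(foundation, {label: extension})
tip_names = [tip.name for tip in ghost.tips()]
```

In this toy case the foundation tip labelled with the consensus taxonomy is replaced by the two extension sequences, while the foundation tip with no matching Extension Tree is left unchanged.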
Test data set
To compare diversity analysis results, we began with two ITS profile data sets: one containing 20 human saliva samples [15] and one containing 16 public restroom floor samples [16]. Both sample sets used the ITS-1F forward primer (5′-CTTGGTCATTTAGAGGAAGTAA-3′) [17] and the ITS2 reverse primer (5′-GCTGCGTTCTTCATCGATGC-3′) [18] to generate ITS1 sequence reads. All analyses referencing Python scripts described below were performed in MacQIIME 1.9.0-20140227 ([19]; http://www.wernerlab.org/software/macqiime). The public restroom floor sequences were generated on an Illumina MiSeq and the saliva sequences on a GS-FLX pyrosequencer. Sequence data from the two studies were combined after demultiplexing (split_libraries.py was run by the authors on the public restroom floor sequences; already demultiplexed data were obtained from NCBI SRA for the saliva samples). OTUs were picked using uclust-based closed-reference OTU picking (pick_otus.py) with UNITE OTUs v12_11 as the reference database, and the most abundant sequence in each OTU was selected as the OTU representative sequence (pick_rep_set.py).
The BIOM table was then filtered using filter_otus_from_otu_table.py to contain only OTUs with accession numbers present in the ghost tree. To create a data set with a known small effect size, we used simsam.py, which creates a specified number of phylogenetically similar simulated samples using a phylogenetic tree and the filtered BIOM table. Specifically, the small effect we looked for was whether simulated samples derived from a single source sample were more similar to each other than to simulated samples derived from different source samples. Ten simulated samples were created for each of the 36 source samples with a dissimilarity (d) of 0.6 using simsam.py, resulting in our “simulated BIOM table” and a simulated mapping file with metadata for all 360 resulting samples. Simulated BIOM tables were created using the ghost tree (which we refer to as the ghost-tree-Simulated Communities, or GTSCs) and again using a FastTree tree built from ITS sequences aligned with MUSCLE (which we refer to as the FastTree-Simulated Communities, or FTSCs).
Principal coordinates analyses
PCoA plots for unweighted and weighted UniFrac were created using beta_diversity_through_plots.py with the appropriate simulated mapping file, simulated BIOM table, and either the ghost tree or our FastTree tree as the reference tree. PCoA plots for the Jaccard distance [20], a qualitative, non-phylogenetic diversity metric, and the Bray-Curtis distance [21], a quantitative, non-phylogenetic diversity metric, were created separately using three scripts. First, beta_diversity.py was run with the methods “binary_jaccard” and “bray_curtis”, respectively, on the simulated BIOM table to produce two distance matrices (DMs). Next, principal_coordinates.py was applied to those DMs to produce principal coordinates (PC) matrices. Finally, make_emperor.py was run using the PC files and the simulated mapping file to produce PCoA plots for Jaccard and Bray-Curtis. This process was repeated for unsimulated samples and for both FTSCs and GTSCs.
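For reference, the two non-phylogenetic distances used above can be written out directly. The following is a minimal sketch of the standard binary Jaccard and Bray-Curtis definitions applied to two OTU count vectors. It is not the code behind beta_diversity.py, and the function names and example counts are hypothetical.

```python
def binary_jaccard(u, v):
    """Qualitative distance: 1 - (OTUs shared) / (OTUs present in either sample)."""
    present_u = {i for i, count in enumerate(u) if count > 0}
    present_v = {i for i, count in enumerate(v) if count > 0}
    union = present_u | present_v
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(present_u & present_v) / len(union)


def bray_curtis(u, v):
    """Quantitative distance: sum |u_i - v_i| / sum (u_i + v_i)."""
    total = sum(u) + sum(v)
    if total == 0:
        return 0.0  # two empty samples are treated as identical
    return sum(abs(a - b) for a, b in zip(u, v)) / total


# Hypothetical counts for four OTUs in two samples.
sample_a = [10, 0, 5, 0]
sample_b = [2, 3, 0, 0]

jaccard_ab = binary_jaccard(sample_a, sample_b)  # 1 shared OTU of 3 observed
bray_ab = bray_curtis(sample_a, sample_b)        # 16 / 20
```

Jaccard depends only on which OTUs are present, whereas Bray-Curtis also reflects how abundant each OTU is, which is why the two are described as qualitative and quantitative metrics, respectively.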
Diversity calculations and statistics
To test whether simulated samples derived from the same source sample were more similar than those derived from different source samples (the small effect), per-environment OTU tables (saliva and restroom floor) were created using split_otu_table.py. ANOSIM was computed using compare_categories.py to compare the distribution of distances between samples with the same source to the distribution of distances between samples with different sources. ANOSIM was run with 999 permutations, using the appropriate simulated mapping file, on six distance matrices for both FTSCs and GTSCs: Jaccard, Bray-Curtis, unweighted UniFrac with FastTree, weighted UniFrac with FastTree, unweighted UniFrac with ghost-tree, and weighted UniFrac with ghost-tree. The Jaccard distance is a qualitative, non-phylogenetic diversity metric for which no tree is required. The Bray-Curtis distance is a quantitative, non-phylogenetic diversity metric for which no tree is required. The weighted UniFrac metric includes information on the abundance of taxa in addition to the phylogenetic tree, while unweighted UniFrac uses only the presence or absence of taxa together with the tree (for details see [9]). The unweighted and weighted UniFrac distances labelled “FastTree” are calculated with a phylogenetic tree built by FastTree from only the ITS data aligned using MUSCLE.
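The difference between unweighted and weighted UniFrac can be illustrated with a small sketch. Here a tree is represented, as a simplifying assumption, by a list of branches, each given as a branch length and the set of tip OTUs below it. The weighted version shown is the raw (unnormalized) form. The function names, tree, and counts are hypothetical, and this is not the QIIME implementation.

```python
def unweighted_unifrac(branches, counts_a, counts_b):
    """Fraction of observed branch length that leads to only one sample."""
    unique = observed = 0.0
    for length, tips in branches:
        in_a = any(counts_a.get(tip, 0) > 0 for tip in tips)
        in_b = any(counts_b.get(tip, 0) > 0 for tip in tips)
        if in_a or in_b:
            observed += length
            if in_a != in_b:
                unique += length
    return unique / observed if observed else 0.0


def weighted_unifrac(branches, counts_a, counts_b):
    """Raw weighted UniFrac: branch lengths weighted by abundance differences."""
    total_a = sum(counts_a.values())
    total_b = sum(counts_b.values())
    distance = 0.0
    for length, tips in branches:
        frac_a = sum(counts_a.get(tip, 0) for tip in tips) / total_a
        frac_b = sum(counts_b.get(tip, 0) for tip in tips) / total_b
        distance += length * abs(frac_a - frac_b)
    return distance


# Hypothetical tree ((A:1,B:1):1,(C:1,D:1):1) as (length, descendant tips).
branches = [
    (1.0, {"A"}), (1.0, {"B"}), (1.0, {"A", "B"}),
    (1.0, {"C"}), (1.0, {"D"}), (1.0, {"C", "D"}),
]
counts_a = {"A": 3, "B": 1}
counts_b = {"B": 2, "C": 2}

uw = unweighted_unifrac(branches, counts_a, counts_b)  # 3 of 5 observed branches
w = weighted_unifrac(branches, counts_a, counts_b)
```

Because both metrics sum over branches of the reference tree, swapping the ghost tree for the FastTree tree changes the distances even when the OTU table is identical.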
To test whether samples could be differentiated by their environment type (restroom or saliva, the large effect), ANOSIM was computed using compare_categories.py on each of the six distance matrices for both the simulated and the unsimulated (real) BIOM tables, using the corresponding simulated or unsimulated mapping file, with 999 permutations.
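The ANOSIM comparisons described above can be sketched as follows. This is a minimal pure-Python illustration of the standard ANOSIM R statistic with a permutation-based p-value, not the implementation called by compare_categories.py. The function names, the toy distance matrix, and the random seed are hypothetical.

```python
import random
from itertools import combinations


def average_ranks(values):
    """Rank values starting at 1, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def anosim_r(dm, groups):
    """R = (mean between-group rank - mean within-group rank) / (n(n-1)/4)."""
    n = len(groups)
    pairs = list(combinations(range(n), 2))
    ranks = average_ranks([dm[i][j] for i, j in pairs])
    within = [r for r, (i, j) in zip(ranks, pairs) if groups[i] == groups[j]]
    between = [r for r, (i, j) in zip(ranks, pairs) if groups[i] != groups[j]]
    mean_within = sum(within) / len(within)
    mean_between = sum(between) / len(between)
    return (mean_between - mean_within) / (n * (n - 1) / 4)


def anosim(dm, groups, permutations=999, seed=0):
    """Return (R, p) where p is estimated by shuffling the group labels."""
    rng = random.Random(seed)
    observed = anosim_r(dm, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if anosim_r(dm, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


# Toy distance matrix: samples 0-1 and 2-3 form two well-separated groups.
dm = [
    [0.00, 0.10, 0.80, 0.90],
    [0.10, 0.00, 0.70, 0.85],
    [0.80, 0.70, 0.00, 0.20],
    [0.90, 0.85, 0.20, 0.00],
]
groups = ["saliva", "saliva", "restroom", "restroom"]

r_stat, p_value = anosim(dm, groups, permutations=999)
```

An R value near 1 indicates that distances between groups are consistently larger than distances within groups, and a value near 0 indicates no separation. With only four samples the permutation p-value cannot be small, which is one reason the analyses above rely on many samples per category.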
ghost-tree software
ghost-tree is hosted on GitHub under the BSD open source software license. It is implemented in Python, using scikit-bio (www.scikit-bio.org) and Click (http://click.pocoo.org/), and adheres to the PEP8 Python style guide. ghost-tree is subject to continuous integration testing using Travis CI, which, on each pull request, runs unit tests with nose, checks code style using flake8, and monitors test coverage with Coveralls.