CAMISIM allows customization of many properties of the generated communities and data sets, such as the overall number of genomes (community complexity), strain diversity, the community genome abundance distributions, sample sizes, the number of replicates, and sequencing technology used. For setting these options, a configuration file is needed, which is described in Additional file 1. Simulation with CAMISIM has three stages (Fig. 1):
-
1
Design of the community, which includes selection of the community members and their genomes, and assigning them relative abundances,
-
2
Metagenome sequencing data simulation, and
-
3
Postprocessing, where the binning and assembly gold standards are produced.
Community design
In this step, the community genome abundance profiles, called Pout, are created. These also represent the gold standard for taxonomic profiling and, from the strain to the superkingdom rank, specify the relative abundances of individual strains (genomes) or their parental taxa in percent. In addition, a genome sequence collection for the strains in Pout is generated. Both Pout and the genome sequence collection are needed for the metagenome simulation in step 2. The taxonomic composition of the simulated microbial community is either determined by user-specified taxonomic profiles or generated de novo by sampling from available genome sequences.
Profile-based design
Taxonomic profiles can be provided in BIOM (Biological Observation Matrix) format [20]. With input profiles, the NCBI complete genomes [21] are used as the sequence collection for creating metagenome data sets. Optionally, the user can choose to also include genomes marked as “scaffold” or “contig” by the NCBI. Input genomes are split at positions with multiple occurrences of ambiguous bases, such that no reads spanning contig borders within larger scaffolds are simulated.
Profiles can include bacterial, archaeal, and eukaryotic taxa, as well as viruses. The taxonomic identifiers of BIOM format are interpreted as free text scientific names and are mapped to NCBI taxon IDs (algorithm in Additional file 1). The so generated input profile Pin specifies pairs (t,abt) of taxon IDs t and taxon abundances \(ab_{t} \in \mathbb {R}_{\geq 0}\). The profile taxa are usually defined at higher ranks than strain and thus have to be mapped approximately to the genome sequence collection for creating Pout.
Given an ordered list of ranks R = (species, genus, family, order, class, phylum, superkingdom), CAMISIM requires as an additional parameter a highest rank rmax∈R. We define the binary operator ≺ based on the ordering of the ranks in R. Given two ranks, ri,rj∈R, we write ri≺rj, if ri appears before rj in R, and we say ri is below rj. Related complete genomes are searched for all ranks below rmax. By default, this is the family rank. Another parameter is the maximum number of strains m that are included for an input taxon in a simulated sample.
To create Pout from Pin, the following steps are performed: let Gin be the set of taxon IDs of the genome collection at the lowest annotated taxonomic rank, usually species or strain. For all t∈Gin, the reference taxonomy specifies a taxonomic lineage of taxon IDs (or undefined values) across the considered ranks in R. We use these to identify a collection of sets F={Gt | t=lineage taxon represented by≥1 complete genome}, which specifies for each lineage taxon the taxon IDs of available genomes from the genome collection. F is used as input for Algorithm 1.
The algorithm retrieves for each t from the tuples (t,abt)∈Pin the lineage path taxt across the ranks of R (lines 2–3). Moving from the species to the highest considered rank, rmax, the algorithm determines whether for a lineage taxon tr at the considered rank r a complete genome exists, that is, whether Gt≠∅ for t=tr (lines 4–5). If this is the case, the search ends and tr is considered further (line 6). If no complete genome is found for a particular lineage, the lineage is not included in the simulated community, and a warning is issued (line 20). Next, the number of genomes X with their taxonomic IDs tr to be added to Pout is drawn from a truncated geometric distribution (Eq. 1, line 8) with a mean of \(\mu = \frac {m}{2}\) and the parameter k restricted to be less than m.
$$ P(X = k) = \left(1 - \frac{1}{\mu} \right)^{k} \cdot \frac{1}{\mu} $$
(1)
If \(|G_{t_{r}}|\) is less than X, \(G_{t_{r}}\) is used entirely as Gselected, the genomes of tr that are to be included in the community. Otherwise X genomes are drawn randomly from \(G_{t_{r}}\) to generate Gselected (lines 9–12). It is optional to use genomes multiple times, by default the selected genomes g∈Gselected are removed from F, such that no genome is selected twice (line 17). Based on the taxon abundances abt from Pin, the abundances abi of the selected taxa i∈Gselected for t are then inferred. First, random variables Yi are drawn from a configurable lognormal distribution, with by default normal mean μ=1 and normal standard deviation σ=2 (Eq. 2), and then the abi are set (Eq. 3; lines 13–15). Finally, the created pairs (i,abi) are added to Pout (line 16) and Pout is returned (line 21).
$$ \begin{aligned} Y_{i} &\sim \text{Lognormal}(\mu,\sigma) \\ \Leftrightarrow \frac{d}{dx}P(Y_{i}\leq x) &= \frac{1}{x\sigma\sqrt{2\pi}} \, e^{-\frac{ \left(\ln x - \mu \right)^{2}} {2\sigma^{2}}} \end{aligned} $$
(2)
$$ ab_{i} = \frac{Y_{i}}{\sum_{j\in G_{\text{selected}}}Y_{j}} \cdot ab_{t} $$
(3)
De novo design
A genome sequence collection to sample and a mapping file have to be specified. The mapping file defines for each genome a taxonomic ID (per default from the NCBI taxonomy), a novelty category and an operational taxonomic unit (OTU) ID. Grouping genomes into OTUs is required for sampling related genomes, to increase strain-level diversity in the simulated microbial communities. The novelty category reflects how closely a query genome is related to draft or complete genomes in a genome sequence reference collection. This is used to maximize the spread of selected genomes across the range of taxonomic distances to the genome reference collection, such that there are genomes included of “novel” strains, species, or genera. This distinction is relevant for evaluating reference-based taxonomic binners and profilers, which may perform differently across these different categories. The user can manually generate the mapping file as described in Additional file 1 or in [15].
If controlled sampling of strains is not required, every genome can be assigned to a different OTU ID. If no reference-based taxonomic binners or profilers are to be evaluated, or the provided genome sequence collection does not vary much in terms of taxonomic distance to publicly available genomes used as references for these programs, all genomes can be assigned the same novelty category.
In addition, the number of genomes greal to be drawn from the input genome selection and the total number of genomes gtot for the community genome abundance profile Pout have to be specified. The greal real genomes are drawn from the provided genome sampling collection. An equal number of genomes is drawn for every novelty category. If the number of genomes for a category is insufficient, proportionately more are drawn from others. In addition, CAMISIM simulates gsim=gtot−greal genomes of closely related strains from the chosen real genomes in total. These genomes are created with an enhanced version of sgEvolver [22] (Additional file 1: Methods) from a subset of randomly selected real genomes. Given m, the maximum number of strains per OTU, up to m−1 simulated strain genomes are added per genome. The exact number of genomes X to be simulated for a selected OTU is drawn from a geometric distribution with mean μ=0.3−1 (Eq. 1). This procedure is repeated until gsim-related genomes have been added to the community genome collection, comprising gtot=greal+gsim genomes [15].
Next, community genomes are assigned abundances. The relevant user-defined parameters for this step are the sample type and the number of samples n. In addition to single samples, multi-sample data sets (with differential abundances, replicates or time series) have become widely used in real sequencing studies [23–26], also due to their utility for genome recovery using covariance-based genome binners such as CONCOCT [27] or MetaBAT [28]. Several options for creating multi-sample metagenome data sets with these setups are provided:
-
1
If simulating a single sample data set, the relative abundances are drawn from a lognormal distribution, which is commonly used to model microbial communities [29–32]. The two parameters of the lognormal distribution can be changed. By default, the mean is set to 1 and the standard deviation to 2 (Eq. 2). Setting the standard deviation σ to 0 results in a uniform distribution.
-
2
The differential abundance mode models a community sampled multiple times after the environmental conditions or the DNA extraction protocols (and accordingly the community abundance profile) have been altered. This mode creates n different lognormally (Eq. 2) distributed genome abundance profiles.
-
3
Metagenome data sets with multiple samples with very similar genome abundance distributions can be created using the replicates mode. Having multiple replicates of the same metagenome has been reported to improve the quality for some metagenome analysis software, such as for genome binners [23, 27, 33, 34]. Based on an initial log-normal distribution D0, n samples are created by adding Gaussian noise to this initial distribution (Eq. 4). The Gaussian term accounts for all kinds of effects on the genome abundances of the metagenomic replicates including, but not limited to, different experimenters, different place of extraction, or other batch effects.
$$ \begin{aligned} D_{i} = D_{0} + \varepsilon \, \text{with} \; \varepsilon & \sim N(0,1) \; \text{and} \\ \varepsilon &\sim N(0,1) \\ \Leftrightarrow \frac{d}{dx}P(\varepsilon \leq x) &= \frac{1}{\sqrt{2\pi}} \cdot e^{-\frac{1}{2}x^{2}} \end{aligned} $$
(4)
-
4
Time series metagenome data sets with multiple related samples can be created. For these, a Markov model-like simulation is performed, with the distribution of each of the n samples (Eq. 5) depending on the distribution of the previous sample plus an additional either lognormal (Eq. 2) or Gaussian (Eq. 4) term. This emulates the natural process of fluctuating abundances over time and ensures that the abundance changes to the previously sampled metagenome do not grow very large.
$$ \hfill \begin{aligned} D_{i} &= D_{i-1} + \varepsilon & \text{with} \\ D_{0} &\sim \text{Lognormal}(\mu,\sigma) & \text{and} \\ \varepsilon &\sim N(0,1) & \text{or} \\ D_{i} &= \frac{D_{i-1} + \varepsilon}{2} & \text{with} \\ \varepsilon &\sim \text{Lognormal}(\mu,\sigma) & \\ \end{aligned} \hfill $$
(5)
Metagenome simulation
Metagenome data sets are generated from the genome abundance profiles of the community design step. For each genome-specific taxon t and its abundance (t,abt)∈Pout, its genome size st, together with the total number of reads n in the sample, determines the number of generated reads nt (Eq. 6). The total number of reads n is the overall sequence sample size divided by the mean read-length of the utilized sequencing technology.
$$ n_{t} = n \cdot \frac{ab_{t} \cdot s_{t}}{\sum_{i \in P_{\text{out}}}ab_{i} \cdot s_{i}} $$
(6)
By default, ART [35] is used to create Illumina 2 × 150 bp paired-end reads with a HiSeq 2500 error profile. The profile has been trained on MBARC-26 [36], a defined mock community that has already been used to benchmark bioinformatics software and a full-length 16S rRNA gene amplicon sequencing protocol [37, 38], and is distributed with CAMISIM. Other ART profiles, such as the one used for the first CAMI challenge, can also be used. Further available read simulators are wgsim (https://github.com/lh3/wgsim, originally part of SAMtools [39]) for simulating error-free short reads, pbsim [40] for simulating Pacific Biosciences data and nanosim [41] for simulating Oxford Nanopore Technologies reads. The read lengths and insert sizes can be varied for some simulators.
For every sample of a data set, CAMISIM generates FASTQ files and a BAM file [39]. The BAM file specifies the alignment of the simulated reads to the reference genomes.
Gold standard creation and postprocessing
From the simulated metagenome data sets—the FASTQ and BAM files—CAMISIM creates the assembly and binning gold standards. The software extracts the perfect assembly for each individual sample, and a perfect co-assembly of all samples together by identifying all genomic regions with a coverage of at least one using SAMtools’ mpileup and extracting these as error-free contigs. This gold standard does not include all genome sequences available for the simulation, but the best possible assembly of their sampled reads.
CAMISIM generates the genome and taxon binning gold standards for the reads and assembled contigs, respectively. These specify the genome and taxonomic lineage that the individual sequences belong to. All sequences can be anonymized and shuffled (but tracked throughout the process), to enable their use in benchmarking challenges. Lastly, files are compressed with gzip and written to the specified output location.