Characterization of sequence coverage patterns associated with different transduction modes in model systems
Specialized transduction by Escherichia coli prophage λ
We used the well-studied specialized transducing bacteriophage λ to analyze the sequencing coverage patterns produced by specialized transduction. In specialized transduction, a prophage, which is integrated in the chromosome of the bacterial host, packages host genome-derived DNA with low frequency due to imprecise excision from the genome upon prophage induction. Prophage λ integrates between the gal (galactose metabolism) and bio (biotin metabolism) operons in the E. coli genome. In rare cases, λ excision is imprecise and either the gal or the bio operon is excised and packaged in the phage particle (Fig. 2a) [27]. This packaging of adjacent E. coli host-derived DNA can lead to the transduction of the bio and the gal operons. Transduction of recipient cells can be temporary or permanent, depending on if the DNA gets recombined into the chromosome or remains as an extrachromosomal element, which is diluted out in the population during cell divisions [27].
Using the transductomics approach, we found that coverage of the E. coli genome with sequencing reads derived from purified λ phage particles is almost exclusively restricted to the λ phage integration site and two ~ 25 kbp regions on the left and right of the λ integration site (Fig. 2b, c). These flanking regions with read coverage represent the regions that are transduced by λ phage as indicated by the presence of the bio and the gal operons in these flanking regions (Fig. 2c). The coverage of the λ prophage region is roughly 10,000-fold greater than the coverage of the flanking transduced regions indicating that only a small number of phage particles actually carry transduced DNA and thus are specialized transducing particles.
Using the E. coli-prophage λ model, we show that specialized transduction by a prophage produces a unique read coverage pattern. Furthermore, analysis of the read coverage pattern of the transduced DNA region adjacent to the prophage DNA allows determination of both the size and content of the transduced host genome region (~ 50 kbp in total in case of λ), as well as estimation of the frequency with which transducing particles are produced (1:10,000 in case of λ). The number of transducing particles produced based on our data is roughly 100-fold higher than previously reported values for successful transduction of the gal operon by phage λ (1:1,000,000 successful transductions per λ particles) [28], which indicates that only a small fraction of λ carrying host DNA ultimately leads to successful transduction.
Generalized transduction of the Salmonella enterica serovar typhimurium LT2 genome by phage P22 and the E. coli genome by phage P1
We used two well-studied generalized transducing bacteriophages P22 and P1 to analyze the sequencing coverage patterns produced during generalized transducing events. In generalized transduction, nonspecific host chromosomal DNA is packaged into phage particles during lytic infection and can then be injected into a new host cell (Fig. 3a). The DNA can then recombine into the host chromosome by homologous recombination.
98.2% of the sequencing reads from purified P22 particles mapped to the P22 genome leaving 1.8% of reads that map to the S. enterica genome. The percentage of P22 particles mapping to the S. enterica genome corresponds to the reported percentage of 1.5% transducing P22 particles (i.e., carry host DNA) previously reported [29]. The mapped P22-derived reads covered the S. enterica genome unevenly, while whole genome sequencing of S. enterica yielded even coverage (Fig. 3b). Regions of high or low P22 read coverage corresponded in 23 out of 28 previously reported transduced chromosomal markers [30] (Fig. 3b). Only one region at around 4 Mbp, for which high transduction frequencies had been reported, did not show high coverage (Fig. 3b), which might be due to differences in pac sites within this region between the S. enterica strain used in our study and the strain used in 1982.
The coverage of P22-derived reads showed a distinct pattern of peaks that rise vertically on one side and decline slowly over several 100 kbp increments on the other side. We speculate that the vertical edge of the peak corresponds to the location of the pac site at which the packaging of DNA into phage heads is initiated and that the slope of the peak indicates the range of processivity of the headful packaging mechanism (i.e., how many headfuls are packaged into particles before the packaging apparatus dissociates from the chromosome). This speculation is based on several facts: (1) the size of host DNA carried by transducing particles corresponds to the size of the P22 genome (~ 44 kbp) [31]; (2) the P22 genome is replicated by rolling circle replication, which produces long concatemers of P22 DNA. A specific sequence on the phage DNA (pac site) initiates the packaging of these concatemers into phage heads using a headful mechanism [31]; (3) the packaging of phage DNA continues sequentially along the P22 genome concatemer with a decreasing probability for each next headful to be encapsulated in a phage particle [30]; (4) there are five to six sequences on the S. enterica genome that are similar to the pac site, which leads to packaging of Salmonella DNA into P22 particles upon P22 infection, albeit with much lower frequency as compared to P22 DNA [30].
For E. coli phage P1, the majority of sequencing reads from purified P1 particles mapped to the P1 genome and only 4.5% of the reads mapped to the E. coli genome. The percentage of transducing P1 phages was previously reported to be 6% [32]. We also observed that the P1-derived reads mapping to the E. coli genome covered the genome unevenly. However, the pattern was less pronounced as compared to P22 and S. enterica (Fig. 3c). This low unevenness in sequencing read coverage corresponds to previous data on transduction frequencies of chromosomal markers, which found a maximum transduction frequency across the E. coli genome of 10-fold [33].
Sequencing host DNA carried in generalized transducing phages reveals uneven read coverage patterns along the host genome indicative of transduction. These patterns vary in magnitude depending on the transducing phage and they can only be observed if read coverage is analyzed along long stretches of the host genome covering multiples of the length of the DNA carried by the transducing phage, e.g., in case of P22 44 kbp. Additionally, the patterns also provide an indication of the frequency with which different regions of the host genome are transduced, as well as the locations of the pac sites.
Hijacking of helper prophage by a phage-related chromosomal island in Enterococcus faecalis and lateral transduction by prophages
Certain chromosomal islands, including pathogenicity islands and integrative plasmids, are mobilized using helper phages [14, 34]. This is a form of molecular piracy in which structural proteins of the helper phage are hijacked by the chromosomal island and used as a vehicle for the transfer of the island to other cells. We used E. faecalis VE14089, which is a natural resident of the human intestine and causes opportunistic infections, to study the sequencing coverage patterns produced by chromosomal island transfer by way of a helper phage. E. faecalis VE14089 is host to a chromosomal island (EfCIV583) that uses structural proteins from a helper phage (φ1) for transfer [15, 34, 35] (Fig. 4a). E. faecalis VE14089 possesses five additional prophage-like elements (pp2 to pp6). Some of these prophages contribute to E. faecalis pathogenicity and confer an advantage during competition with other E. faecalis strains in the intestine [15, 36].
Read coverage differs widely between the different prophage like elements with the coverage of EfCIV583 exceeding the coverage of all other elements by almost an order of magnitude (Fig. 4b). This finding is in line with previous results that showed that EfCIV583 DNA is more abundant than all other prophage DNA in purified VLPs from E. faecalis V583 [36], an isogenic strain of VE14089. Interestingly, pp2, pp4, and pp6 did not yield coverage peaks in the VLP fraction, which confirms previous observations that these prophage elements are not excised under the conditions that we used for prophage induction [15].
For φ1, pp5 and EfCIV583 we see patterns (coverage slopes visible in the log-scale coverage plot) that indicate that not only the chromosomal island is transduced but also regions adjacent to these prophages and the chromosomal island. Based on the maximal coverage of the transduced regions versus the coverage of the prophage regions (Fig. 4b), we estimate the maximal frequencies of transduction to be 1:500 for φ1, 1:240 for pp5, and 1:43 for the left side of EfCIV583 and 1:3780 for the right side of EfCIV583. These relatively high transduction frequencies and the fact that transduced regions span several hundred kbp facing unidirectional from the integration site of the prophages and EfCIV583 suggest a lateral transduction mechanism as described by Chen et al. [12]. For pp3, read coverage was overall much lower and no obvious transduction pattern was observed.
Our data also revealed that there are several additional regions in the E. faecalis VE14089 genome that had an elevated sequencing coverage in the purified phage sample suggesting that these regions encode additional elements that are transported in VLPs. These elements consist of IS-Elements that carry a transposase and surprisingly the three rRNA operons. For the rRNA operons, the coverage has a deep valley between the 16S and the 23S rRNA gene suggesting that a specific mechanism for rRNA gene transport is present or that the processed rRNAs were sequenced. We can currently think of four explanations for this intriguing pattern. First, potentially ribosomes are enriched alongside the VLPs in our VLP purification method. However, if this were the case we should have observed similar patterns in VLP fractions of other pure culture organisms, which we did not. Second, intact ribosomes are packaged by VLPs produced in E. faecalis. However, this leaves open the question of why the rRNA from these ribosomes was amenable to sequencing by the Illumina method used, which should not enable direct sequencing of RNA. Third, the E. faecalis genome contains a gene for a group II intron reverse transcriptase, which could potentially reverse transcribe rRNA into short DNA fragments that would be packaged into VLPs. This explanation is highly speculative, however, as the enzyme has not been experimentally characterized and it would require broad off-target activity to reversely transcribe a significant amount of rRNA. Fourth, genomic DNA with rRNA genes are packaged with high specificity into one or several types of VLPs from E. faecalis.
Our results show that the transduction of a chromosomal island by a prophage can produce a similar coverage pattern as an induced prophage, indicating that chromosomal islands are easy to detect based on read mapping coverage. However, they can only be distinguished from prophage by annotation of the genes and genomic regions and potentially based on their often times smaller size.
Transport of Bacillus subtilis genome by gene transfer agent-like element PBSX
We induced the gene transfer agent (GTA)-like element PBSX from the B. subtilis ATCC 6051 genome to study the sequencing coverage pattern produced by the supposedly randomized incorporation of fragments from the whole genome into GTA type VLPs. PBSX is a defective prophage that randomly packages 13 kbp DNA fragments of the B. subtilis genome in a GTA-like fashion (Fig. 4c) [10, 38, 39]. In contrast to other GTAs, it does not transfer the packaged DNA between cells but rather acts similar to a bacteriocin against B. subtilis cells that do not carry the PBSX gene cluster [10].
DNA sequencing reads derived from purified PBSX particles covered the B. subtilis genome unevenly with a maximum 30-fold difference between the lowest and highest covered regions (Fig. 4d). Reads from whole genome sequencing of B. subtilis covered the genome evenly slightly increasing toward the origin of replication, as expected [40]. The genomic region containing PBSX had a lower read coverage in VLP particle-derived reads as compared to neighboring genomic regions. This is consistent with results from a previous study where it was found that a genetic marker integrated in the PBSX region was less frequently packaged into particles as compared to a marker in a neighboring region [41]. Interestingly, the genomic region containing the prophage SPbeta, which gets excised upon mitomycin C treatment [42], did not show any higher or lower coverage in the VLP particle-derived sequencing reads as compared to neighboring genomic regions (Fig. 4d).
Our results show that packaging of host DNA by the GTA-like PBSX element of B. subtilis produces a distinct and non-random sequencing coverage pattern that bears similarities to the read coverage pattern produced by the generalized transducing phage P1 (Fig. 3c).
Detection limits of the approach
The patterns for different transduction modes have distinct characteristics that will impact sensitivity of detection and the false positive rate. For prophage induction and specialized transduction pattern detection, there are three potential challenges: (1) the length of the genome sequence fragment (contig) used for read mapping needs to be sufficiently long to encompass both the prophage genome, as well as a portion of the host genome; (2) potential assembly artifacts (chimeric contigs consisting of multiple source genomes) can lead to highly uneven read coverage that could look similar to an induced prophage pattern. In the case of our approach, this is mitigated by the fact that we map whole metagenome and VLP reads to the same contigs and thus we expect to obtain even read coverage for the whole metagenome read mapping, which is indicative of correct assembly; and (3) if read coverage is too low, patterns will not be sufficiently distinct. It can be expected that frequency of specialized transduction is specific to host species and prophages. Nevertheless, we tested the lower limit of read coverage levels needed for detection of prophage induction and specialized/lateral transduction by down sampling read numbers for the VLP reads from E. coli prophage λ and E. faecalis prophages (Figs. 2 and 4b) to achieve coverage levels similar to what we observed for our mouse case study below, which ranged from several tenfold to several thousand fold. For E. coli prophage λ, we found that at ~ 6000× maximum read coverage (5% of total reads), the specialized transduction pattern was still weakly visible, but disappeared at lower coverages, while the induction of the prophage itself was still identifiable at read coverages of 20× (0.01% of total reads) and less (Fig. S1a). For the E. faecalis prophages, specialized/lateral transduction patterns were still visible at ~ 500× coverage for the pp1 region and at ~ 150× for the pp5 region. Prophage induction was detectable at coverages well be low 40× (Fig. S1b). These results indicate that specialized and lateral transduction, as well as prophage induction, can sufficiently be detected with read coverages obtained in shotgun metagenomic sequencing of VLPs.
For generalized transduction and GTA-mediated DNA transfer pattern detection, the two main challenges are (1) potential generation of similar patterns by contamination of the ultra-purified VLPs with DNA from microbial cells, which can for example be addressed by comparing contig rank abundances between whole metagenome and VLP read coverage (see below in case study) and (2) difficulty to recognize the pattern on short contigs, because sloping can extend across 100s of kbp. To test if generalized transduction or GTA-like patterns can be detected on shorter contigs we used the P22, P1 and PBSX data to simulate how contig length impacts pattern visibility. For this, we looked at the coverage patterns of 200 kbp long stretches in the genome (Fig. S2). We found that detectability of generalized and GTA-like patterns in 200 kbp sequence stretches depended on where the 200 kbp stretch was located within the overall read coverage pattern. In some cases, distinguishable coverage sloping was observed (e.g., #2 in Fig. S2a and #2 in S2c); in other cases, coverage looked even or irregular (e.g., #4 in S2b and #4 in S2c). These results indicate that generalized transduction and GTA-mediated DNA transfer can be detected from contig lengths produced using short read shotgun metagenomics of microbiome samples; however, some DNA transfer events are likely missed if the longer contigs do not cover regions that show the characteristic coverage sloping associated with these transfer events.
Case study: high occurrence of transduction in the intestinal microbiome
We next assessed the power and application of our transductomics approach for detecting transduced DNA in VLPs from complex microbiomes. We sequenced the whole metagenome (~ 390 mio reads) and VLPs (~ 360 mio reads) from a fecal sample of one mouse to high coverage. The VLPs were ultra-purified using the multi-step procedure shown in Fig. 1, for which we previously showed that it efficiently removes DNA from microbial cells and the mouse present in fecal samples [24]. We were able to assemble 2143 contigs > 40 kbp from the whole metagenome reads with the largest contig being 813 kbp (ENA accession for assembly: ERZ1273841). We discarded contigs < 40 kbp because detection of transduction patterns requires coverage analysis of a sufficiently large genomic region. We mapped the metagenomic and VLP reads to the contigs > 40 kbp to obtain the read coverage patterns. For complete metagenome, 44% of all reads mapped to the contigs > 40 kbp and for the VLPs 10% of all reads indicating that a large portion of DNA carried in VLPs is derived from prophages and microbial hosts. Of the 2143 contigs, 1957 showed a “standard” read coverage pattern (Fig. 5a, Suppl. Table S1), i.e., high even coverage of the contigs with metagenomic reads and low even or no coverage with VLP reads, indicating no mobilization of host DNA in VLPs. The remaining 186 contigs (8.6% of all contigs > 40 kbp) showed a read coverage pattern that indicates potential mobilization of DNA in VLPs (Fig. 5b-f, Suppl. Table S1).
To verify that the multi-step VLP ultra-purification procedure employed for this study efficiently removes DNA from microbial cells, ruling out potential microbial host DNA contamination, we further assessed read coverage patterns for the 186 contigs in comparison to all contigs. We ranked all 2143 contigs by their normalized coverage for both the whole metagenome and the purified VLP samples (i.e., average × fold read coverage/sum of average × fold read coverage for the sample) with the highest normalized coverage being assigned rank 1 (Table S1). The expectation is that contigs from which DNA is carried in VLPs have the same or lower rank for the VLP sample as compared to the whole metagenome sample, while the rank for contigs for which VLP reads are derived from microbial contamination should have a higher rank as compared to the whole metagenome reads, because randomly contaminating DNA would be depleted in the purified VLP sample. We found that 26 out of the 186 contigs with coverage patterns suggesting DNA mobilization had a normalized coverage based rank that was higher for VLP read coverage than for whole metagenome read coverage indicating that these 26 patterns are potentially due to contamination or alternatively due to very low efficiency of mobilization.
We classified all contigs taxonomically using CAT [43] (Suppl. Tables S2 and S3). The majority of contigs were classified as Bacteroidetes (all contigs: 805, transduction pattern contigs: 83), Firmicutes (all: 586, transduction pattern: 42), Proteobacteria (all: 89, transduction pattern: 3), or not classified at the phylum level (all: 527, transduction pattern: 34). The distribution of assigned contig taxonomy corresponds to what is expected for the microbiota of a healthy mouse [44]. We found that with a few exceptions the relative abundance of contigs assigned to specific phyla was similar between the set of all contigs > 40 kbp and the subset of contigs with transduction patterns. The phyla that differed in relative contig abundance were Proteobacteria with less than half the relative abundance in the contigs with transduction patterns, Verrucomicrobia with 3.5× and Candidatus Saccharibacteria with 11.5× the contig abundance in the contigs with transduction patterns. Since members of Cand. Saccharibacteria have been shown to be extremely small (200 to 300 nm) [45], it is likely that they share similar properties with bacteriophages in terms of size and density and thus might get enriched in the VLP fraction. In fact, all transduction patterns of Cand. Saccharibacteria contigs were classified as “unknown” or “unknown, potentially a small bacterium” prior to knowing the taxonomic identity of the contigs.
We classified the type of DNA mobilization/transduction in the 186 contigs with a mobilization pattern based on the visual characteristics of the mobilized region in the VLP read coverage, as well as based on annotated genes within the mobilized region. For example, we classified mobilization patterns as prophage if the characteristic pattern showed high coverage with sharp edges on both sides (compare Fig. 2) and the presence of characteristic phage genes (e.g., capsid proteins) as an additional but not required criterion.
We observed 74 contigs that indicated induced prophages. Of these, 12 (16%) prophages showed indications of specialized transduction, i.e., read coverage above the base level of the contig in regions adjacent to the prophage (Fig. 5b and d). Additionally, we classified 8 patterns as potential prophages or chromosomal islands, as they showed the same pattern as other prophages, but we were unable to find recognizable phage genes in the annotations.
We found patterns of potential generalized transduction or GTA-carried DNA in 46 contigs; however, some patterns were observed for shorter contigs and could thus potentially be incorrect classifications (Fig. 5c). One of the contigs (NODE_5, classified as Bacteria) with a generalized transduction or GTA pattern additionally showed a sharp coverage drop in a ~ 15 kbp region only in the VLP reads (Fig. 5c). This region is flanked by a tRNA gene and carries one gene annotated as a potential virulence factor, internalin used by Listeria monocytogenes for host cell entry [46]. This region might represent a chromosomal island that was excised from the bacterial chromosome prior to or during production of the unknown VLP and that did not get encapsulated in the VLP. Alternatively, similar strains might be present in the sample, but only some carry the chromosomal island and strains carrying the chromosomal island are less prone to producing the VLPs, e.g., by superinfection resistance provided by the chromosomal island against a generalized transducing phage.
We observed 9 patterns that showed strong differences between whole metagenome read coverage and VLP read coverage, but that did not correspond to any of the patterns we analyzed in our proof-of-principle work. However, based on gene annotations, we determined that these patterns likely represent retrotransposons or other transposable elements. For example, on contig NODE_1640 (classified as Bacteria by CAT), we observed high coverage with VLP reads on one part of the contig, which carries a gene annotated as a retrotransposon (Fig. 5e). Interestingly, the retrotransposon region is flanked by a ltrA gene which is encoded on bacterial group II intron and encodes maturase, an enzyme with reverse transcriptase and endonuclease activity [47]. Surprisingly, the region containing the ltrA gene had above average coverage in the whole metagenome reads, but no coverage in the VLP reads. This suggests that the intron actively reverse splices into expressed RNA with subsequent formation of cDNA [47] leading to increased copy number of this genomic region. As another example, on contig NODE_1223 (classified as Bacteria by CAT) a region containing a transposase gene is strongly overrepresented in the VLP reads suggesting that this region is a transposable element that is packaged into a VLP (Fig. 5e).
Finally, we determined that two patterns are likely lytic phages and 47 patterns are classified as “unknown” transduced DNA, as the coverage pattern is uneven indicating transport in VLPs but we could not determine the type of transport. To provide an example, in contig NODE_646 (classified as Clostridiales by CAT), we observed a potential prophage pattern in which we found some of the main phage relevant genes such as major capsid protein; however, within the prophage pattern, we observed high coverage spikes for which we currently have no good explanation (Fig. 5f).