Uncovering a hidden diversity: optimized protocols for the extraction of bacteriophages from soil

Background Bacteriophages are the most numerous biological entities on earth and play a crucial role in shaping microbial communities. Investigating the bacteriophage community in soil samples will not only shed light on the still largely unknown phage diversity, but may also yield novel insights into phage biology and functioning. Unfortunately, the study of soil viromes lags far behind that of any other ecological model system, because the heterogeneous soil matrix raises major technical difficulties in the extraction process. Resolving these technical challenges and establishing a standardized extraction protocol is therefore a fundamental prerequisite for replicable results and comparative virome studies. Results We here report the optimization of protocols for the extraction of bacteriophage DNA from soil preceding metagenomic analysis, such that the protocol can equally be harnessed for phage isolation. As an optimization strategy, soil samples were spiked with a viral community consisting of phages from different families (10⁶ PFU/g soil): Listeria phage ΦA511 (Myovirus), Staphylococcus phage Φ2638AΔLCR (Siphovirus), and Escherichia phage ΦT7 (Podovirus). The efficacy of bacteriophage (i) elution, (ii) filtration, (iii) concentration, and (iv) DNA extraction methods was tested. Successful extraction routes were selected based on spiked phage recovery and low levels of bacterial 16S rRNA gene contamination. Natural agricultural soil viromes were then extracted with the optimized methods and shotgun sequenced. Our approach yielded sufficient amounts of inhibitor-free viral DNA for amplification-independent sequencing and low 16S rRNA gene contamination levels (≤ 0.2 ‰). Compared to previously published protocols, bacterial read contamination was decreased by 65 %. 
In addition, 468 novel circularized soil phage genomes of up to 235 kb in size were obtained from over 29,000 manually identified viral contigs, promising the discovery of a large, previously inaccessible viral diversity. Conclusion We have shown a dramatically enhanced extraction of the soil phage community through protocol optimization, which proved robust in both culture-dependent and metaviromic analyses. Our large data set of manually curated soil viral contigs roughly doubles the amount of currently available soil virome data and provides insights into the still largely undescribed soil viral sequence space.


Soil bacteriophages are a vital part of soil bacterial ecology and a major reservoir of genetic material that contributes to biological evolution and diversity. They also influence the flow of nutrients as a potential cause of microbial distribution and mortality [1,2]. Soil is known to harbour a vast abundance of phages (10⁸-10⁹ pfu·g⁻¹), with their numbers exceeding those of co-occurring bacteria by 10- to 1000-fold [3-5]. This undiscovered viral diversity could not only lead to novel findings in phage biology, but could equally promote advances in phage therapy research for the treatment of pathogen-infected humans or agricultural cropland. Despite this ecological and medical importance, the soil virome is poorly studied compared to other ecosystems, likely due to major technical and computational limitations. Currently, 97 % of all viruses are thought to be found in solid matrices such as soil and sediment; conversely, only 1.8 % of all publicly available metavirome data cover those combined sources [6]. It is therefore of high importance to establish reliable methods for comparing changes in viral abundances within and across those samples. Given the physicochemical diversity of soils, the complexity of the soil matrix and its high microbial diversity [2], it is not surprising that no universal phage extraction protocol or standardization of viral elution, concentration and DNA extraction has yet been proposed. Only a few phage extraction protocols for soil samples have been suggested in the literature, and these frequently suffer from a low recovery rate of phages (< 5 %), so that downstream metagenome analyses require DNA amplification [1, 3, 7- [...]

[...] inversion and left to settle overnight at 4 °C. The suspended soil should then be centrifuged as described in previous sections, the supernatant kept aside and the pellet resuspended in another volume of PPBS for a total of three rounds [4,12].

Removal of bacterial contaminants using centrifugation and filtration
After optimizing the elution of bacteriophages from soil particles, we attempted the complete removal of contaminating bacterial cells and thus of bacterial DNA. Several techniques previously described in the literature were assessed: centrifugation steps prior to 0.8 µm [22], 0.45 µm [1,17,23] or 0.22 µm [3,9,10,12,15,24-26] PES filtration (Figure 1, Route 3-6), some coupled with a chloroform treatment [9,12] (Figure 1, Route 1-2). The benefit of using PES as the filter material was already established elsewhere [22], and PES was therefore applied here. Any filtration attempt with a pore size < 16 µm was impaired if the single (Figure 1, Route 1-4) or mixed (Figure 1, Route 5-6) supernatants were not centrifuged thrice at 5,000 x g for 10 minutes. Neither the filtration procedures nor the spiked phage community were measurably impaired when low-speed centrifugation was used to remove impurities. As summarized above, published protocols suggest either a 0.22 µm, 0.45 µm or 0.8 µm filtration of extracted metaviromes to decrease bacterial contamination below a reasonable threshold without simultaneously impairing viral yield or diversity. A reductive effect on soil phage diversity when using a 0.22 µm filter to decrease external contamination has, however, not yet been evaluated. In this optimization protocol, no significant difference in spiked phage recovery was observed when comparing a 0.22 µm to a 0.45 µm filter pore size (unpaired t-test, p-value = 0.7782, n = 9) (Figure 3a). In addition, a 16S rRNA qPCR analysis revealed that both 0.45 µm and 0.22 µm PES filtration removed more than 99.9 % of all bacterial 16S rRNA genes. The 0.22 µm filtration, however, decreased bacterial gene contamination significantly better (unpaired t-test, p-value < 0.0001, n = 6) (Figure 3b). 
A higher recovery of phages in metaviromic samples was recently reported when substituting a 0.22 µm with a 0.45 µm filtration step [23]. Along these lines, the 0.45 µm filtration was nevertheless chosen here for further optimization in order to avoid a potential bias in the native soil viral community. Both routes (0.22 and 0.45 µm filtration) were then selected for shotgun sequencing analysis of soil metaviromes to provide definite answers on bacterial contamination levels and viral diversity in each approach (see below).
Besides centrifugation and filtration, the efficiency of chloroform treatment in removing bacterial contamination was assessed (Figure 1, Route 1-2). Chloroform treatment is generally a rather impractical approach, because both bacteriophages in the environmental sample [9] and downstream concentration devices are sensitive to chloroform. A maximum concentration of 0.8 % chloroform is tolerated when using PES tangential or regular filtration, and this concentration did not reduce bacterial DNA contamination (data not shown). A chloroform treatment prior to phage concentration was therefore excluded from the protocol.

Concentration of viral particles from soil samples prior to DNA extraction
PEG and TFF in combination with ultrafiltration are commonly used techniques to concentrate viral particles from large volumes. These techniques have been described in detail elsewhere [9,10,20,27] and were assessed here in eluted soil samples. An optimal concentration technique should concentrate viral particles without introducing a bias to the native viral community, and should equally reduce the volume of the suspension sufficiently to allow DNA extraction. Spiked phages were eluted from soil samples using the optimized elution protocol, filtered through a 0.45 µm PES filter and subjected to PEG or TFF (Figure 1, Route I-II). 
As shown in Figure 4, both concentration techniques performed equally well in concentrating spiked phages, and no difference in spiked phage yield after concentration was observed. Furthermore, viral suspensions were reduced in both approaches to a final volume of 20-50 ml, which allowed CsCl ultracentrifugation and ultrafiltration. As with filtration, the spike-dependent approach for evaluating the efficiency of these concentration methods did not lead to a clear conclusion. Both routes were therefore selected for shotgun sequencing to elucidate which route proved superior in terms of viral recovery and diversity (see below). CsCl ultracentrifugation concentrates and purifies phages from samples or pure cultures. It is, however, not only technically demanding, but also cost and equipment intensive. The necessity of a CsCl purification for soil samples prior to DNA extraction was evaluated by extracting soil viruses with and without CsCl centrifugation prior to ultrafiltration (Figure 1, Route II and III). Ultrafiltration units were coated with PBS + 1 % BSA to reduce viral adsorption, as suggested and optimized elsewhere [27]. Purification of soil samples with CsCl ultracentrifugation seemed to slightly diminish spiked viral yields (Figure 4a and 4b). However, this loss in PFU could also reflect a loss in infectivity rather than a loss in viral yield caused by the centrifugation conditions [20]. When the CsCl purification step prior to ultrafiltration was avoided (Figure 1, Route III, and Figure 4c), the centrifugal filters clogged during concentration and the remaining volume could not be reduced to less than 5 ml. A purification of soil viral suspensions using CsCl ultracentrifugation was therefore implemented in the optimized protocol, which ensured concentration to a final volume below 300 µL. 
When aiming to isolate infective viral particles from soil samples, however, this purification can easily be omitted.

Phage DNA extraction from soil samples
Ultrafiltered concentrated samples were treated with DNase I to remove free DNA that is not enclosed by viral capsids, and brought up to a final volume of 600 µl. From here, phage DNA was extracted either using modified phenol/chloroform extraction routes (Figure 1, Route A-D) or using the QIAamp Viral RNA Mini Kit (Qiagen) according to the manufacturer's instructions (Figure 1, Route E). DNA extraction using the kit resulted in a > 10-fold lower viral DNA yield than the other proposed DNA extraction routes, and was therefore excluded from the optimized protocol (data not shown). The phenol/chloroform DNA extraction was adapted from Thurber et al. [9], and the necessity of formamide and CTAB in both viral DNA extraction routes was assessed by omitting these steps separately or in combination. DNA extraction routes with larger volumes that originated directly from the TFF concentrated sample (Figure 1, Route III, no CsCl purification) resulted in gelling of the sample after the addition of formamide, or, if formamide was omitted, showed contamination with qPCR inhibitors that even distorted DNA measurements. Extraction routes that included a formamide treatment therefore resulted either in a complete loss of metaviromic DNA through gelling of the sample (Figure 1, Route III) or in a reduced DNA yield from CsCl purified routes (Additional file 3: Table S2). Formamide treatment thus decreased the viral DNA yield irrespective of purification and was therefore excluded. A CTAB/NaCl treatment, on the other hand, did not correlate consistently with a disadvantageous outcome, and its effect was highly dependent on the experimental setup. For the sake of simplicity, a CTAB/NaCl treatment of viral suspensions was hence incorporated in viral DNA extraction (Additional file 3: Table S2). 
When performing the optimized protocol, a sufficient amount of pure viral DNA was extracted from 400-1000 g of soil (Additional file 3: Table S2, Table 1), which allowed direct sequencing without using multiple displacement amplification.

Metagenomic analysis
By using a spiked phage community as reporter, it was not possible to provide definite answers regarding an optimal filter pore size (0.22 µm vs 0.45 µm) or soil phage concentration method (PEG vs TFF). The four optimized phage DNA extraction routes (0.22 µm + TFF, 0.22 µm + PEG, 0.45 µm + TFF and 0.45 µm + PEG) were hence compared for viral richness, diversity and bacterial DNA contamination levels using metagenomic analysis. Viral DNA was extracted from 1 kg of soil for each extraction route and subjected to paired-end Illumina shotgun sequencing with 76 million reads per route. Over 90 % of all raw reads survived the initial data pre-processing, such as trimming and size exclusion, and a total of 311 million reads from all four metaviromes remained. Those reads were assembled into 48,227 contigs (> 5 kb) with average lengths between 11.11 and 13.17 kb (Table 1). As a normalization measure and to exclude a potential direct effect of the number of reads on the assembled contigs, a sub-assembly with 60 million reads for each sequenced extraction route was performed (Additional file 4: Table S3). This [...]
After assembly, the 48,227 remaining contigs were manually inspected and classified as viral if viral hallmark genes such as terminases or structural proteins were present. Equally, those contigs that harboured bacterial genes such as ribosomal sequences were separated from the viral fraction and classified as bacterial. After identification by manual curation, 13,114 contigs were classified as viral and another 13,519 as bacterial, whereas 21,586 remained unclassified due to insufficient annotation (hypothetical proteins or none) (Table 2). Manually classified viral contigs from each metavirome were then pooled and redundant contigs (clustered at > 99 % identity) were removed. 
This initial clustering analysis resulted in 10,886 (74 %) unique and partially complete viral genomes from all four extracted soil metaviromes. On the basis of overlapping ends (more than 10 bp), we could extract 379 novel, circularized phage genomes with lengths of 5.1-235 kb (average 58.9 kb) from this non-redundant viral fraction. In addition, 89 complete potential phage genomes (sizes between 5 and 70.9 kb) were identified in contigs left unclassified (Table 3).
Viral diversity in each extraction route was compared after removing all redundant viral contigs using the information from clustering rounds 1 and 2, which left 29,704 (61.7 %) contigs classified as being of viral, bacterial or unknown origin (Table 3). A subset of 20 million reads from each sequenced metavirome was then separately mapped against this manually curated and trimmed viral community to estimate viral recruitment in each extraction method. In the 0.22 µm filtered metaviromes, 97.5 % of all recovered viruses were present in the TFF route (967 unique viral contigs > 5 kb), whereas 88.9 % were recovered by PEG concentration (219 unique viral contigs > 5 kb) (Figure 6a). Similarly, comparing TFF with PEG concentration in the 0.45 µm filtered metaviromes shows that TFF performed better by recovering a higher percentage of unique soil viruses (Figure 6b). Soil viruses detected exclusively in the TFF route predominantly infected members of the phylum Actinobacteria (43 %), whereas 41 % could not be identified due to a lack of annotation. In contrast, only 2 % of the unique viruses recovered by PEG infected Actinobacteria, and as many as 77 % could not be characterized due to the low number of functional gene predictions or a lack of signature genes. [...] displayed the highest sequence affiliation to viral contigs, recruiting more than 25 % of the reads. 
Recruitment of viral reads decreased to less than 15 % in 0.45 µm extracted metaviromes (Figure 7). Notably, the recruitment rate for the unclassified fraction was considerably higher in 0.22 µm filtered samples.

The study of soil viromes lags far behind that of any other ecological model system, because the heterogeneous soil matrix raises major technical difficulties in the extraction process [2]. Resolving these technical challenges and establishing a standardized extraction protocol is therefore a fundamental prerequisite for replicable results and comparative virome studies. We here report the optimization of protocols for the extraction of bacteriophage DNA from soil preceding metagenomic analysis, such that the protocol can be harnessed equally well for phage isolation. As anticipated, the elution of virus particles from soil samples emerged as the major bottleneck in the present study [1,2], because > 90 % of bacteriophages tend to adsorb to soil particles [19]. Indeed, all elution buffers suggested in the literature [1,3,9,13,18,20] either performed insufficiently in the recovery of bacteriophages from soil (< 5 % of spiked phages) or resulted in major technical limitations due to complete inhibition of downstream filtration procedures. We designed an optimal elution buffer, PPBS, consisting of ionic salt compounds supplemented with 2 % BSA, which disrupts interactions between phages and soil particles by competing for viral binding sites. This finding is consistent with Lasobras et al. [18], who reported that an optimal elution of virus particles from soil requires either a proteinaceous material that competes for viral binding sites or chaotropic agents that alter the favourability of adsorption. The beneficial action observed with PPBS buffer was not enhanced when substituting BSA with beef extract, which supports the favourable effect of BSA in viral elution and validates BSA as equal, if not superior, to a beef extract supplement. 
In summary, the optimized elution protocol described here results in the recovery of virtually all spiked bacteriophages, although no technically difficult or harsh method was applied to maximise viral recovery. As a major advantage, this gentle optimized elution protocol allows the isolation of infective viral particles by omitting techniques that could result in tail breakage or defective particles.
For shotgun sequencing or functional metagenomics, a maximum reduction of bacterial DNA contamination is crucial to allow data analysis. [...] simultaneously did not compromise total viral recovery and diversity from soil samples. Indeed, less viral diversity was observed in metaviromes filtered through a 0.45 µm pore size, which is probably due to the increased bacterial DNA contamination and thus an impaired assembly of viral contigs (Figure 6). This finding is supported by our recruitment analysis, which revealed similar proportions of reads recruited to the manually classified viral and bacterial fractions (14.7 % and 15 %, respectively) in 0.45 µm metaviromes. The vast majority of reads recruited in the 0.22 µm metaviromes, however, matched viral contigs, and < 5 % matched bacterial contigs (Figure 7). The 0.45 µm metaviromes thus suffered from more bacterial contamination, which not only complicates approaches such as functional metagenomics, but also hinders sequencing analysis due to an impaired assembly. Using the proposed optimized protocol for elution and filtration of soil viruses, the 16S rRNA gene contamination could be reduced to a level below the recommended threshold of 0.2 ‰ [21]. This finding is consistent with Castro et al. [20], who observed a considerably lower host DNA contamination when relying on both centrifugation (e.g. thrice at 5,000 x g) and filtration techniques. 
Interestingly, the predominant bacterial contaminant in all sequenced metaviromes was found to be Candidatus Saccharibacteria. Owing to their small size, Saccharibacteria can pass unhindered through even a 0.22 µm filter and are thus concentrated along with the bacteriophages in any protocol. Unfortunately, those bacteria are found not only in soil, but also in many other environmental samples such as sludge and activated sludge from wastewater treatment plants, human saliva and the gut microbiome [29]. Any metavirome extracted from those samples must, therefore, either be manually curated to remove bacterial reads or be handled with great care to reach valid conclusions.
In order to concentrate bacteriophages from large suspension volumes, the most commonly used approaches, TFF and PEG precipitation, were compared here. Independently of the filtration technique applied, TFF performed better in recovering and concentrating soil phages and, therefore, revealed a greater soil viral diversity. After concentration, a purification of soil viral suspensions using CsCl ultracentrifugation was implemented in the optimized protocol before viral DNA extraction to remove inhibitors and allow ultrafiltration concentration to a final volume below 300 µL. As shown here, CsCl ultracentrifugation needs to be applied for purification purposes prior to DNA extraction, but can easily be omitted when aiming for the isolation of infective viral particles. In this study, viral DNA extraction was carried out as described by Thurber et al. [9], except that the protocol was shortened by omitting the formamide treatment. The formamide treatment did not improve DNA extraction regardless of CsCl purification, and resulted either in gelling of the sample (no CsCl purification) or in a reduction of the total DNA yield (CsCl purified sample). 
The optimized DNA extraction protocol therefore omits unnecessary or detrimental steps found in the literature and shortens the protocol by one day.

16S rRNA gene qPCR analysis
The presence of contaminating bacterial DNA was assessed throughout each extraction route (Additional file 1: Figure S1) using Taqman 16S rRNA qPCR. For this, a 16S rRNA gene fragment (934 bp) was cloned into a pGEM-T-Easy-Vector (3,015 bp), and the correct insertion was verified using restriction enzyme digestion. For standard preparation, the plasmid containing the insert was linearized and purified with Gene-Elute PCR Clean-Up Kit (Sigma). DNA concentration was measured with Qubit (ThermoFisher) and copy numbers were calculated. Taqman qPCR was carried out using the SensiFAST Probe no-ROX kit (Bioline), with primers and probes placed in conserved regions of the 16S rRNA gene (amplicon size: 105 bp, Additional file 2: Table S1). All qPCR assays were performed on a Rotorgene 600 (BioLabo, Corbett Research) with the following conditions: 5 minutes at 95 °C for polymerase activation, followed by 40 cycles of 95 °C for 10 sec and 60 °C for 20 sec.
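The copy-number calculation for the plasmid standard can be illustrated with a minimal sketch. The text states only that copy numbers were calculated from the Qubit concentration, so the formula below is the commonly used approximation and not necessarily the authors' exact calculation. The average mass of 650 g/mol per double-stranded base pair, the function names and the example concentration are assumptions; the plasmid length is the sum of the vector (3,015 bp) and insert (934 bp) given above.

```python
AVOGADRO = 6.022e23        # molecules per mole
BP_MASS_G_PER_MOL = 650.0  # assumed average mass of one double-stranded base pair

# Vector (3,015 bp) plus cloned 16S rRNA gene fragment (934 bp), as stated in the text.
PLASMID_LENGTH_BP = 3015 + 934


def plasmid_copies_per_ul(conc_ng_per_ul: float, length_bp: int = PLASMID_LENGTH_BP) -> float:
    """Convert a dsDNA concentration (ng/µL) into plasmid copies per µL."""
    grams_per_ul = conc_ng_per_ul * 1e-9
    grams_per_mol = length_bp * BP_MASS_G_PER_MOL
    return grams_per_ul / grams_per_mol * AVOGADRO


def dilution_series(start_copies: float, steps: int, factor: float = 10.0) -> list[float]:
    """Copy numbers of a serial dilution (hypothetical helper for a standard curve)."""
    return [start_copies / factor**i for i in range(steps)]


if __name__ == "__main__":
    # Hypothetical Qubit reading of 1 ng/µL, for illustration only.
    copies = plasmid_copies_per_ul(1.0)
    print(f"{copies:.3e} copies/µL")
    print(dilution_series(copies, 5))
```

Under these assumptions, 1 ng/µL of the 3,949 bp plasmid corresponds to roughly 2.3 x 10⁸ copies per µL.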

Elution of bacteriophages from soil
The elution optimization strategy is summarized in Figure 1. Soil samples (400 g) were spiked with each phage at 1 x 10⁶ pfu g⁻¹ soil. The spiked sample was suspended 1:1 (w/v) in the respective elution buffer and shaken manually for 10 minutes by repeated inversion. Elution [...] 1, Route 5-6). For this, the soil sample eluted overnight was centrifuged at 10,000 x g [4, 10] for 10 minutes at 4 °C and the first supernatant kept aside. Subsequently, the same soil pellet was resuspended in an equal part of the buffer, put on a shaker for 30 min at 300 rpm and 4 °C, and centrifuged as described above. This was repeated a third time to maximise bacteriophage recovery [4,12]. PFU for each spiked phage were assessed in all supernatants, and the three supernatants were finally pooled.
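Spiked phage recovery across the three elution rounds can be computed as in the following sketch. The spike level (1 x 10⁶ pfu per g) and soil mass (400 g) come from the text; the titers, supernatant volumes and function names are hypothetical values chosen only to illustrate the arithmetic.

```python
def spiked_pfu(soil_g: float, pfu_per_g: float = 1e6) -> float:
    """Total PFU of one phage added to the soil sample."""
    return soil_g * pfu_per_g


def recovery_percent(titers_pfu_per_ml, volumes_ml, soil_g, pfu_per_g=1e6) -> float:
    """Percentage of spiked PFU recovered in the pooled supernatants.

    titers_pfu_per_ml and volumes_ml hold one entry per elution round.
    """
    recovered = sum(t * v for t, v in zip(titers_pfu_per_ml, volumes_ml))
    return recovered / spiked_pfu(soil_g, pfu_per_g) * 100.0


if __name__ == "__main__":
    # Hypothetical plaque assay titers for three supernatants of 400 mL each.
    titers = [6e5, 2e5, 1e5]
    volumes = [400.0, 400.0, 400.0]
    print(f"recovery: {recovery_percent(titers, volumes, soil_g=400):.1f} %")
```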

Removal of bacterial contamination
In order to reduce contaminating bacteria and sediments, the single (Figure 1, Route 1-4) or pooled (Figure 1, Route 5-6) supernatants were centrifuged for three rounds at 5,000 x g [20] for 10 minutes at 4 °C. After each round, the supernatant was recovered into a new, sterile centrifugation tube and the pellet discarded. To remove larger floating particles that were not separated by centrifugation, soil supernatants were pre-filtered using a 16 µm cellulose filter and sterile glassware. The filtrate was then passed through either a 0.45 µm or a 0.22 µm PES filter. Bacterial contamination and recovered bacteriophages were determined using 16S rRNA gene qPCR and plaque assays, respectively. Besides centrifugation and filtration, the efficiency of chloroform treatment in removing bacterial contamination was assessed. To evaluate potential benefits of chloroform while still allowing downstream concentration of viral particles, a final chloroform concentration of 0.8 % was applied (Figure 1, Route 1-2).

Tangential flow filtration
For concentrating soil viral particles, a TFF approach was tested. Briefly, viral suspensions were concentrated using a 100 kDa cut-off PES membrane (Millipore), and the retentate containing the bacteriophages (> 100 kDa) was continuously cycled to maximally reduce its volume. The presence of spiked bacteriophages was quantified in the retentate and their absence confirmed in the permeate. TFF concentrated soil viral suspensions were either purified with CsCl ultracentrifugation or directly subjected to DNA extraction.

Polyethylene glycol precipitation
Concentration of soil viral particles using PEG precipitation was performed as follows. Soil suspensions were mixed thoroughly 2:1 with a 3 X precipitant solution (30 % PEG 6,000 and 3 M NaCl in autoclaved ddH2O) and put in ice-water at 4 °C overnight. Next, the suspensions were centrifuged at 16,000 x g for 1 hour at 4 °C and the pellet resuspended in 7 mL of SM buffer. To free the concentrated viral particles of PEG, samples were dialyzed in 4 litres of SM buffer overnight at RT using a 50 kDa membrane (Biotech CE tubing).
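The final working concentrations implied by mixing the suspension 2:1 with the 3 X precipitant follow from simple dilution arithmetic, sketched below. The function names and the example sample volume are illustrative assumptions; only the stock concentrations and the mixing ratio are taken from the text.

```python
def final_concentration(stock: float, sample_parts: float = 2.0,
                        precipitant_parts: float = 1.0) -> float:
    """Concentration of a precipitant component after mixing with the sample."""
    return stock * precipitant_parts / (sample_parts + precipitant_parts)


def precipitant_volume_ml(sample_ml: float, sample_parts: float = 2.0,
                          precipitant_parts: float = 1.0) -> float:
    """Volume of 3 X precipitant needed for a given sample volume."""
    return sample_ml * precipitant_parts / sample_parts


if __name__ == "__main__":
    print(final_concentration(30.0))     # final PEG 6,000 in % (10 % for a 30 % stock)
    print(final_concentration(3.0))      # final NaCl in M (1 M for a 3 M stock)
    print(precipitant_volume_ml(400.0))  # hypothetical 400 mL of suspension
```

Mixing two parts of sample with one part of 3 X solution thus gives 10 % PEG 6,000 and 1 M NaCl in the final suspension.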

Caesium chloride gradient purification
Concentrated viral particles were purified using ultracentrifugation in a four-layered CsCl gradient. Methods were adapted from the literature [9,20]. Briefly, 10 mL of a concentrated sample was adjusted to a density of 1.15 g mL⁻¹ and loaded on top of a 6 mL step gradient containing 2 mL each of 1.35, 1.5 and 1.7 g mL⁻¹ CsCl. Gradients were centrifuged at 82,000 x g for 2 hours at 10 °C. Bacteriophages in the density fractions between 1.35 and 1.5 g mL⁻¹ were harvested (position visible as a light blue phage band). The collected samples, with a final volume of 2-3 mL per gradient, were dialyzed at 4 °C as described above.

Ultrafiltration concentration
The TFF retentate and dialyzed CsCl fractions were further concentrated using Amicon Ultra centrifugal filters with a 100 kDa cut-off (Millipore). Prior to centrifugation, filters were coated with PBS + 2 % BSA in order to prevent viral adsorption [27]. The concentrated sample was recovered into a sterile Eppendorf tube and the centrifugal filter washed twice with 100 µL of ddH2O. Bacteriophage recovery in the concentrate and bacteriophage absence in the filtrate were confirmed using plaque assays. The total volume was finally brought up to 450 µL for DNA extraction.

Viral DNA extraction
The optimization strategy for viral DNA extraction is summarized in Figure 1 (Route A-E). Phenol/chloroform viral DNA extraction was carried out as described elsewhere [9] with some optimizations. Ultrafiltered concentrated samples were supplemented with 50 µL of 10 x DNase I Buffer (ThermoFisher) and treated with 10 U of DNase I (ThermoFisher) for 2 hours at 37 °C. The enzyme was inactivated using 50 mM EDTA at 65 °C for 10 minutes and the volume brought up to 600 µL. From here, viral DNA was either extracted using modified phenol/chloroform extraction routes (Figure 1 [...] The remaining suspension was again split into two equal volumes. Eppendorf 2, which was not treated with formamide, was equally split (Route C-D). All four Eppendorf tubes were topped up to 567 µL using sterile ddH2O and the DNA was extracted as follows: 30 µL of 10 % SDS and 3 µL of 20 mg/mL Proteinase K were added, mixed, and incubated for 1 h at 55 °C. Subsequently, 80 µL of CTAB/NaCl solution was added to one Eppendorf tube originating from the formamide treatment (Route A) and to one without formamide treatment (Route C), followed by incubation for 10 minutes at 65 °C. Finally, all samples were treated identically in accordance with established protocols [9]: an equal volume of chloroform was added, and the samples were centrifuged for 5 minutes at 8,000 x g at RT. The supernatant of each route was transferred to a separate tube, and an equal volume of first phenol/chloroform/isoamyl alcohol (25:24:1) and subsequently chloroform was added, each followed by centrifugation under the same conditions as above. After the second chloroform treatment, the supernatant was recovered and 0.7 volumes of isopropanol were added to precipitate the DNA overnight at 4 °C. The next day, all samples were centrifuged for 15 minutes at 13,000 x g and 4 °C, and the pellet was washed with 500 µL of ice-cold 70 % ethanol. 
The ethanol was then removed, the pellet air-dried and resuspended in 50 µL of ddH2O overnight. This DNA extraction optimization protocol was applied to larger volumes if the sample was not ultracentrifuged but originated directly from TFF concentration.

Library preparation, Illumina sequencing and annotation
The four best-performing extraction routes, i.e. 0.22 µm + TFF, 0.22 µm + PEG, 0.45 µm + TFF and 0.45 µm + PEG, were selected based on spiked phage recovery and bacterial depletion. Those extraction protocols were then used to extract viral DNA from 1 kg of fresh agricultural soil (ZOFE, see above), which was shotgun sequenced. For this, soil samples were suspended in PPBS, filtered, concentrated, and purified with CsCl ultracentrifugation. Viral DNA was obtained using the optimized DNA extraction protocol, including CTAB but omitting formamide. Libraries were prepared with 25 ng of unamplified viral DNA from each respective route using NEBNext Ultra II DNA Library Prep for Illumina, following the manufacturer's instructions (10 rounds of PCR amplification). Library pooling and normalization were based on the concentration of the final libraries as determined with Tapestation (Agilent 4200). Tagged libraries were sequenced with 76 million paired-end reads (150 bp/read) using NextSeq 500 sequencing. Raw reads were trimmed with Trimmomatic using default settings, and the unshuffled trimmed reads were paired into a single file using shuffleSequences [33,34]. Shuffled sequences from each individual metavirome were assembled with IDBA-UD [35], and contigs larger than 5 kb were extracted for further analysis. Open reading frames (ORFs) on assembled contigs (> 5 kb) were predicted using prodigal [

16S rRNA gene contamination
In order to assess 16S rRNA gene contamination in the four sequenced metaviromes, reads from each sample were trimmed to a minimum length of 50 bp, and a random subset of 20 million trimmed reads was kept for further analysis. Potential 16S rRNA gene reads were retrieved by USEARCH6 [42] against the RDP database [43], previously clustered at 90 % identity [43]. The percentage of 16S rRNA gene reads in each method was then calculated based on hits that were confirmed by ssu-align [44]. Confirmed 16S rRNA gene reads were taxonomically classified using the RDP database, and classifications with a sequence match (S_ab score) higher than 0.8 were kept. In order to compare the efficiency of bacterial DNA removal with that of the protocols optimized here, 16S rRNA gene reads originating from a metavirome extracted from the same soil but using a former standardised protocol published in the literature (LIT) [9,20] were also processed as described above.
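The contamination level derived from these read counts can be expressed in per mille and compared between protocols as in the following sketch. The read counts are hypothetical and serve only to illustrate the calculation; the function names are likewise assumptions. The subset size of 20 million reads is taken from the text.

```python
SUBSET_READS = 20_000_000  # random subset of trimmed reads per metavirome (from the text)


def per_mille(confirmed_16s_reads: int, total_reads: int = SUBSET_READS) -> float:
    """Fraction of confirmed 16S rRNA gene reads, expressed in per mille."""
    return confirmed_16s_reads / total_reads * 1000.0


def relative_reduction_percent(reference_reads: int, optimized_reads: int) -> float:
    """Relative decrease in bacterial reads compared to a reference protocol."""
    return (reference_reads - optimized_reads) / reference_reads * 100.0


if __name__ == "__main__":
    # Hypothetical counts: 4,000 confirmed 16S reads in a 20 million read subset.
    print(f"{per_mille(4000):.2f} per mille")
    # Hypothetical comparison between a reference and an optimized protocol.
    print(f"{relative_reduction_percent(1000, 350):.0f} % reduction")
```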

Classification of contigs and cluster analysis for complete viral genomes
In a first step, annotated contigs (> 5 kb) from the four sequenced metaviromes were manually inspected and classified. Contigs were assigned as viral if phage structural genes such as terminase, portal, capsid or tail proteins were present, or if the majority of taxonomic hits belonged to viruses. Otherwise, contigs with ribosomal proteins, cell division proteins or other bacterial hallmark proteins were classified as bacterial. Contigs with no evident gene indicators, or with proteins of unknown or only hypothetical function, were left unclassified [21]. Manually assigned viral contigs were then pooled and redundant viral sequences were removed. For this, all viral contigs were globally aligned [45] and clustered at > 99 % identity, and only the largest representative contig of each cluster was kept. Phage genomes were then assessed for completeness by searching for overlapping nucleotide sequences (> 10 bp) at the 3′ and 5′ ends.
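The completeness check can be illustrated with a short sketch that looks for an exact match between the start and the end of a contig. This is an assumption-laden simplification, not the tool used in the study: the function name, the exact-match criterion and the `max_len` search limit are all hypothetical, and only the > 10 bp threshold comes from the text.

```python
def terminal_overlap(seq: str, min_len: int = 11, max_len: int = 500) -> int:
    """Return the length of the longest exact overlap between the 5' start
    and the 3' end of a contig, or 0 if none reaches min_len.

    min_len=11 encodes the "> 10 bp" criterion; max_len is an assumed cap
    that keeps the search short on long contigs.
    """
    seq = seq.upper()
    longest_allowed = min(max_len, len(seq) // 2)
    # Try the longest candidate first so the first match is the longest one.
    for k in range(longest_allowed, min_len - 1, -1):
        if seq[:k] == seq[-k:]:
            return k
    return 0


# Hypothetical toy sequences: a 12 bp terminal repeat around a 28 bp core.
repeat = "ATGCCGTTAGCA"
core = "TTTTTGGGGGCCCCCAAAAATTGGCCAA"
circular_like = repeat + core + repeat
linear_like = repeat + core

overlap_circular = terminal_overlap(circular_like)
overlap_linear = terminal_overlap(linear_like)
```

A contig with a non-zero result would be a candidate circular (complete) genome. Real assemblies may call for tolerating mismatches, which this exact-match sketch does not do.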

Viral recruitment comparison in phage extraction routes
Viral diversity in each extraction route was compared by mapping a subset of 20 million reads from each of the four metaviromes to the total extracted soil viral community. This viral community consisted of the manually curated viral contigs that survived the first and a second clustering round. For the second cluster analysis, the remaining viral contigs were clustered again if more than 30 % of a smaller contig was present at > 99 % identity in a larger contig (local alignment) [45]. Reads of each metavirome were then mapped against all viral contigs that had been cleared of redundant sequences, and phage abundance and diversity for each optimized method were analysed. To provide a normalized measure, the number of hits to each phage contig was divided by the length of the contig (in kb) and by the size of the metavirome (size of the database in Gb). This measure is abbreviated as RPKG (reads per kb per Gb) and helps to compare recruitment to differently sized contigs across several metagenomes. A phage was considered present in a given metavirome if its contig was covered by at least 1 RPKG at 98 % identity.
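As a worked illustration of the RPKG normalization, the sketch below computes the value for one contig. The function names and every number are hypothetical. It assumes that hits were already filtered at 98 % identity during mapping, and it approximates the metavirome size as 20 million reads of 150 bp, which ignores the shortening caused by trimming.

```python
def rpkg(hits: int, contig_length_bp: int, metavirome_size_bp: int) -> float:
    """Reads recruited per kb of contig per Gb of metavirome."""
    if contig_length_bp <= 0 or metavirome_size_bp <= 0:
        raise ValueError("lengths and sizes must be positive")
    contig_kb = contig_length_bp / 1e3
    metavirome_gb = metavirome_size_bp / 1e9
    return hits / contig_kb / metavirome_gb


def is_present(rpkg_value: float, threshold: float = 1.0) -> bool:
    """Presence call: coverage of at least `threshold` RPKG."""
    return rpkg_value >= threshold


# Hypothetical example: 600 hits (already filtered at 98 % identity)
# to a 40 kb contig in a 20 million x 150 bp read subset (3 Gb).
subset_size_bp = 20_000_000 * 150
example_rpkg = rpkg(600, 40_000, subset_size_bp)
example_present = is_present(example_rpkg)
```

Here 600 hits / 40 kb / 3 Gb gives 5 RPKG, so this hypothetical phage would count as present in the metavirome.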

Metavirome reads associated with bacterial, viral or unclassified contigs
A subset of 20 million reads of each extraction route was mapped to the manually classified viral, bacterial or unknown contigs. For this, all viral sequences obtained from the extracted metaviromes were concatenated into one viral DNA super-contig. All bacterial or unknown sequences were likewise combined. This concatenation prevents multiple mapping of a single read to a given viral sequence if it is represented several times in the assembled metaviromes. The total percentage of reads recruited to either the phage, bacterial or unclassified

Optimization strategy of phage extraction protocols from soil samples prior to metagenomic analysis. Different phage elution, filtration, concentration and DNA extraction procedures were tested to maximise viral yield and deplete bacterial DNA contaminants. *16S rRNA qPCR to determine external contaminants, † plaque assay to assess spiked bacteriophage recovery.

Additional file 2
Table S1: Primers and probes used for 16S rRNA qPCR (PDF)
Primers and probes designed for TaqMan 16S rRNA gene qPCR.

Additional file 3
Table S2: DNA yield and external contamination with phage DNA extraction methods (PDF)
DNA yield and bacterial DNA contamination obtained from phage DNA extraction of soil.