VIBRANT: Automated recovery, annotation and curation of microbial viruses, and evaluation of virome function from genomic sequences

Background Viruses are central to microbial community structure in all environments. The ability to generate large metagenomic assemblies of mixed microbial and viral sequences provides the opportunity to tease apart complex microbiome dynamics, but these analyses are currently limited by the tools available for analyses of viral genomes and assessing their metabolic impacts on microbiomes. Design Here we present VIBRANT, the first method to utilize a hybrid machine learning and protein similarity approach that is not reliant on sequence features for automated recovery and annotation of viruses, determination of genome quality and completeness, and characterization of virome function from metagenomic assemblies. VIBRANT uses neural networks of protein signatures and a novel v-score metric that circumvents traditional boundaries to maximize identification of lytic viral genomes and integrated proviruses, including highly diverse viruses. VIBRANT highlights viral auxiliary metabolic genes and metabolic pathways, thereby serving as a user-friendly platform for evaluating virome function. VIBRANT was trained and validated on reference virus datasets as well as microbiome and virome data. Results VIBRANT showed superior performance in recovering higher quality viruses and concurrently reduced the false identification of non-viral genome fragments in comparison to other virus identification programs, specifically VirSorter and VirFinder. When applied to 120,834 metagenomically derived viral sequences representing several human and natural environments, VIBRANT recovered an average of 94.5% of the viruses, whereas VirFinder and VirSorter achieved less powerful performance, averaging 48.1% and 56.0%, respectively. Similarly, VIBRANT identified more total viral sequence and proteins when applied to real metagenomes. When compared to PHASTER and Prophage Hunter for the ability to extract integrated provirus regions from host scaffolds, VIBRANT performed comparably and even identified proviruses that the other programs did not. To demonstrate applications of VIBRANT, we studied viromes associated with Crohn’s Disease to show that specific viral groups, namely Enterobacteriales-like viruses, as well as putative dysbiosis associated viral proteins are more abundant compared to healthy individuals, providing a possible viral link to maintenance of diseased states. Conclusions The ability to accurately recover viruses and explore viral impacts on microbial community metabolism will greatly advance our understanding of microbiomes, host-microbe interactions and ecosystem dynamics.


48
Background 49 Viruses are central to microbial community structure in all environments. The ability to generate 50 large metagenomic assemblies of mixed microbial and viral sequences provides the opportunity to 51 tease apart complex microbiome dynamics, but these analyses are currently limited by the tools 52 available for analyses of viral genomes and assessing their metabolic impacts on microbiomes. 53 54 Design 55 Here we present VIBRANT, the first method to utilize a hybrid machine learning and protein 56 similarity approach that is not reliant on sequence features for automated recovery and annotation 57 of viruses, determination of genome quality and completeness, and characterization of virome 58 function from metagenomic assemblies. VIBRANT uses neural networks of protein signatures and 59 a novel v-score metric that circumvents traditional boundaries to maximize identification of lytic 60 viral genomes and integrated proviruses, including highly diverse viruses. VIBRANT highlights 61 viral auxiliary metabolic genes and metabolic pathways, thereby serving as a user-friendly 62 platform for evaluating virome function. VIBRANT was trained and validated on reference virus 63 datasets as well as microbiome and virome data. 64 65 Results

66
VIBRANT showed superior performance in recovering higher quality viruses and concurrently 67 reduced the false identification of non-viral genome fragments in comparison to other virus 68 identification programs, specifically VirSorter and VirFinder. When applied to 120,834 69 metagenomically derived viral sequences representing several human and natural environments, 70 VIBRANT recovered an average of 94.5% of the viruses, whereas VirFinder and VirSorter 71 achieved less powerful performance, averaging 48.1% and 56.0%, respectively. Similarly, 72 VIBRANT identified more total viral sequence and proteins when applied to real metagenomes. 73 When compared to PHASTER and Prophage Hunter for the ability to extract integrated provirus 74 regions from host scaffolds, VIBRANT performed comparably and even identified proviruses that 75 the other programs did not. To demonstrate applications of VIBRANT, we studied viromes 76 associated with Crohn's Disease to show that specific viral groups, namely Enterobacteriales-like 77 viruses, as well as putative dysbiosis associated viral proteins are more abundant compared to 78 healthy individuals, providing a possible viral link to maintenance of diseased states. 79 80 Conclusions 81 The ability to accurately recover viruses and explore viral impacts on microbial community 82 metabolism will greatly advance our understanding of microbiomes, host-microbe interactions and 83 ecosystem dynamics. Viruses that infect bacteria and archaea are incredibly abundant, and outnumber their hosts 93 in most environments (1)(2)(3). Viruses are commonly considered non-living entities and are obligate 94 intracellular pathogenic genetic elements capable of reprogramming host cellular metabolic states 95 during infection. They are also highly active and cause the lysis of 20-40% of microorganisms in 96 diverse environments every day (4, 5). Due to their abundance and widespread activity, viruses are 97 vital to microbial communities as they drive cycling of essential nutrients such as carbon, nitrogen, 98 phosphorus and sulfur (6-10). In human systems, viruses have been implicated in contributing to 99 dysbiosis that can lead to various diseases, such as inflammatory bowel diseases, or even have a 100 symbiotic role with the immune system (11-13). 101 It is estimated that viral diversity exceeds that of living organisms, and therefore harbors 102 enormous potential for diverse genomic content, arrangement and encoded functions (14). 103 Accordingly, there is substantial interest in "mining" viral sequences for novel anti-microbial drug 104 candidates, enzymes for biotechnological applications, and for bioremediation efforts (15)(16)(17)(18)(19). 105 Moreover, viruses have a unique capability to rapidly evolve genes via high mutation rates and act 106 as intermediate carriers to transfer these genes to their hosts and subsequently to the surrounding 107 communities (20-22). 108 Our understanding of the diversity of viruses continues to expand with the discovery of 109 novel viral lineages within the last decade. One of the most striking examples is the 110 characterization of crAssphage, an extraordinarily abundant virus infecting Bacteroides species 111 within the human gut that went unnoticed for years due to its lack of homology with known viral 112 sequences (23). Moreover, the discovery of megaphages, the largest known bacterial viruses 113 infecting the human gut bacteria Prevotella, has pushed the boundaries on the coding capacity of 114 viruses (24, 25). In the oceans, a newly discovered lineage of Vibrio-infecting non-tailed viruses, 115 generally considered unconventional since most known bacterial viruses are tailed, fueled the 116 notion that current viral recovery methods are skewing our understanding of viruses in the 117 environment (26). Taken together, this highlights that estimates of viral diversity are biased 118 towards tailed dsDNA viruses and are likely underrepresenting other families of viruses including 119 those with ssDNA and RNA genomes (27,28). 120 Recently it has been appreciated that viruses may directly link biogeochemical cycling of 121 nutrients by specifically driving metabolic processes. For example, during infection viruses can 122 acquire 40-90% of their required nutrients from the surrounding environment by taking over and 123 subsequently directing host metabolism (29-31). To manipulate host metabolic frameworks some 124 viruses have selectively "stolen" metabolic genes from their host. These host derived genes, 125 collectively termed auxiliary metabolic genes (AMGs), are actively expressed during infection to 126 provide viruses with fitness advantages (32-35). Viruses encoding AMGs have been found to be 127 widespread in human and natural environments and implicated in manipulating several important 128 nutrient cycles including carbon, nitrogen, phosphorus and sulfur (36-40). Identifying these genes 129 and understanding the processes underpinning their function is pivotal for developing 130 comprehensive models of the impacts of microbiomes and nutrient cycling. 131 Due to the difficulty of collecting virus-only samples as well as the need to integrate viruses 132 into models of ecosystem function, it has become of great interest to determine which sequences 133 within microbial communities are derived from viruses. Within the cellular fraction of a sample 134 there can remain a large number of viruses for a variety of reasons. First, these viruses can exist as 135 active intracellular infections, which may be the case for as many as 30% of all bacteria at any 136 given time (41). Second, there may be particle-attached viruses resulting from viruses' inherently 137 "sticky" nature (42). Lastly, many viruses exist as "proviruses", or viral genomes either integrated 138 into that of their host or existing within the host as an episomal sequence. As such, it is crucial for 139 the accurate evaluation of microbial community characteristics, structure and functions to be able 140 to separate these viral sequences. 141 Multiple tools exist for the identification of viral sequences from mixed metagenomic 142 assemblies. For several years VirSorter (43), which succeeded tools such as VIROME (44) and 143 Metavir (45) with the consideration that viruses tend to display distinctive patterns of 8-nucleotide frequencies 161 (otherwise known as 8-mers), which was proposed despite the knowledge that viruses can share 162 remarkably similar nucleotide patterns with their host (49). These 8-mer patterns were used to 163 build a random forest machine learning model to quickly classify sequences as short as 500 bp 164 without the need for gene prediction. VirFinder generates model-derived scores as well as 165 probabilities of prediction accuracy, though it is up to the user to define the cutoffs which can 166 ultimately lead to uncertainties in rates of false identification of viral sequences. VirFinder was 167 shown to greatly improve the ability to recover viral sequences compared to VirSorter, but it also 168 demonstrates substantial host and source environment biases in predicting diverse viruses. For 169 example, VirFinder was able to recover viruses infecting Proteobacteria more readily than those 170 infecting Firmicutes due to reference database-associated biases while training the machine 171 learning model. Additional biases were also identified between different source environments, 172 seen through the under-recovery of viruses from certain environments compared to others (50   Step 1 Step 2 Step 3 Step 4 were not as well conserved in genomic scaffolds in contrast to whole genomes. 286 VIBRANT's workflow consists of four main steps ( Figure 1A) performance. VIBRANT does not consider scaffolds shorter than 1000 bp or those that encode 322 less than four predicted open reading frames in order to maintain a low false positive rate (FPR) 323 and have sufficient annotation information for identifying viruses. Therefore, in comparison of 324 performance metrics only scaffolds meeting VIBRANT's minimum requirements were analyzed. 325 Inclusion of fragments encoding less than four open reading frames in analyses, which are 326 frequently generated by metagenomic assemblies, are discussed below. We used the following 327 calculations to compare performance: recall, precision, accuracy, specificity and F1 score ( Figure  328 2). 329 First, we evaluated the true positive rate (TPR, or recall) of viral genomic fragments as 330 well as whole viral genomes. Viral genomes were acquired from the National Center for 331 Biotechnology Information (NCBI) RefSeq and GenBank databases and split into various non-332 redundant fragments between 3 and 15 kb to simulate genomic scaffolds (Supplementary Table  333 1). VIBRANT correctly identified 98.38% of the 29,926 viral fragments, which was substantially 334 greater than either VirSorter (40.00% and 50.67%) and VirFinder (76.23% and 89.02%). 335 Similar to TPR, we calculated FPR (or specificity) using two different datasets: genomic 336 fragments of bacteria and archaea (hereafter genomic), and bacterial plasmids (plasmid). Plasmids 337 were evaluated separately because they often encode for genes similar to those on viral genomes, 338 such as those for genome replication and mobilization. Genomic and plasmid sequences were 339

Figure 2. Performance comparison of VIBRANT, VirFinder and VirSorter on artificial scaffolds 3kb-15kb.
Performance was evaluated using datasets of reference viruses, bacterial plasmids, and bacterial/archaeal genomes. For VirFinder and VirSorter two different confidence cutoffs were used (VirFinder: score of at least 0.90, and score of at least 0.75. VirSorter: categories 1 and 2 predictions, and categories 1, 2 and 3 predictions). All three programs were compared using the following statistical metrics: F1 score, recall, precision, accuracy and specificity. To ensure equal comparison all scaffolds tested encoded at least four open reading frames. acquired from NCBI RefSeq and GenBank databases and split into various non-redundant 340 fragments between 3 and 15 kb (Supplementary Table 1 Many viruses from the IMG/VR dataset that were identified by VIBRANT were not 373 identified by either VirFinder or VirSorter, indicating that VIBRANT has the propensity for 374 discovery of novel viruses ( Figure 3B). For most environments, the majority of viruses identified 375 by VirFinder were already identified by either VIBRANT or VirSorter. The differences in the 376 overlap of identified viruses was not too distinctive in environments for which many reference 377 viruses are available, such as marine. For more understudied environments, such as plants or 378 wastewater, VIBRANT displayed near-complete overlap with VirFinder and VirSorter predictions 379 in conjunction with identifying over 40% more viruses. 380

Identification of viruses in mixed metagenomes 382
Metagenomes assembled using short read technology contain many scaffolds that do not 383 meet VIBRANT's minimum length requirements and therefore are not considered during analysis. 384 Despite this, VIBRANT's predictions contain more annotation information and greater total viral 385 sequence length than tools built to identify short sequences, such as scaffolds with less than four 386 open reading frames. VIBRANT, VirFinder (score cutoff of 0.90) and VirSorter (categories 1 and 387 2) were used to identify viruses from human gut, freshwater lake and thermophilic compost 388 metagenome sequences (Table 1). In addition, alternate program settings-VIBRANT "virome" 389 mode, VirFinder score cutoff of 0.75 and VirSorter "virome decontamination" mode-were used 390 to identify viruses from an estuary virome dataset. Each metagenomic assembly was limited to 391 sequences of at least 1000bp but no minimum open reading frame limit was set. Mixed community assembled metagenomes from the human gut, thermophilic compost and a freshwater lake, as well as an estuary virome, were used to compare virus prediction ability between the three programs. For each assembly the scaffolds were limited to a minimum length of 1000bp. Only a subset of each dataset contained scaffolds encoding at least four open reading frames. VIBRANT, VirFinder (score minimum of 0.90) and VirSorter (categories 1 and 2) were compared by total viral predictions, total combined length of predicted viruses, and total combined proteins of predicted viruses.
VIBRANT identified 1,309 total viruses at least 10 kb in length in comparison to VirFinder's 479. 404 VIBRANT was also able to outperform VirSorter in all metrics, averaging 2.4 times more virus 405 identifications, nearly 1.7 times more total viral sequence length, and 1.8 times more encoded viral 406 proteins. 407

Figure 4. Prediction of integrated proviruses by VIBRANT, and comparison to PHASTER and Prophage
Hunter. (A) Schematic representing the method used by VIBRANT to identify and extract provirus regions from host scaffolds using annotations. Briefly, v-scores are used to cut scaffolds at host-specific sites and fragments are trimmed to the nearest viral annotation. (B) Comparison of proviral predictions within four complete bacterial genomes between VIBRANT, PHASTER and Prophage Hunter. For PHASTER, putative proviruses are colored according to "incomplete" (red), "questionable" (blue) and "intact" (green) predictions. Prophage Hunter is colored according to "active" (green) and "ambiguous" (blue) predictions. (C) Manual validation of the Bacteroides vulgatus provirus prediction made by VIBRANT. The presence of viral hallmark protein, integrase and genome replication proteins strongly suggests this is an accurate prediction.
VIBRANT's method of predicting viral scaffolds provides a unique opportunity in 408 comparison to similar tools in that it yields scaffolds of higher quality which are more amenable 409 for analyzing protein function in viromes. It is an important distinction that the total number of 410 viruses identified may not be correlated with the total viral sequence identified or the total number 411 of encoded proteins. Even if VIBRANT identified fewer total viral sequences compared to other 412 tools in certain circumstances, more data of higher quality was generated as viral sequences of 413 longer length were identified as compared to many short fragments. This provides an important 414 distinction that the metric of total viral predictions is not necessarily an accurate representation for 415 the quality or quantity of the data generated. 416 417 Integrated provirus prediction 418 In many environments, integrated proviruses can account for a substantial portion of the 419 active viral community (66). Despite this, few tools exist that are capable of identifying both lytic 420 viruses from metagenomic scaffolds as well as proviruses that are integrated into host genomes. 421 To account for this important group of viruses, VIBRANT identifies provirus regions within 422 metagenomic scaffolds or whole genomes. VIBRANT is unique from most provirus prediction 423 tools in that it does not rely on sequence motifs, such as integration sites, and therefore is especially 424 useful for partial metagenomic scaffolds in which neither the provirus nor host region is complete. 425 In addition, this functionality of VIBRANT provides the ability to trim non-viral (i.e., host 426 genome) ends from viral scaffolds. This results in a more correct interpretation of genes that are 427 encoded by the virus and not those that are misidentified as being within the viral genome region. 428 Briefly, VIBRANT identifies proviruses by first identifying and isolating scaffolds and genomes 429 at regions spanning several annotations with low v-scores. These regions were found to be almost 430 exclusive to host genomes. After cutting the original sequence at these regions, a refinement step 431 trims the putative provirus fragment to the first instance of a virus-like annotation to remove 432 leftover host sequence ( Figure 4A). The final scaffold fragment is then analyzed by the neural 433 network similar to non-excised scaffolds. 434 To assess VIBRANT's ability to accurately extract provirus regions we compared its 435 performance to PHASTER and Prophage Hunter, two programs explicitly built for this task. We 436 compared the performance of these programs with VIBRANT on four bacterial genomes. 437 VIBRANT and PHASTER predicted an equal number of proviruses, 17, while Prophage Hunter 438 identified less, 13 ( Figure 4B). Only one putative provirus prediction (Lactococcus lactis putative 439 provirus 6) was shared between PHASTER and Prophage Hunter but not VIBRANT. However, 440 VIBRANT was able to identify two putative provirus regions (Desulfovibrio vulgaris putative 441 provirus 7 and Bacteroides vulgatus putative provirus 1) that neither PHASTER nor Prophage 442 Hunter identified. Manual inspection of the putative Bacteroides vulgatus provirus identified a 443 number of bona fide virus hallmark and virus-like proteins suggesting that it is an accurate 444 prediction ( Figure 4C). Our results suggest VIBRANT has the ability to accurately identify 445 proviruses and, in some cases, can outperform other tools in this task. 446 447 Evaluating quality of viral scaffolds and genomes 448 Determination of quality, in relation to completeness, of a viral scaffold has been 449 notoriously difficult due to the absence of universally conserved viral genes. To date the most 450 reliable metric of completeness for metagenomically assembled viruses is to identify circular 451 sequences (i.e., complete circular genomes). Therefore, the remaining alternatives rely on 452 estimation based on encoded proteins that function in central viral processes: replication of 453 genomes and assembly of new viral particles. 454 VIBRANT estimates the quality of predicted viral scaffolds, a relative proxy for 455 completeness, and indicates scaffolds that are circular. To do this, VIBRANT uses annotation 456 metrics of nucleotide replication and viral hallmark proteins. Hallmark proteins are those typically 457 specific to viruses and those that are required for productive infection, such as structural (e.g., 458 capsid, tail, baseplate), terminase or viral holin/lysin proteins. Nucleotide replication proteins are 459  Table 2). High 462 quality draft represents scaffolds that are likely to contain the majority of a virus's complete 463 genome and will contain annotations that are likely to aid in analysis of the virus, such as 464 phylogenetic relationships and true positive verification. Medium draft quality represents the 465 majority of a complete viral genome but is more likely to be a smaller portion in comparison to 466 high quality. These scaffolds may contain annotations useful for analysis but are under less strict 467 requirements compared to high quality. Finally, low draft quality constitutes scaffolds that were 468 not found to be of high or medium quality. Many metagenomic scaffolds will likely be low quality 469 genome fragments, but this quality category may still contain the higher quality genomes of some 470 highly divergent viruses. 471 We benchmarked VIBRANT's viral genome quality estimation using a total of 2466 472 Caudovirales genomes from NCBI RefSeq database. Genomes were evaluated either as complete 473 sequences or by removing 10% of the sequence at a time stepwise between 100% and 10% 474 completeness ( Figure 5B). The results of VIBRANT's quality analysis displayed a linear trend in 475 indicating more complete genomes as high quality and less complete genomes as lower quality. 476 The transition from categorizing genomes as high quality to medium quality ranged from 60% and 477 70% completeness. Although we acknowledge that VIBRANT's metrics are not perfect, we 478 demonstrate the first benchmarked approach to quantify and characterize genome quality 479 associated with completeness of viral scaffolds. Manual inspection and visual verification of viral 480 genomes that were characterized into each of these genome quality categories showed that quality 481 estimations matched annotations ( Figure 5C). 482 483 Identifying function in virome: metabolic analysis 484 Viruses are a dynamic and key facet in the metabolic networks of microbial communities 485 and can reprogram the landscape of host metabolism during infection. This can often be achieved 486 by modulating host metabolic networks through expression of AMGs encoded on viral genomes. 487 Identifying these AMGs and their associated role in the function of communities is imperative for 488 understanding complex microbiome dynamics, or in some cases can be used to predict virus-host 489 relationships. VIBRANT is optimized for the evaluation of function in viromes by identifying and 490 classifying the metabolic capabilities of the viral community. To do this, VIBRANT identifies 491 AMGs and assigns them into specific metabolic pathways and broader categories as designated by 492 KEGG annotations. 493 To highlight the utility of this feature we compared the metabolic function of viruses 494 derived from several diverse environments: freshwater, marine, soil, human-associated and city 495 (Supplementary Figure 1) and 15 human gut metagenomes ( Figure 6). As anticipated, based on IMG/VR environment 539 comparisons, the metabolic capabilities between the two environments were different even though 540 the number of unique AMGs was relatively equal (138 for hydrothermal vents and 151 for human 541 gut). The pattern displayed by metabolic categories for each metagenome was similar to that 542 displayed by marine and human viromes. For hydrothermal vents the dominant AMGs were part 543 of carbohydrate, amino acid and cofactor/vitamin metabolism, whereas human gut AMGs were 544 mostly components of amino acid and, to some extent, cofactor/vitamin metabolism. Although the 545 observed AMGs and metabolic pathways were overall different, about a third (50 total AMGs) of 546 all AMGs from each environment were shared; between these metagenomes alone all 14 globally 547 conserved AMGs were present. 548 Observations of individual AMGs provided insights into how viruses interact within 549 different environments. For example, tryptophan 7-halogenase (prnA) was identified in high 550 abundance (45 total AMGs) within hydrothermal vent metagenomes but was absent from the 551 Figure 6. Comparison of AMG metabolic categories between hydrothermal vents and human gut. The Venn diagram depicts the unique and shared nonredundant AMGs between 6 hydrothermal vent and 15 human gut metagenomes. The graphs depict the differential abundance of KEGG metabolic categories of respective AMGs for hydrothermal vents (top) and human gut (bottom). human gut. Verification using GOV2 (Global Ocean Viromes 2.0) (67) and Human Gut Virome 552 databases supported our finding that prnA appears to be constrained to aquatic environments, 553 which is further supported by the gene's presence on several marine cyanophages. PrnA catalyzes 554 the initial reaction for the formation of pyrrolnitrin, a strong antifungal antibiotic. though Microviridae and a likely complete crAssphage were also identified. A significant 582 proportion of all Crohn's-associated viruses (250/721), and the majority of genus-level clustered 583 viruses (42/76), were found to be integrated sequences within a microbial genomic scaffold but 584 were able to be identified due to VIBRANT's ability to excise proviruses. 585 We also generated a protein sharing network containing all 721 Crohn's and 950 healthy-586 associated viruses, which corresponded to taxonomic and host relatedness ( Figure 7A). This 587 protein network identified two different clustering patterns: [1] overlapping Crohn's and healthy-588 associated viral populations clustered with Firmicutes-like viruses which may be indicative of a 589 stable gut virome; [2] Crohn's-associated viruses clustered with Enterobacterales-like and 590 Fusobacterium-like viruses which may be indicative of a state of dysbiosis. The presence of a 591 greater diversity and abundance of Enterobacterales and Fusobacteria has previously been linked 592 to Crohn's Disease (70, 71), and therefore the presence of viruses infecting these bacteria may 593 provide similar information. 594 VIBRANT provides annotation information for all of the identified viruses which can be 595 used to infer functional characteristics in conjunction with host association. Comparison of 596 Crohn's-associated Lambda-like virus genomic content and arrangement suggested a possible role 597 of virally encoded host-persistence and virulence genes that are absent in the healthy-associated 598 virome ( Figure 7B). Among all Crohn's-associated viruses, 17 total genes (bor, dicB, dicC, hokC, 599 kilR, pagC, ydaS, ydaT, yfdN, yfdP, yfdQ, yfdR, yfdS, yfdT, ymfL, ymfM and tonB) that have the 600 potential to impact host survival or virulence were identified. Importantly, no healthy-associated 601 viruses encoded such genes ( Table 2). The presence of these putative dysbiosis-associated genes 602  To characterize the distribution and association of DAGs with Crohn's Disease, we 610 calculated differential abundance for two DAG-encoding viruses across all metagenome samples. 611 The first virus encoded pagC and yfdN, and the second encoded dicB, dicC and hokC. Comparison 612 of Crohn's Disease to healthy metagenomes indicates these viruses are present within the gut 613 metagenomes of multiple individuals but more abundant in association with Crohn's Disease 614 ( Figure 8A). This suggests an association of disease with not only putative DAGs, but also specific, 615 and potentially persistent, viral groups that encode them. In order to correlate increased abundance 616 with biological activity we calculated the index of replication (iRep) for each of the two viruses 617 (78). Briefly, iRep is a function of differential read coverage which is able to provide an estimate 618 of active genome replication. Seven metagenomes containing the greatest abundance for each virus 619 were selected for iRep analysis and indicated that each virus was likely active at the time of 620 collection ( Figure 8B). 621 To validate these aforementioned findings, we applied VIBRANT to two additional  Viruses that infect bacteria and archaea are key components in the structure, dynamics, and 638 interactions of microbial communities. Tools that are capable of efficient recovery of these viral 639 genomes from mixed metagenomic samples are likely to be fundamental to the growing 640 applications of metagenomic sequencing and analyses. Importantly, such tools would need to 641 reduce bias associated with specific viral groups (e.g., Caudovirales) and highly represented 642 environments (e.g., marine). Moreover, viruses that exist as integrated proviruses within host 643 genomes should not be ignored as they can represent a substantial fraction of infections in certain 644 conditions and also persistent infections within a community. 645 Here we have presented VIBRANT, a novel method for the automated recovery of both 646 free and integrated viral genomes from metagenomes that hybridizes neural network machine 647 learning and protein signatures. VIBRANT utilizes metrics of non-reference based protein 648 similarity annotation from KEGG, Pfam and VOG databases in conjunction with a novel 'v-score' 649 Table 2. Identification of putative DAGs encoded by Crohn's-associated viruses. The differential abundance between Crohn's Disease and healthy metagenomes of 17 putative DAGs. Abundance of each gene represents non-redundant annotations from Crohn's-associated and healthy-associated viruses. metric to recover viruses with little to no biases. VIBRANT was built with the consideration of 650 the human guided intuition used to manually inspect metagenomic scaffolds for viral genomes and 651 packages these ideas into an automated software. This platform originates from the notion that 652 proteins generally considered as non-viral, such as ribosomal proteins (81), may be decidedly 653 common amongst viruses and should be considered accordingly when viewing annotations. V-654

ID
scores are meant to provide a quantitative metric for the level of virus-association for each 655 annotation used by VIBRANT, especially for Pfam and KEGG HMMs. That is, v-scores provide 656 a means for both highlighting common or hallmark viral proteins as well as differentiating viral 657 from non-viral annotations. In addition, v-scores give a quantifiable value to viral hallmark genes 658 instead of categorizing them in a binary fashion. 659 VIBRANT was not only built for the recovery of viral genomes, but also to act as a platform 660 for investigating the function of a virome. VIBRANT supports the analysis of virome function by 661 assembling useful annotation data and categorizing the metabolic pathways of viral AMGs. Using 662 annotation signatures, VIBRANT furthermore is capable of estimating genome quality and 663 distinguishing between lytic and lysogenic viruses. To our knowledge, VIBRANT is the first 664 software that integrates virus identification, annotation and estimation of genome completeness 665 into a stand-alone program. 666 Benchmarking and validation of VIBRANT indicated improved performance compared to 667 VirSorter and VirFinder, two commonly used programs for identifying viruses from metagenomes. 668 This included a substantial increase in the relationship between true virus identifications (recall, 669 true positive rate) and false non-virus identifications (specificity, false positive rate). That is, 670 VIBRANT recovered more viruses with no discernable expense to false identifications. The result 671 was that VIBRANT was able to recover an average of 2.4 and 1.7 more viral sequence from real 672 metagenomes than VirFinder and VirSorter, respectively. When tested on metagenome-assembled 673 viral genomes from IMG/VR representing diverse environments VIBRANT was found to have no 674 perceivable environment bias towards identifying viruses. In comparison to provirus prediction 675 tools, specifically PHASTER and Prophage Hunter, VIBRANT was shown to be proficient in 676 identifying viral regions within bacterial genomes. This included the identification of a putative 677 Bacteroides provirus that the other two programs were unable to identify. The importance of 678 integrated provirus prediction was underscored in the analysis of Crohn's Disease metagenomes 679 since it was found that a significant proportion of disease related viruses were temperate viruses 680 existing as host-integrated genomes. 681 VIBRANT's method allows for the distinction between scaffold size and coding capacity 682 in designating the minimum length of virus identifications. Traditionally, a cutoff of 5000 bp has 683 been used to filter for scaffolds of a sufficient length for analysis. This is under the presumption 684 that a longer sequence will be likely to encode more proteins. For example, this cutoff has been 685 adopted by IMG/VR. However, we suggest a total protein cutoff of four open reading frames rather 686 than sequence length cutoff to be more suitable for comprehensive characterization of the viral 687 community. VIBRANT's method works as a strict function of total encoded proteins and is 688 completely agnostic to sequence length for analysis. Therefore, the boundary of minimum encoded 689 proteins will support a more guided cutoff for quality control of virus identifications. For example, 690 increasing the minimum sequence length to 5000 bp will have no effect on accuracy or ability to 691 recall viruses since VIBRANT will only be considerate of the minimum total proteins, which is 692 set to four. The result will be the loss of all 1000 bp to 4999 bp viruses that still encode at least 693 four proteins. To visualize this distinction, we applied VIBRANT with various length cutoffs to 694 the previously used estuary virome (see Table 1). Input sequences were stepwise limited from 695 1000 bp to 10000 bp (1000 bp steps) or four open reading frames to 13 open reading frames (one  696 open reading frame steps) in length. Limiting to open reading frames indicated a reduced drop-off 697 in total virus identifications and total viral sequence compared to a minimum sequence length limit 698 (Supplementary Figure 3). 699 The output data generated by VIBRANT-protein/gene annotation information, 700 protein/gene sequences, HMM scores and e-values, viral sequences in FASTA and GenBank 701 format, indication of AMGs, genome quality, etc.-provides a platform for easily replicated 702 pipeline analyses. Application of VIBRANT to characterize the function of Crohn's-associated 703 viruses emphasizes this utility. VIBRANT was not only able to identify a substantial number of 704 viral genomes, but also provided meaningful information regarding putative DAGs, viral 705 sequences for differential abundance calculation and genome alignment, viral proteins for 706 clustering, and AMGs for metabolic comparisons. 707 708 709

711
Our construction of the VIBRANT platform expands the current potential for virus 712 identification and characterization from metagenomic and genomic sequences. When compared to 713 two widely used software programs, VirFinder and VirSorter, we show that VIBRANT improves 714 total viral sequence and protein recovery from diverse human and natural environments. As 715 sequencing technologies improve and metagenomic datasets contain longer sequences VIBRANT 716 will continue to outcompete programs built for short scaffolds (e.g., 500-3000 bp) by identifying 717 more higher quality genomes. Our workflow, through the annotation of viral genomes, aids in the 718 capacity to discover how viruses of bacteria and archaea may shape an environment, such as 719 driving specific metabolism during infection or dysbiosis in the human gut. Furthermore, 720 VIBRANT is the first virus identification software to incorporate annotation information into the 721 curation of predictions, estimation of genome quality and infection mechanism (i.e., lytic vs 722 lysogenic). We anticipate that the incorporation of VIBRANT into microbiome analyses will 723 provide easy interpretation of viral data, enabled by VIBRANT's comprehensive functional 724 analysis platform and visualization of information. 725 726 727

729
Dataset for generation and comparison of metrics 730 To generate training and testing datasets sequences representing bacteria, archaea, 731 plasmids and viruses were downloaded from NCBI databases (accessed July 2019) 732 (Supplementary Table 3 Two additional databases consisting of redundant Pfam HMM profiles were also generated. 785 The first database consisted of virus annotations which were determined by a text search of 786 "bacteriophage" to the Pfam database. Only HMM profiles with v-scores above zero were 787 considered and those common to bacteria/archaea (e.g., glutaredoxin) were manually removed. 788 This resulted in 894 virus specific HMMs. The second database consisted of common plasmid 789 annotations. Proteins were predicted for the plasmid dataset using Prodigal (-p meta) and all Pfam 790 HMMs with a v-score of zero were used to annotate the plasmid proteins (e-value < 1e-5). Any 791 annotation with at least 50 hits was retained as a common plasmid HMM profile, which resulted 792 in 202 common plasmid HMMs. 793 794 Non-neural network steps and assembly of annotation metrics 795 VIBRANT utilizes several manually curated cutoffs in order to remove the bulk of non-796 virus input scaffolds before the neural network classifier is implemented. These steps will result 797 in the assembly of 27 annotation metrics that are used by the neural network classifier for virus 798 identification, which is followed by additional manually set cutoffs to curate the results. 799 First, open reading frames predicted by Prodigal (-p meta) or user input proteins are used 800 to calculate the fraction of strand switching per scaffold (strand switches divided by total genes). 801 Scaffolds are then classified as having either a low (5%), medium (5-35%) or high (>35%) level 802 of strand switching. Scaffolds with a high level are annotated with the 894 virus-specific Pfam 803 HMMs and only retained if there is at least one significant hit (score > 50). Throughout, scaffolds 804 that are not retained are eliminated from further analysis. Scaffolds with a medium-level, and those 805 with a high-level that passed the previous cutoff, are annotated with the 202 common plasmid 806 Pfam HMMs and only retained if there are three or less significant hits (score > 50). Scaffolds with 807 a low level are combined with those from high/medium that passed the previous cutoff(s). 808 Scaffolds are then annotated with the 10,033 KEGG-derived HMMs. Putative integrated 809 provirus regions are extracted at this step by using sliding windows of either four or nine proteins 810 at a time (step size = 1 protein). Within these windows scaffolds are fragmented according to v-811 scores and total KEGG annotations. Within the 4-protein window, scaffolds can be cut if [1] there 812 are 0-1 unannotated proteins, 3-4 proteins with a v-score of 0-0.02 and a combined v-score of less 813 than 0.06, or [2] three consecutive proteins with a v-score of 0 (considered as a 3-protein window). 814 Scaffolds will also be cut using a 9-protein window if nine consecutive proteins are annotated. 815 Finally, if the final two proteins on a scaffold each have a v-score of 0, the scaffold will be cut. 816 Only scaffold fragments that contain at least 8 proteins are retained. Following provirus excision, 817 several manual cutoffs are used to remove obvious non-viral scaffolds. Briefly, this is done by 818 removing scaffolds with a high density of KEGG annotations (e.g., over 70% if less than 15 819 proteins or over 50% if greater than 15 proteins) or a high number of annotations with a v-score 820 of 0 (e.g., over 15). V-scores are also used such that a scaffold that may be removed for having a 821 high density of KEGG annotations will be retained if the v-score meets a specific threshold (e.g., 822 average of 0.2 in scaffold sizes all metrics were normalized (i.e., divided by) to the total number of proteins 870 encoded by the scaffold. The first metric, for total proteins, was normalized to log base 10 of itself. 871 Each metric was weighted equally, though it is worth noting that the removal of several metrics, 872 mainly metrics 8-18, did not significantly impact the accuracy of model's prediction. The 873 normalized results were randomized and non-redundant portions of these results were taken for 874 training or testing the neural network. It is important to note that the testing set here was not used 875 as the comprehensive testing set for the entire workflow. In total, 93,913 fragments were used for 876 training and 9,000 were used for testing the neural network (Supplementary Tables 11 and 12). 877 To comprehensively test the performance of VIBRANT in its entirety a new testing dataset 878 was generated consisting of fragments from the neural network testing set as well as additional 879 fragments non-redundant to the previous training dataset. This  VirSorter was ran using the "Virome" database. For VirFinder, the intervals were [1] scores greater 886 than or equal to 0.90 (approximately equivalent to a p-value of 0.013), and [2] scores greater than 887 or equal to 0.75 (approximately equivalent to a p-value of 0.037). All equations used can be found 888 in Supplementary Table 13 and results used for the generation of Figure 1 can be found in 889 Supplementary Table 14.  890  891 AMG identification 892 KEGG annotations were used to classify potential AMGs (Supplementary Table 15). 893 KEGG annotations falling under the "metabolic pathways" category as well as "sulfur relay 894 system" were considered. Manual inspection was used to remove non-AMG annotations, such as 895 nrdAB and thyAX. Other annotations not considered dealt with direct nucleotide to nucleotide 896 conversions. All AMGs were associated with a KEGG metabolic pathway map. 897 898 Completeness estimation 899 Scaffold completeness is determined based on four metrics: circularization of scaffold 900 sequence, VOG annotations, total VOG nucleotide replication proteins and total VOG viral 901 hallmark proteins (Supplementary Table 16). In order to be considered a complete genome a 902 sequence must be identified as likely circular. A kmer-based approach is used to do this. 903 Specifically, the first 20 nucleotides are compared to 20-mer sliding windows within the last 900bp 904 of the sequence. If a complete match is identified the sequence is considered a circular template. 905 Scaffolds can also be considered a low, medium or high quality draft. To benchmark completeness, 906 NCBI RefSeq viruses identified as Caudovirales, limited to 10 kb in length, were used to estimate 907 completeness by stepwise removing 10% viral sequence at a time (Supplementary Table 2