New viral biogeochemical roles revealed through metagenomic analysis of Lake Baikal

Lake Baikal is the largest body of liquid freshwater on Earth. Previous studies have described the microbial composition of this habitat, but the viral communities from this ecosystem have not been characterized in detail. Here, we describe the viral diversity of this habitat across depth and seasonal gradients. We discovered 19,475 bona fide viral sequences, which are derived from viruses predicted to infect abundant and ecologically important taxa that reside in Lake Baikal, such as Nitrospirota, Methylophilaceae, and Crenarchaeota. Diversity analysis revealed significant changes in viral community composition between epipelagic and bathypelagic zones. Analysis of the gene content of individual viral populations allowed us to describe one of the first bacteriophages that infect Nitrospirota, and their extensive repertoire of auxiliary metabolic genes that might enhance carbon fixation through the reductive TCA cycle. We also described bacteriophages of methylotrophic bacteria with the potential to enhance methanol oxidation and the S-adenosyl-L-methionine cycle. These findings unraveled new ways by which viruses influence the carbon cycle in freshwater ecosystems, namely, by using auxiliary metabolic genes that act upon metabolisms of dark carbon fixation and methylotrophy. Therefore, our results shed light on the processes through which viruses can impact biogeochemical cycles of major ecological relevance. B3i_F4qvBy9xmcnWBwjShq Video Abstract Video Abstract


Background
Lake Baikal is the largest and deepest lake on Earth [1,2]. Its uniqueness also lies in its extreme oligotrophy, icecovered periods of up to 4-4.5 months per year, and an oxic water column throughout all depths [3]. The lake is permanently mixed and only undergoes stratification for a brief period of time during summer in its first 100 m [4].
The surface of Lake Baikal freezes during winter, so that below the ice layer water temperatures approach 0°C while towards deeper waters temperature raises slightly to a maximum of 4°C. In spring, the ice layer melts, and surface water temperature raises to nearly 12°C, only to decrease rapidly towards deeper waters (below 50 m) to the same 4°C that are kept all year around for the deep water mass. Recent metagenomic studies have analyzed the microbiome of sub-ice epipelagic and bathypelagic waters, revealing the key microbes that dwell at this ecosystem as well as the ecological processes in which they are involved [5,6]. These studies have shed light on the taxonomic composition of the Lake Baikal microbiome and the contributions of these microbes to biogeochemical cycles. Nevertheless, one important component of this ecosystem has not yet been characterized in detail: the viruses.
Viruses play key roles in the functioning of aquatic ecosystems [7,8]. They mediate recycling of organic matter in these habitats by lysing host cells, which leads to the daily release of billions of tons of organic carbon [9]. Yet, the influence of viruses over aquatic microbiomes is not limited to killing. They can also modify host metabolism during infection through the expression of auxiliary metabolic genes (AMGs), which redirect host metabolism towards pathways that promote the production of viral particles, such as the nucleotide metabolism, which yields the building blocks necessary to synthesize the genomes of the viral progeny [10,11]. There are multiple mechanisms by which viruses make use of AMGs to re-direct host metabolism. Among the noteworthy examples of AMGs are included genes of photosynthesis and carbon fixation in viruses of Cyanobacteria [12], genes for sulfur oxidation in viruses of Proteobacteria [13], and genes for carbon metabolism, phosphorus metabolism, and protein synthesis spread across multiple host taxa [14][15][16]. The prevalence and diversity of AMGs vary across ecosystems [17,18]; hence, the repertoire of molecular functions encoded by AMGs is expected to change in response to the metabolic constraints faced by their hosts across different ecosystems as well.
Extensive research has been conducted on the biodiversity, ecology, and AMGs of viruses from marine ecosystems [15,19]. Meanwhile, viruses from freshwater ecosystems have received much less attention. Consequently, little is known about the environmental drivers of community composition, biodiversity, and auxiliary metabolic genes in these ecosystems. The use of metagenomics has made it possible to describe viruses that infect the most representative freshwater microbes, such as acI Actinobacteria [20,21] and SAR11 [22,23]. Studies focused on Czech reservoirs and Lake Biwa (Japan) have reported on the extensive effects of stratification on freshwater viral communities, host prevalence, and on their key roles as recyclers of organic matter [21,24]. Specifically, these studies found that in these stratified lakes, there is constantly shifting viral community in the epilimnion and a more stable community that dwells in the hypolimnion. Meanwhile, in Lake Baikal, recent studies have described a phage putatively infecting Polynucleobacter sp. [5], the virome associated with diseased sponges [25], and viromes from the epipelagic zone which suggested that the Baikal virome undergoes changes in composition across seasons [26].
Here, we sought to perform an in-depth characterization of the viral communities from Lake Baikal. Our sampling strategy included retrieving samples at the epipelagic (photic), mesopelagic (aphotic), and bathypelagic (aphotic) zones during winter and summer. In total, ten cellular metagenomes were obtained from which viral sequences were identified. We used computational methods to assign putative hosts to viral sequences and to classify them taxonomically. With that information, we investigated shifts in community composition regarding taxonomic affiliation and target hosts that were driven by depth and season. Next, we described in detail the gene content of novel viruses infecting some of the most abundant members of the Lake Baikal microbiome and their repertoire of auxiliary metabolic genes that include key metabolic processes that have not been described before in other viruses.

Results and discussion
Depth variations of archaeal and bacterial communities in Lake Baikal We have analyzed a total of ten metagenomes sequenced from different habitats of Lake Baikal. The four datasets that represented winter microbiomes from sub-ice samples have previously been described in detail [5,6]. We expanded this set of samples by generating a new set of six metagenomes obtained from similar geographical coordinates but collected during summer. These included two epipelagic samples (photic, 5 m and 20 m), two mesopelagic samples (aphotic, 300 m and 390 m) at a site near methane seep that is notorious for high methane concentrations [27], and two bathypelagic (aphotic, 1250 m and 1350 m) samples.
We sought to describe the depth variations of prokaryotic community composition of Lake Baikal based on taxonomic profiles derived from metagenomes from winter and summer. Thus, read mapping was used to calculate the relative abundances of different phyla of Bacteria and Archaea across samples taken in winter (sub-ice) and summer water columns (Fig. 1). When reporting our results, we used taxon names defined by GTDB [28]. For those clades that have undergone major changes in their phylum and class level classifications, the necessary equivalence information was provided the first time the taxon is mentioned in the text. As previously described, a clear distinction was observed between photic and aphotic samples [6]. At all times, epipelagic samples tended to have higher abundances of Verrucomicrobiota, Actinobacteriota, Bacteroidota, and Cyanobacteria than their mesopelagic and bathypelagic counterparts. Acidobacteriota and Patescibacteria (formerly candidate phyla radiation) only displayed expressive abundances in the samples from the aphotic zone. Also, the abundances of Nitrospirota, Alphaproteobacteria, and Crenarchaeota (which encompasses the taxa formerly classified as Crenarchaeota, Thaumarchaeota, and Bathyarchaeota) increased towards the aphotic zone samples. Changes in energy availability brought by differences in light and temperature are major drivers of microbial community composition in aquatic habitats [29]. The stable water temperatures in aphotic samples across seasons are likely responsible for the comparable community composition among these samples. Likewise, the more prominent change in temperature seen among photic zone samples, together with the stratification, was likely the driving factors behind the changes in prokaryotic community composition observed between seasons.

Taxonomic classification and predicted hosts of Baikal viruses
The assembled scaffolds from ten Baikal metagenomes were analyzed with VirSorter [30] and VirFinder [31] and queried against the pVOGs database [32] to identify putative viral sequences. These putative viruses were subjected to manual curation after which a total of 19, 475 sequences were classified as bona fide viruses (Table  S1). Since these viral sequences were retrieved from metagenomes of the cellular fraction (as opposed to viromes), they are likely derived from viruses that were actively replicating at the time of sampling. The bona fide viral sequences were clustered into 9916 viral populations on the basis of 95% average nucleotide identity and 80% shared genes within each population [19]. Family level taxonomic assignments were achieved for 12,689 viral sequences. Most of them were classified into the families Myoviridae (7155), Siphoviridae (3138), Phycodnaviridae (1195), and Podoviridae (809). The presence of viruses of eukaryotes in our dataset derives from the fact that samples were not pre-filtered to remove eukaryotic cells. Computational host prediction followed by manual curation allowed putative hosts at the taxonomic level of domain to be assigned to 2870 viral sequences. These predictions suggested that the majority of these sequences belonged to viruses that infect Bacteria (2135), but viruses that infect Archaea (29), Eukaryotes (621), and even virophages (85) were also identified. Among those assigned as viruses of bacteria (i.e., bacteriophages), the majority of sequences were predicted to infect Actinobacteriota (640), followed by Gammaproteobacteria (235), Bacteroidota (241) and Cyanobacteria (226), and Alphaproteobacteria (106). Although less frequent, some sequences were predicted to be derived from viruses that infect taxa with few or no isolated viruses such as Nitrospirota (9), Patescibacteria (14), and Crenarchaeota (23).

Environmental drivers of viral community composition at Lake Baikal
We performed read recruitment from the 10 metagenomes to calculate relative abundances of viral sequences across samples. The resulting abundance matrix was used to investigate patterns of viral community composition in the Baikal ecosystem (Fig. 2). These results pointed to a clear distinction between photic and aphotic samples regarding their viral community composition Fig. 1 Lake Baikal prokaryotic community composition. Bar plots depict the relative abundances of taxa of Archaea and Bacteria at the level of phylum (or class in the case of Proteobacteria) across the ten metagenomes from Lake Baikal. Only taxa that displayed relative abundances equal or above 1% are shown (Fig. 2a). Among photic samples, a separation was observed between summer and winter samples which was mostly driven by viruses with high abundance among winter samples that displayed lower (sometimes below detection limit) abundances among the summer samples. No clear clustering of samples by season was observed among bathypelagic samples. All samples displayed comparable Shannon (8.0-9.1) and Simpson (0.9992-0.9996) diversity indexes, suggesting that despite the changes that take place in community composition across depth and seasons, the level of diversity within the communities remains stable.
Non-metric multidimensional scaling also pointed to a clear distinction between samples from the photic and aphotic zones which were separated by NMDS1 (Fig. 2b). However, no clear separation of samples by season was observed by NMDS1 or NMDS2. Next we analyzed each individual scaffold by comparing abundances in the photic versus aphotic samples (from the same season), and also by comparing abundances in the summer versus winter samples (from the same depth). This result revealed the specific enrichment/depletion patterns of each viral sequence across the seasonal and bathymetric gradients (Fig. 2c). Specifically, we observed distinctive clouds of Fig. 2 Lake Baikal viral community composition. a Heatmap depicting the Z-score transformed abundances of 19,475 bona fide viral sequences across ten metagenomes from Lake Baikal. Both samples (columns) and viral sequences (rows) were subjected to hierarchical clustering based on Bray-Curtis dissimilarity distances. Side row colors indicate the sample from which each viral sequence was assembled. b Non-metric multidimensional scaling comparison of the abundance of viral sequences across 10 Baikal metagenomes based on Bray-Curtis dissimilarity distances. c Scatter plots depicting the abundances of each viral scaffold paired by depth and season. In the left panel, the relative abundances of sequences in the photic samples are displayed in the X axis while the abundances in the aphotic samples are displayed in the Y axis. Samples were paired as follows: 5 m winter × 1250 m winter; 5 m summer × 1250 m summer; 20 m winter × 1350 m winter; 20 m summer × 1350 m summer. In the right panel, the relative abundances of sequences in the winter samples are displayed in the X axis while the abundances in the summer samples are displayed in the Y axis. Samples were paired as follows: 5 m winter × 5 m summer; 20 m winter × 20 m summer; 1250 m winter × 1250 m summer, 1350 m winter × 1350 m summer viral sequences separating the photic from aphotic samples in the depth comparison and the absence of a cloud separating winter from summer samples in the season comparison. This suggests that viruses specific of a given depth zone are much more frequent than viruses of a specific season.
Given these observations, we next investigated how community composition changed according to the source, taxonomic affiliation, and predicted hosts of the viruses. Summing up the abundances of viral sequences according to the sample from which they were assembled revealed, on the one hand, that many of the viral sequences that were assembled from photic sample metagenomes were also abundant in the aphotic samples (Fig. 3a). On the other hand, some viral sequences obtained from the aphotic samples were also abundant among the photic samples, albeit at lower relative abundances. Overall, this suggests an intense mixing between communities among zones, but with a greater influence of the photic zone over the aphotic zone, as could be expected from the convection currents in the lake [4,33]. Next, we summed up the abundances of viral sequences according to their family level taxonomic affiliation, obtained by closest relative assignment. This revealed a very stable trend of community composition with only very subtle changes in the relative abundances of the dominant families (Fig. 3b). Overall, all samples were dominated by viruses assigned to the family Myoviridae, followed by Siphoviridae and Phycodnaviridae, with smaller contributions of Podoviridae and Mimiviridae. Finally, we summed up abundances of viral sequences according to the phylum of their assigned hosts. This pointed to more notable variations in community composition according to depth. Overall, the dominant groups in all samples were viruses predicted to infect Actinobacteriota and Gammaproteobacteria (Fig. 3c). The abundances of viruses predicted to infect Cyanobacteria decreased with depth, while the abundances of viruses predicted to infect Crenarchaeota, Chloroflexota, Planctomycetota, Nitrospirota, and Patescibacteria increased. Overall, these results point to prominent changes in the composition of viral communities across the depth gradient, and subtle yet detectable differences across the seasonal changes. This is in agreement with recent findings that postulated that light and temperature are major drivers of viral community composition in marine ecosystems [19,34]. Canonical Correspondence Analysis (CCA) revealed a tight coupling between the abundances of viral groups and their predicted hosts ( Figure S1). Namely, we observed strong positive associations between virus/host groups for Cyanobacteria, Actinobacteria, Alphaproteobacteria, and Chloroflexota while the remaining groups displayed weaker (yet mostly positive) associations with their prokaryotic counterparts. Associations between the abundances of viruses and their hosts are expected, although often difficult to detect in metagenomic datasets because of factors such as the compositional nature of the data and temporal constraints that lead to a decoupling of virus/host abundances.
In previous studies, we detected a predominance of freshwater microbes involved in the nitrification (i.e., Nitrospirota and Crenarchaeota) and oxidation of methyl compounds (i.e., Methylophilaceae) in the aphotic Lake Baikal [5,6]. On the one hand, the ecological roles and diversity of AMGs of viruses that infect dominant groups of marine ecosystems (i.e., Cyanobacteria and Proteobacteria) have been characterized in detail [10,14,35]. Likewise, the diversity of phages that infect Actinobacteriota (the dominant group among Baikal samples) in freshwater ecosystems has also been described in detail [20,21]. On the other hand, the roles of viruses infecting nitrite oxidizers and methylotrophic bacteria in deep freshwater ecosystems are mostly unknown. Therefore, in this study, we have focused on viruses that prey on microbes carrying out these processes, particularly viruses predicted to infect taxa for which few or no viruses have been described. In what follows, we describe them and their potential involvement in biogeochemical processes through AMGs.

Nitrospirota viruses from Lake Baikal interfere with dark carbon fixation
First, we manually curated the annotation of sequences of viruses predicted to infect bacteria of the phylum Nitrospirota. Members of Nitrospirota are chemolithoautotrophic bacteria that perform nitrite oxidation mediated by nitrite oxidoreductases as a mean for energy acquisition, and some species are capable of complete nitrification (commamox) from ammonia to nitrate [36,37]. These organisms use the reductive tricarboxylic acid (rTCA) cycle for dark carbon fixation [38,39]. The viruses assigned to Nitrospirota were clustered into four distinct viral populations: VP_ 99, VP_1723, VP_4657, and VP_7454. Among those, there is considerable evidence suggesting that VP_99 (Fig. 4a) and VP_1723 (Fig. 4b) are actual fragments of different regions of the same (or closely related) viral genome (Table  S1). First, taxonomic classification assigned viruses from both populations to the genus T4Virus within the family Myoviridae. Second, the sequence representatives of both viral populations were assembled in the summer 1350 m sample. Third, the representatives of these populations have almost identical GC content of 47.48% for VP_99 and 47.23% for VP_1723. Fourth, sequences from both populations match different regions of the Enterobacteria phage T4 genome (NC_000866.4). Finally, members of these two populations have a somewhat complementary gene content with the hallmark viral genes missing in one being present in the other.
The gene content of these populations provided insights into the infection strategies taken by these viruses (Fig. 4). Most notably, the members of VP_99 encoded a 4Fe-S Ferredoxin gene (Fig. 4a). Ferredoxins are involved in a diverse set of redox reactions. These proteins are also involved in the energy metabolism of Nitrospirota [38] and on the rTCA cycle [39]. The high degree of identity (86%) between viral and host ferredoxin suggests that this may be an AMG. Meanwhile, the members of VP_1723 displayed a different gene content (Fig. 4b).
Most notably, members of this population displayed a ferredoxin oxidoreductase, an epsilon subunit of 2oxoglutarate:ferredoxin oxidoreductase, an iron-sulfur cluster biosynthesis protein, and an acyl-coA desaturase. All of these proteins had best hits to Nitrospira genes, suggesting that those are phage AMGs acquired from the host. The ferredoxin oxidoreductase and 2-oxoglutarate: ferredoxin oxidoreductase are clustered together in the genomes of Nitrospira defluvii, with the same orientation and little intergenic space, suggesting that they might have been acquired by the virus together in a single event and, more importantly, that they are all involved in the same cellular process.
The iron-sulfur cluster assembly protein is likely involved in the biosynthesis of the viral encoded ferredoxin (Fig. 4c). Meanwhile, the 2-oxoglutarate:ferredoxin oxidoreductase is a key enzyme of the rTCA cycle in the genus Nistrospira [38,40]. The viral ferredoxin oxidoreductase displayed significant homology with several ferredoxin oxidoreductases from the phylum Nitrospirota, including the pyruvate:ferredoxin oxidoreductase beta subunit of Nitrospira defluvii. This enzyme also mediates a key step of the reverse TCA cycle in Nitrospira. The presence of such genes in a viral genome is surprising since, to our knowledge, no AMGs acting on dark carbon fixation pathways have been described so far. Collectively, the occurrence of these genes in the viral genomes suggests that viruses of Nitrospirota modulate dark carbon fixation processes during infection. This is reminiscent to the way cyanophages modulate photosynthesis and carbon fixation pathways in Cyanobacteria [12].
The acyl-CoA desaturase (also known as fatty acid desaturase or Stearoyl-CoA desaturase) is an enzyme that creates double bonds in fatty acids by removing hydrogen atoms, resulting in the creation of an unsaturated fatty acid. Unsaturated fatty acids are part of cell membranes, and a higher content of unsaturated fats is associated with higher membrane fluidity. The presence of an acyl-CoA desaturase indicates that these viruses modulate the lipid metabolism of their host during infection. This gene belongs to a category of AMGs that is still poorly characterized in phages [14]. Although eukaryotic viruses are known to influence the host lipid metabolism at multiple levels [41,42], a comprehensive understanding of this process in viruses of bacteria has not been achieved [43]. Some ferredoxins are also involved in lipid metabolism [44]; thus, it is possible that the viral ferredoxins and acyl-CoA desaturase work together to modulate host lipid metabolism during infection.
Based on these findings, we postulate that Nitrospirota viruses of Lake Baikal make use of a diverse array of AMGs to modulate host metabolism during infection (Fig. 4C). These findings have important implications to the understanding of dark carbon fixation in freshwater ecosystems, a process of recognized importance [45,46] in which the role of viruses is still poorly characterized. Our data demonstrates that viral enhanced dark carbon fixation might be a process of ecological relevance.

Baikal viruses infecting methylotrophs interfere with methylotrophic metabolism and other major pathways
We identified viral populations predicted to infect methylotrophic bacteria. Among these, were included populations VP_139 (Fig. 5a) and VP_266 (Fig. 5b). These populations displayed ambiguous host predictions, with homology matches to multiple bacterial phyla (Bacteroidota and Proteobacteria). Hence, our original pipeline only assigned hosts to most members of these populations to the level of domain. Manual inspection of their computational host predictions revealed that homology matches to members of the family Methylophilaceae had higher bitscores and identities and lower number of mismatches, indicating that members of VP_139 and VP_266 infect methylotrophic bacteria of the family Methylophilaceae, possibly from the closely related genera Methylopumilus, Methylophilus, or Methylotenera. As before, we found evidence that these sequences are derived from the same genome (Table S1), as suggested by their complementary gene content, taxonomic affiliation (T4Virus), assembly source (winter surface samples), and GC content (35%).
Representative members of both VP_139 and VP_266 populations encoded methanol dehydrogenase (Fig. 5 a  and b), a hallmark gene of methylotrophic metabolism in bacteria [47,48]. This gene is responsible for the conversion of methanol into formaldehyde, the first and fundamental step of methylotrophic metabolism. To our knowledge, this is the first time this gene is being reported in viruses. We propose that viral methanol dehydrogenase is a novel AMG used by phages upon infection to boost up energy production of their methylotrophic hosts.
The representative sequence of VP_266 also encoded the pyrroloquinoline quinone precursor peptide PqqA. Pyrroloquinoline quinone (PQQ) is a redox cofactor which is necessary for the activity of the methanol dehydrogenase [49]. The biosynthesis of PQQ is mediated by radical SAM proteins [50,51], which were detected in the genomes of VP_139. The representative sequence of VP_139 also encoded a methionine adenosyltransferase, which performs biosynthesis of S-adenosylmethionine (SAM) from L-methionine in the S-adenosyl-L-methionine cycle. In addition, it encoded an S-adenosyl-L-homocysteine hydrolase (5′-methylthioadenosine/S-adenosylhomocysteine nucleosidase) which performs the conversion of S-adenosyl-L-homocysteine into S-ribosyl-L-homocysteine also within this cycle. Methyltransferases, three of which were found in the representative genome of VP_139, also play a fundamental role on the S-adenosyl-L-methionine cycle, mediating the demethylation of S-adenosyl-L-methionine to convert it into S-adenosyl-L-homocysteine [52]. The presence of so many auxiliary metabolic genes of the S-adenosyl-L-methionine cycle suggests that modulating this pathway is of fundamental relevance for the replication process of these viruses.
Members of VP_139 also encoded a phosphoribosylaminoimidazole synthetase (phosphoribosylformylglycinamide cyclo-ligase, purM gene), a widespread viral gene which is involved in nucleotide metabolism and alpha and beta subunits for ribonucleotide-diphosphate reductase, which is also involved in this pathway. Together, these observations suggest that members of VP_139 and VP_266 have a diverse array of proteins to modulate the metabolism of their methylotrophic hosts during infection (Fig. 5c). This is achieved by expressing genes for methanol dehydrogenase and the cofactor PQQ to enhance the ratios of methanol oxidation to formaldehyde. It also expresses genes to boost up the biosynthesis of PQQ and the S-adenosyl-L-methionine cycle. Together, these changes to host metabolism are likely to enhance the production of formaldehyde from methanol oxidation. The generated formaldehyde is then converted into formate through the tetrahydrofolate pathway or directed to the ribulose monophosphate cycle. The representative sequence of VP_266 encoded a peptide deformylase. This represents yet another candidate AMG, which would act to enhance the formate pool by removing formyl groups from host peptides. Interestingly, formate is used by phosphoribosylglycinamide formyltransferase 2 in the 5-aminoimidazole ribonucleotide biosynthesis pathway. Downstream of this step of the 5-aminoimidazole ribonucleotide biosynthesis pathway, phosphoribosylaminoimidazole synthetase, and ribonucleotide-diphosphate reductase that also participate in the biosynthesis of purines were also found in the viral genomes. Thus, we conclude that these viruses enhance the methylotrophic metabolism of their hosts for the purpose of redirecting it towards the synthesis of nucleotides to be used in the replication of the viral genome (Fig. 5c).
The discovery of these AMGs represents yet another novel way by which Baikal viruses modulate host metabolism. In this case it is of special relevance that these viruses affect three different host pathways: methanol oxidation, nucleotide metabolism, and the S-adenosyl-Lmethionine cycle. In addition to this extensive gene repertoire, we also identified other genes among these viral populations with the potential to be AMGs, albeit not directly linked to methanol oxidation or nucleotide metabolism. They included glycerol-3-phosphate cytidylyltransferase which is involved in cell wall teichoic acid biosynthesis [53] and a class II aldolase/adducin family protein. Although our data does not allow us to determine the roles of these two proteins during infection, their presence among viral genomes is a novelty, and it points to the diversity of strategies of these viruses to modulate host metabolism. A previous study has reported the isolation of a siphovirus (Phage P19250A) infecting Methylopumilus planktonicus (LD28) from Lake Soyang in South Korea. Nevertheless, this virus did not encode any of the putative AMGs reported here [54].
Another relevant microbe in the bathypelagic water column of Lake Baikal is Methyloglobulus, a genus of small (ca. 2.2 Mb of estimated genome size), yet abundant methanotrophs. These organisms were estimated to be among the most abundant microbes in bathypelagic waters of Lake Baikal (accounting up to 1% of total mapped reads) and a MAG derived from this genus was described [6]. We identified a viral population predicted to infect Methyloglobulus. In particular VP_1254 was composed of scaffolds of ca. 17 Kb that were assembled from bathypelagic metagenomes from both summer and winter (Fig. 6a). These scaffolds displayed multiple homology matches to various taxa of Gammaproteobacteria. Among these, they consistently had high identity hits to a DnaK chaperone gene from the Baikal Methyloglobulus MAG. Finally, read recruitment confirmed the prevalence of these viruses among bathypelagic samples and absence from epipelagic and mesopelagic zones, following a pattern similar to that observed for Methyloglobulus (Fig. 6b). To our knowledge, no genomes of viruses infecting freshwater Methyloglobulus have been described. The gene content of these viruses included proteins involved in production of curly polymers and the ribosomal protein S21, which were previously detected in SAR11 phages [23] and a putative Polynucleobacter phage [5]. Interestingly, these viruses did not encode the diverse array of AMGs described for the methylotroph viruses from VP_139 and VP_266, possibly because these sequences do not represent the complete viral genome. Another possible explanation is the fact that viruses from VP_139 and VP_266 are typical of the epipelagic zone, while those from VP_1254 are typical of the bathypelagic zone. Therefore, the metabolic constraints faced by these two groups of viruses during infection might be drastically different. These differences could explain the distinct array of AMGs between these two groups despite the fact that they infect closely related hosts with similar one-carbon metabolisms.

Novel freshwater viruses of Crenarchaeota
One of the singularities of the bathypelagic and mesopelagic Lake Baikal waters was the high abundances of Crenarchaeota (formerly Thaumarchaeota, e.g., Nitrosopumilus and Nitrosoarchaeum). We identified viral scaffolds predicted to infect Crenarchaeota in both summer and winter from bathypelagic and mesopelagic samples. Specifically, we retrieved scaffolds from multiple populations that presented a remarkable synteny to previously described marine Crenarchaeota viruses (Marthavirus) [55]. The gene content of these scaffolds was conserved regardless of their sample of origin, as well as their gene order (Fig. 7a). In addition, the typical Marthavirus genes radA, ATPases, and CobS were conserved in their Lake Baikal counterparts. Thus, these viruses from Lake Baikal are the first representatives of freshwater viruses of Crenarchaeota, which are closely related to marine Marthavirus. However, a notorious difference between the marine and freshwater viruses of Crenarchaeota was the distribution of isoelectric points among their protein encoding genes (Fig. 7b). Specifically, the isoelectric points of the Mediterranean Marthavirus representative was displaced towards more acidic values. This same tendency has been previously observed when comparing proteomes of Nitrosopumilaceae from marine and freshwater environments [56]. This finding demonstrates that the shift in the distribution of isoeletric points among proteins that is observed during marine-freshwater transitions also extends to viruses, which sheds light on the processes by which these biological entities expand their ecological niches over time.

Conclusion
We have taken advantage from the availability of metagenomes of Lake Baikal to shed light into the diversity of viruses within this unique habitat. Because we used metagenomes collected using a 0.2-μm filter, we expected to obtain only the viral genomes that were being replicated within the retained cells. This method has been widely used and provides information about the viruses that are active in the community [20,57]. Nevertheless, there might be other viruses that were missed by our approach but that might be recovered by sequencing the viral particles (virome). Even so, we have obtained a large number of novel genomes of the most active viruses present in these samples. Interestingly, among them, we have identified viruses predicted to prey on microbes that are major components of the community and which provide critical ecological functions, specifically in the large aphotic water mass of this deep lake.
We have found in Baikal another example of close relatives between marine and freshwater environments, the Crenarchaeota viruses. The degree of synteny observed between Marthaviruses and the Baikal scaffolds was remarkable. The occurrence of a large aerobic and deep water mass (a common feature between the ocean and Lake Baikal) is likely what facilitated the transition of these viruses between the two environments. Such parallelisms allow detection of specific adaptations required to live in low salt environments (the concentration of sodium for example is barely detectable in Baikal). One difference that has been detected in all cases of marine-freshwater transitions is the decrease in the isoelectric point of the proteome [56]. The fact that this could also be detected in viruses indicates how critical this adaptation results for a proper functioning of basic molecular machinery such as that of DNA replication, transduction, and translation.
In conclusion, our analysis of the viral communities from Lake Baikal has demonstrated how their composition and functioning changes across seasons and depths. These findings shed new light on the influence of environmental parameters over viruses in freshwater ecosystems. In addition, we described novel viruses with unique gene repertories, thus expanding the understanding of viral genetic diversity. These novel viruses also displayed new strategies for modulating host metabolism through auxiliary metabolic genes, by which they influence processes of ecological relevance, namely the methylotrophic metabolism and dark carbon fixation. Together, these findings expand the understanding of viruses, the most abundant yet elusive biological entities on Earth and reveal novel roles played by them in processes of major biogeochemical relevance that take place in freshwater ecosystems.

Sampling and environmental parameters
The sampling strategy and sample post-processing for winter samples have been previously described [5,6]. Summer samples were collected with the SBE 32 Carousel Water Sampler from aboard the RV "Vereshchagin" in July 2018. Between 20 and 100 L of water samples were retrieved from four horizons on each station. Fig. 7 Novel Crenarchaeota (formerly classified as ammonia oxidizing Thaumarchaeota) viruses from Lake Baikal. a Synteny maps depicting the similarities between a representative sequence of VP_2384 and a marine Marthavirus sequence. b Distribution of isoeletric points among proteins from marine Marthaviruses and a close relative from Lake Baikal Water temperature and salinity were simultaneously measured with sensors SBE 19 Plus and SBE 25 Sealogger CTD (Sea-Bird Electronics) accurate within 0.002°C and with a resolution of 0.0003°C. pH values were measured using a pH 3310 meter (WTW, Germany). Overall, the hydrological conditions and the mineralization in the water column of the studied area corresponded to the data that were previously recorded during the same period in Lake Baikal [58,59]. At station 2, samples obtained on two runs were used to isolate DNA. The total volume of filtered water from the 300 m sample was 70 L, and from the 390 m sample the volume was 60 L.
For metagenomes, each sample was filtered through a net (size 27 μm) and then filtered through nitrocellulose filter with a pore size of 0.22 μm (Millipore, France), and the material from the filter was transferred to sterile flasks with 20 mL of lysis buffer (40 mmol L − 1 EDTA, 50 mmol L − 1 Tris/HCl, 0.75 mol L − 1 sucrose) and stored at − 20°С. DNA was extracted according to the modified method of phenol-chloroform-isoamyl alcohol method and stored at −70°C until further use. Metagenome sequencing, read-cleaning, and assembly steps were performed as previously described [5,6].

Sequence processing and analysis
Coding DNA sequences were identified in assembled scaffolds using Prodigal [60]. Isoelectric points were calculated for each protein as previously described [56]. Protein sequences were queried against the NCBI-nr database using DIAMOND v0.8.22 [61] and Pfam using HMMER v3.1b2 [62] for taxonomic and functional annotation. Identification of putative viral sequences was performed in three steps: Sequences were analyzed through VirSorter v1.0.6 [30], and those assigned to categories 1 and 2 were considered as putative viruses. Also, sequences were analyzed with VirFinder v1.1 [31] and those with a score ≥ 0.7 and p value ≤0.05 were also considered putative viruses. Finally, protein sequences extracted from the scaffolds were queried against the pVOGs database [32] using HMMER set to a maximum e value of 0.00001 [62]. For each scaffold, we calculated the added viral quotient (AVQ) as the sum of the viral quotients of each pVOG that hits with the proteins of each scaffold [63]. Scaffolds for which at least 20% of proteins mapped to pVOGs resulting in an AVQ ≥ 2 were considered putative viruses. Finally, all of the putative viral sequences were subjected to manual inspection of their gene content and sequences that did not display a clearly viral signature (i.e., presence of hallmark viral genes and enrichment of hypothetical proteins) were excluded from further analysis resulting in a dataset of bona fide viral sequences. In addition, the bona fide viral sequences were clustered into viral populations based on 80% of shared genes at 95% average nucleotide identify as previously described [19].

Taxonomic classification of viral sequences
Taxonomic affiliation of viral sequences was performed by closest relative affiliation. First, protein sequences derived from the bona fide viral sequences were queried against the viral sequences from the NCBI-nr database. DIAMOND was used with the following parameters: identity ≥ 30%, bit-score ≥ 50, alignment length ≥ 30 amino acids, and e value ≤ 0.00001 and the BLOSUM45 matrix. Next, the closest relative of each sequence was defined as the taxon that matched the highest number of protein sequences. Potential ties between taxa were resolved by selecting the one with the highest value of average identity among hits as the closest relative.

Computational host prediction of viral sequences
Host predictions were performed based on previously reported benchmarking of methods to assign putative hosts to viruses based on shared genetic content between virus and host [64]. For these searches, two reference databases were used: the NCBI RefSeq genomes of Bacteria and Archaea and a dataset of 266 prokaryote metagenome assembled genomes (MAGs) previously obtained from Lake Baikal [5,6]. The taxonomic affiliation of RefSeq genomes was obtained from the Genomic Taxonomy Database (GTDB) [65]. Baikal MAGs were also classified according to the GTDB system using GTDB-tk v0.3.2 [66]. Three signals of virus-host association were analysed: homology matches, shared tRNAs, and CRISPR spacers. Homology matches were performed by querying viral sequences against the databases of prokaryote genomes using BLASTn v2.6.0+ [67]. The cut-offs defined for these searches were minimum alignment length of 300 bp, minimum identity of 50%, and maximum e value 0.001. tRNAs were identified in viral scaffolds using tRNAScan-SE v1.23 [68] using the bacterial models. The obtained viral tRNAs were queried against the database of prokaryote genomes using BLASTn. The cutoffs defined for these searches were minimum alignment length of 60 bp, minimum identity of 90%, minimum query coverage of 95%, maximum of 10 mismatches, and maximum e value of 0.001. CRISPR spacers were identified in the databases of prokaryote genomes using CRISPRDetect v2.2 for the MAGs [69] and a custom script for the RefSeq genomes [70]. The obtained spacers were queried against the sequences of bona fide viral sequences also using BLASTn. The cutoffs defined for these searches were minimum identity of 95%, minimum query coverage of 95%, maximum of 1 mismatch, and maximum e value of 1. Ambiguous host predictions that assigned viruses to different microbial taxa were removed at each taxonomic level. Finally, putative hosts were also assigned to the bona fide viral sequences by manually inspecting their gene content.

Prokaryote and viral abundance analysis
A database was compiled with one genome from each species representative of Bacteria and Archaea from the Genome Taxonomy Database (GTDB, release 89) [65]. Protein sequences were predicted from these genomes using Prodigal v2.6.3 [60] with default parameters. Finally reads from the 10 metagenomes were queried against the GTDB database of protein-encoding genes using DIA-MOND [61] setting e value to 0.00001 and minimum Bitscore to 50. For viruses, reads from the 10 metagenomes were queried against the assembled Baikal scaffolds using the sensitive-local mode of Bowtie2 v2.3.5.1 [71]. The resulting abundance matrix was analyzed using the Vegan Package [72] in R v3.6.1. Non-metric multidimensional scaling (NMDS) was performed based on the relative abundances of viral sequences using the Bray-Curtis dissimilarity measure. Canonical correspondence analysis was performed based on the abundances of prokaryote taxa (phylum level) and viral groups (summed up according to predicted host phylum).
Additional file 1: Table S1. Detailed description of all the analysed Baikal scaffolds. Fields include Completeness inferred by VirSorter. For each taxonomic level from domain to species: taxon name, number of CRISPR hits, number of homology matches hits, number of shared tRNA hits. Scaffold length. For each taxonomic level from domain to species: closest relative (CR) average amino acid identity (AAI), number of matched protein encoding genes (PEGs), percentage of matched PEGs relative to the total number of PEGs identified in the scaffold, and CR taxon name. MD5: MD5 checksum of scaffold sequence. Number of identified protein encoding genes. True_Virus: indicating if the scaffold was classified as a bona fide virus sequence. VP: Viral population to which scaffold was assigned. VirFinder score and p-value, VirSorter category, percentage of scaffold PEGs matched to pVOGs database, total number of hits to pVOGs database and added viral quotient (AVQ) of these hits. (XLS 61331 kb) Additional file 2: Figure S1. Canonical Correspondence Analysis depicting the associations among the abundances of viruses (grouped at the level of phylum, or class in the case of Proteobacteria) and prokaryote taxa (grouped at the level of phyla, or class in the case of Proteobacteria) across samples. Coloured dots represent samples, red taxon names represent prokaryote abundances and arrows represent viral group abundances. (PDF 6.83 kb)