The primary concern in applying Nanopore sequencing is its error rate: the median raw read accuracy for R9.4 (currently the most widely used version) was below 90% in 2019 [7]. However, in the mock community dataset sequenced in this study, the median “Guppy read mean accuracy” was substantially improved by basecaller upgrades and newly developed chemistry, reaching 95.5% (Kit 9) and 98.0% (Q20+) when basecalled with Guppy v6.0.0 (the sup model) (Fig. 1a). In addition, the density profile of “Guppy read mean accuracy” against “Read mapping accuracy” suggested that the Nanopore read quality scores predicted by Guppy correlated well with the empirical read accuracy estimated from read-to-reference alignments, and the quality of some sequences was even underestimated (Fig. 1a and Supplementary Figure S1).
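The two accuracy measures compared above can be illustrated with a minimal sketch. It assumes that the predicted accuracy is derived by averaging per-base error probabilities from Phred scores, and that the empirical accuracy is the fraction of matching positions among all aligned columns. Both definitions, and the function names, are illustrative assumptions rather than the exact formulas used to produce Fig. 1a.

```python
def mean_accuracy_from_phred(quals):
    """Predicted read accuracy (%) from per-base Phred scores.

    Assumption: error probabilities are averaged (not the Phred scores
    themselves), so a few low-quality bases weigh heavily.
    """
    if not quals:
        raise ValueError("quality list is empty")
    mean_error = sum(10 ** (-q / 10) for q in quals) / len(quals)
    return 100.0 * (1.0 - mean_error)


def mapping_accuracy(matches, mismatches, insertions, deletions):
    """Empirical read accuracy (%) from read-to-reference alignment counts.

    Assumption: accuracy = matches / all aligned columns.
    """
    aligned = matches + mismatches + insertions + deletions
    if aligned == 0:
        raise ValueError("no aligned columns")
    return 100.0 * matches / aligned


# Hypothetical read: predicted from qualities vs. observed from alignment.
predicted = mean_accuracy_from_phred([10, 20])
observed = mapping_accuracy(matches=95, mismatches=2, insertions=1, deletions=2)
print(f"predicted {predicted:.1f}%, observed {observed:.1f}%")
```

Under these definitions, a read whose observed value exceeds its predicted value is one whose quality was underestimated by the basecaller.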
To facilitate rapid genome reconstruction, we proposed NanoPhase, a package that generates MAGs from a single long-read dataset (Fig. 1b). NanoPhase is designed to disentangle a complex dataset into clusters of draft bins and to polish them efficiently on a genome-resolved basis. In total, 28.4 Gb (N50: 5.9 Kb, two flowcells) and 7.2 Gb (N50: 5.4 Kb, one flowcell) were generated from the mock community using the Kit 9 and Q20+ chemistries, respectively. As expected, bacterial and archaeal genomes with sequencing coverage of < 5× could not be reconstructed, and only one E. coli MAG was resolved to represent five closely related strains because of their very high average nucleotide identity (98.3–99.4%). Thus, 12 MAGs with a median completeness of 98.7% were reconstructed from the Kit 9 dataset, and the Q20+ chemistry demonstrated slightly better performance, recovering 11 MAGs with a median completeness of 99.0% (Fig. 1c). MAGs from both datasets were very close to the reference genomes, benefiting from improved read accuracy and homopolymer resolution (Supplementary Figure S2), as also supported by few indel errors (Fig. 1d) and high IDEEL [37] scores (Supplementary Figure S3). Notably, 8 (75%) MAGs were assembled into circular, complete genomes from the Kit 9 dataset, more than from the Q20+ dataset (3), mainly owing to its much higher sequencing coverage (~ 4-fold).
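For readers unfamiliar with the N50 statistic reported for the read sets, the following sketch shows the standard definition: the length L such that reads (or contigs) of length at least L contain at least half of all bases. The function name and the toy lengths are illustrative only.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths."""
    if not lengths:
        raise ValueError("no lengths given")
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down until half of all bases are covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length


# Toy example with five hypothetical read lengths (in Kb).
example_n50 = n50([2, 3, 4, 5, 6])
print(example_n50)
```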
We next evaluated how well NanoPhase reconstructs genomes from a complex AS sample harboring thousands of microbial species. Both Kit 9 and Q20+ chemistries were used for noisy long-read sequencing, on five flowcells and one flowcell, generating 85.3 Gb (Kit 9 dataset, N50: 6.8 Kb) and 9.4 Gb (Q20+ dataset, N50: 6.5 Kb), respectively. In addition, we observed that retaining only sequencing reads with > 90% accuracy (QA90) is suitable for genome reconstruction in the complex microbiome, balancing yield and accuracy and generating more reference-quality MAGs (Supplementary Table 2).
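A QA90-style filter can be sketched as follows. This is a minimal illustration, not the filtering tool used in this study: it assumes four-line FASTQ records with Phred+33 quality encoding and estimates read accuracy from the mean per-base error probability. The helper names `read_accuracy` and `filter_fastq` are hypothetical.

```python
import io


def read_accuracy(quality_string, offset=33):
    """Estimated read accuracy (%) from a Phred+33 quality string."""
    errors = [10 ** (-(ord(c) - offset) / 10) for c in quality_string]
    return 100.0 * (1.0 - sum(errors) / len(errors))


def filter_fastq(handle, min_accuracy=90.0):
    """Yield (header, sequence, quality) for reads above min_accuracy.

    Assumes strict four-line FASTQ records (no wrapped sequences).
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line
        quality = handle.readline().rstrip("\n")
        if quality and read_accuracy(quality) > min_accuracy:
            yield header, sequence, quality


# Two hypothetical reads: r1 has Q40 bases, r2 has Q3 bases.
demo = "@r1\nACGT\n+\nIIII\n@r2\nACGT\n+\n$$$$\n"
kept = [header for header, _, _ in filter_fastq(io.StringIO(demo))]
print(kept)
```

Raising `min_accuracy` trades yield for accuracy, which is the balance the QA90 threshold is meant to strike.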
Using the Kit 9 dataset, 275 MAGs were reconstructed with a median completeness, contig count, and coverage of 89.5%, 9, and 17×, respectively (Fig. 2a), representing 46.9% of the microbial community. Furthermore, the median N50 of these MAGs was 735 Kb, an approximately 44- or 86-fold improvement over the short-read methods (Supplementary Table 3), demonstrating that long reads markedly closed genome gaps. Of these, 94 MAGs with a median coverage of 28× were classified as high-quality, meeting the stringent criteria that include full-length rRNA operons [38]. Notably, circular genomes were also recovered.
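The high-quality classification can be sketched as a simple rule. The thresholds below assume that [38] refers to the MIMAG standard as commonly stated (> 90% completeness, < 5% contamination, presence of 5S, 16S, and 23S rRNA genes, and at least 18 tRNAs for high quality; ≥ 50% completeness and < 10% contamination for medium quality). The exact criteria applied in this study may differ, and the `MagStats` fields are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class MagStats:
    completeness: float   # percent
    contamination: float  # percent
    has_5s: bool
    has_16s: bool
    has_23s: bool
    trna_count: int


def classify_mag(mag):
    """Classify a MAG as 'high', 'medium', or 'other' (assumed thresholds)."""
    rrna_complete = mag.has_5s and mag.has_16s and mag.has_23s
    if (mag.completeness > 90 and mag.contamination < 5
            and rrna_complete and mag.trna_count >= 18):
        return "high"
    if mag.completeness >= 50 and mag.contamination < 10:
        return "medium"
    return "other"


# Hypothetical MAG: near-complete but missing the 16S rRNA gene.
example = MagStats(98.7, 1.2, True, False, True, 40)
print(classify_mag(example))
```

The example illustrates why long reads matter here: a near-complete bin that lacks an rRNA gene, as is common in short-read assemblies, cannot reach the high-quality tier under these rules.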
Compared with short-read-based methods, NanoPhase also reconstructed highly accurate genomes from the complex sample in terms of completeness (median value of 89.5%; Fig. 2b) and IDEEL score (median value of 0.58; Fig. 2c). As expected, the Q20+ chemistry performed better, with a higher IDEEL score (median value of 0.64; Fig. 2c), but yielded only about half the per-flowcell throughput of the Kit 9 chemistry. This yield limitation currently increases the sequencing cost of genome reconstruction with the Q20+ chemistry. In addition, short-read-based Pilon polishing of the MAGs did not considerably improve genome accuracy (Fig. 2b, c), suggesting that it will not be an essential step in the future, particularly given the continuous improvement of Nanopore sequencing.
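The throughput comparison follows directly from the yields and flowcell counts reported above for the AS and mock community datasets. The short calculation below is a worked check of that arithmetic and assumes that yield was spread evenly across flowcells, which the text does not state.

```python
def per_flowcell_yield(total_gb, flowcells):
    """Average yield per flowcell in Gb (assumes an even split)."""
    if flowcells <= 0:
        raise ValueError("flowcell count must be positive")
    return total_gb / flowcells


# Totals and flowcell counts as reported in the text.
as_kit9 = per_flowcell_yield(85.3, 5)
as_q20 = per_flowcell_yield(9.4, 1)
mock_kit9 = per_flowcell_yield(28.4, 2)
mock_q20 = per_flowcell_yield(7.2, 1)

as_ratio = as_q20 / as_kit9        # Q20+ relative to Kit 9, AS sample
mock_ratio = mock_q20 / mock_kit9  # Q20+ relative to Kit 9, mock community
mock_total_fold = 28.4 / 7.2       # total-yield gap in the mock community

print(f"AS: {as_ratio:.2f}, mock: {mock_ratio:.2f}, fold: {mock_total_fold:.1f}")
```

Both ratios come out close to one half, and the total-yield gap in the mock community is close to the ~ 4-fold coverage difference noted for that dataset.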
Besides the improved reconstruction of 16S rRNA genes and prediction of secondary metabolite potential discussed in our previous study [7], the identification of some mobile genetic elements, e.g., prophages, which has depended heavily on bacterial isolates, could also benefit from these high-contiguity, reference-quality genomes. Prophages are a major contributor to the diversity of bacterial gene repertoires, and their relationship with bacteria is multifaceted, i.e., they can increase bacterial fitness while carrying the risk of future lysis. In total, 165 prophages with a median length of 14.3 Kb were identified within 111 MAGs, and one MAG even possessed five different prophage sequences. The widespread occurrence of prophages in the recovered MAGs implied that prophages are a critical factor in the evolution of microbial genomes via horizontal gene transfer between bacterial populations. Interestingly, no virulence factors or antibiotic resistance genes (ARGs) were observed within the recovered prophage sequences, suggesting that prophage-mediated transfer of virulence factors and ARGs plays a minor role in the studied activated sludge microbiome. In addition, the most prevalent prophage accessory genes were methyltransferases, found in 24 prophage sequences, which have a putative role in protecting prophages from the host immune system. Notably, most prophages remained dormant, transmitting vertically along with bacterial replication. However, five of them were identified as active prophages, indicating that their host populations were undergoing prophage induction. Therefore, active prophage lysis events may have altered the microbial community composition after sampling, suggesting an overlooked role of prophages in shaping the microbial community.