A comprehensive investigation of metagenome assembly by linked-read sequencing

The human microbiota are complex systems with important roles in our physiological activities and diseases. Sequencing the microbial genomes in the microbiota can help in our interpretation of their activities. The vast majority of the microbes in the microbiota cannot be isolated for individual sequencing. Current metagenomics practices use short-read sequencing to simultaneously sequence a mixture of microbial genomes. However, these results are in ambiguity during genome assembly, leading to unsatisfactory microbial genome completeness and contig continuity. Linked-read sequencing is able to remove some of these ambiguities by attaching the same barcode to the reads from a long DNA fragment (10–100 kb), thus improving metagenome assembly. However, it is not clear how the choices for several parameters in the use of linked-read sequencing affect the assembly quality. We first examined the effects of read depth (C) on metagenome assembly from linked-reads in simulated data and a mock community. The results showed that C positively correlated with the length of assembled sequences but had little effect on their qualities. The latter observation was corroborated by tests using real data from the human gut microbiome, where C demonstrated minor impact on the sequence quality as well as on the proportion of bins annotated as draft genomes. On the other hand, metagenome assembly quality was susceptible to read depth per fragment (CR) and DNA fragment physical depth (CF). For the same C, deeper CR resulted in more draft genomes while deeper CF improved the quality of the draft genomes. We also found that average fragment length (μFL) had marginal effect on assemblies, while fragments per partition (NF/P) impacted the off-target reads involved in local assembly, namely, lower NF/P values would lead to better assemblies by reducing the ambiguities of the off-target reads. In general, the use of linked-reads improved the assembly for contig N50 when compared to Illumina short-reads, but not when compared to PacBio CCS (circular consensus sequencing) long-reads. We investigated the influence of linked-read sequencing parameters on metagenome assembly comprehensively. While the quality of genome assembly from linked-reads cannot rival that from PacBio CCS long-reads, the case for using linked-read sequencing remains persuasive due to its low cost and high base-quality. Our study revealed that the probable best practice in using linked-reads for metagenome assembly was to merge the linked-reads from multiple libraries, where each had sufficient CR but a smaller amount of input DNA. 7PaeKL-RUCqQkRQg5_3dd9 Video Abstract Video Abstract


Background
The human microbiota are complex systems that contribute to a large part of human physiological activities and diseases. Knowing the genomic sequences of the microbiota content allows us to study its functions. However, microbial genome sequences are difficult to obtain. While a few microbes can survive isolation and be cultured in vitro for sequencing, the remaining microbial content remains as "microbial dark matter". Alternatively, there have been attempts to use computational means to reconstruct the microbial genomes from a mixture of shortreads sequenced from them. However, such metagenome assembly faces the difficulties of having repetitive sequences of both intra-and inter-species origin, horizontal gene transfers, and mobilization events [1], complicated by uneven abundance of microbes in the sample.
Current algorithms such as IDBA-UD [2], MEGAHIT [3], and MetaSPAdes [4] make use of read depth and fragment insert size constraint to unravel the repetitive sequences and estimate microbial abundance. However, their reliability is affected by the low continuity of shortread assembly. Long-read sequencing has been used to attempt to mitigate these problems, e.g., Nicholls et al. [5] and Sevim et al. [6]. In particular, Moss et al. [7] optimized the long-read library preparation protocol of nanopore sequencing and produced more complete bacterial genomes. However, the application of long-read sequencing in practical application remains costly (the "Discussion" section).
Alternative sequencing platforms that provide longrange sequence information for metagenomics are available in the form of Illumina Truseq Synthetic Long Reads (SLR) and linked-reads. SLR arranges long DNA fragments into 384 well plates, which are further amplified and pooled sequenced with sufficiently deep sequencing depth (~50X per fragment), thus allowing long fragments to be assembled individually [8,9]. Linkedreads are short-reads where reads from the same fragment are marked with the same barcode. The 10x linked-read microfluidic system assigns long DNA fragments into around 1 million partitions, where each fragment is sequenced with a shallow depth (0.1X-0.4X). A method for linked-read metagenome assembly, Athenameta [10], bridges the gaps between contigs by local assembly on co-barcoded reads and outperformed the methods for short-reads and SLR in assembling human gut and environmental microbiome.
There are four key parameters in linked-read sequencing which may impact metagenome assembly [11] (Fig. 1): (i) C R , average depth of short-reads per fragment; (ii) C F : average physical depth of the genome by long DNA fragments; (iii) N F/P , number of fragments per partition; (iv) Fragment length distribution, which is specified using two parameters, namely, μ FL -average unweighted DNA fragment length and Wμ FL -length-weighted average of DNA fragment length. Several of these parameters are interdependent. For example, a greater amount of input DNA increases both C F and N F/P and decrease C R ; and the absolute values of C F and C R are set by how much total read coverage (C) is generated because C R × C F = C. In a previous study, we investigated the effects of these parameters on human diploid assembly [11].
The present study evaluates these parameters with respect to their impact on metagenome assembly. We used three sets of linked-reads, one from simulation, one from a mock community, and another from a real human gut microbiome sample. The simulated data consists of twenty datasets (Table S1) generated by an improved LRTK-SIM [11] that enables to deal with microbial samples with uneven abundance for this study (the "Methods" section). The mock community (ATCC MSA-1003) is a pool of 20 strains with staggered abundance, while the human gut microbiome is from a healthy Chinese stool sample. Because of an absence of ground truth to evaluate human gut microbiome, we annotated contig bins as draft genomes and assigned them to the corresponding taxonomic classification (the "Methods" section).
Our results show that deeper C resulted in more assembled sequences and enabled better genomic coverage, but it was irrelevant to the assembly quality. C was not a dominating factor for contig continuity, which could be influenced more by genome characteristics. We further found C R to affect the number of draft genomes and that C F was associated with assembly quality. The μ FL had marginal effect on assemblies, and lower N F/P values would lead to better assemblies by reducing the ambiguities of off-target reads. Compared to Illumina short-reads, 10x linked-reads significantly improved the metagenome assembly in both contig continuity and genome completeness.

Results
Three sets of linked-reads are used. The first is simulated from the MBARC-26 [6] community (Table S1 and S2), and the twenty simulated datasets are annotated as C − F , C − R ,μ − FL , and N − F=P (where superscript "-" represents the actual values of corresponding parameters, Table  S1). The second and third sets are sequenced from a mock community of 20 strains (one lane reads from Illumina XTen, 108.7 GB, Table S3) and a human gut microbiome (two lane reads from Illumina XTen, 208.97 GB; the "Methods" section and Supplementary Note) followed by reads subsampling to match the expected parameter values. The microbial complexity in the human gut microbiome was evaluated by aligning linked-reads to the reference sequences from human microbiome project [12] (Supplementary Note). To obtain the datasets of different C R and C F , we subsampled short-reads (MSC R-) and long DNA fragments (MSC F-) of the mock community (the "Methods" section), where value of subscript "-" represents the reciprocal of sequenced lanes-for example, MSC R4 /MSC F4 means quarter lane reads were subsampled. Since the composition of the human gut microbiome is unknown, SC Rand SC F-(where subscript "-" represents the reciprocal of sequenced lanes) were generated by subsampling short-reads and barcodes instead. To avoid confusion, we used MSC 1 and SC all to denote total one lane and two lanes linked-reads from the mock community and human gut microbiome, respectively.
According to microbial relative abundance, the microbes were classified into low-(L sim ), medium-(M sim ), and high-abundance (H sim ) in the simulated data (Table  S2); and classified into low-(L mock ), medium-(M mock ), high-(H mock ), and ultrahigh-abundance (UH mock ) in the mock community (Table S3). The contigs from the simulation and mock community were evaluated using two reference-based metrics (total aligned length and genomic coverage) and two measures for contig continuity (contig NG50 and NGA50). For human gut microbiome data, we annotated the contig bins as draft genomes and classified them into high-, medium-, and low-quality [13] (the "Methods" section). The number and quality of annotated draft genomes and contig N50 were used to evaluate the assemblies.
The influence of total read depth C C has little effect on both total aligned length and genomic coverage for L sim and H sim microbes in the simulated data. For M sim microbes, their abundance correlates positively with total aligned length and genomic coverage, indicating that a low abundance could reduce assembly completeness even when C is high (Fig. 2a, b, e, and f). Similarly, we fail to observe any clear trend between NG50 (or NGA50) and C. Two microbes with the deepest C, NC_014212 and NC_017095 (with the highest abundance), were assembled into fragmented contigs ( Fig. 2c and g), suggesting that C was not a dominating factor for contig continuity; which is also seen in C 333 F and C 0:77 R , which have deeper C (C = 120X) than the other configurations. They achieved the largest total aligned length and genomic coverage, but their contig NG50 and NGA50 fluctuated and were not always the best. For example, NC_019904 and NC_002737, which have the lowest abundance among M sim microbes, yielded the largest total aligned length in C 333 The results for the mock community are consistent with those from the simulated data. The total aligned length and genomic coverage were fairly stable for L mock , H mock , and UH mock microbes regardless of the value of C (Fig. 3a, b, e, and f). For M mock microbes, the contigs from MSC F8 /MSC R8 covered the reference genomes poorly due to the insufficient read depth. A quarter lane reads (MSC F4 /MSC R4 ) appeared to suffice for the read depth, achieving around full genomic coverage for all M mock microbes, except for ATCC_33323, which required a quarter lane reads more. No consistent trend could be observed for NG50 and NGA50; a quarter lane reads was necessary to generate contigs with non-zero NGA50 for M mock microbes.
In the results with human gut microbiome, deeper C extends the assembly length but has no impact on the assembly quality. After binning contigs and classifying the bins into draft genomes (the "Methods" section), SC all produced the largest number of bins (148) and the longest assembly length (399.41 Mb). These statistics were reduced along subsampling reads progressively ( Table 1). The proportions of bins annotated as draft genomes were reduced by increasing C (SC R8 77.78%, SC R4 69.39%, SC R2 63.75%, SC R1 65.38%, SC all 54.05%; SC F8 68.75%, SC F4 60.94%, SC F2 59.55%, SC F1 58.16%, SC all 54.05%). C negatively correlates with bin average contamination (SC all 14.40%, SC R2 10.46%, SC R4 9.08%, SC R8 8.94%; Table S4 and Figure S1). We annotated the draft genomes as genus or species (> 60% confidence) based on their k-mer similarities with known microbial genomes (the "Methods" section). Most of the taxonomical classifications were observed by at least two parameter configurations, although some were unique to only one ( Figure S2). Considering the qualities of annotated draft genomes, C demonstrated a positive correlation with the number of medium-and low-quality bins ( Table 1); SC R4 has the most high-quality bins and the largest average bin completeness (73.3%) compared to the other configurations (Table 1 and Table S4). The N50s of high-quality bins are significantly greater than medium-(p value = 0.01) and low-quality (p value = 5.3E−9) bins, suggesting that bin quality (determined by completeness and contamination) is highly correlated with contig continuities ( Fig. 4a-c). Interestingly, high-quality bins required read coverage of at least 50X (SC all = 85.81X; SC F1 = 132.71X; SC F2 = 111.11X; SC F4 = 96.43X; SC F8 = 64.67X; SC R1 = 75.66X; SC R2 = 151.55X; SC R4 = 63.42X; SC R8 = 53.08X), suggesting that the low abundance microbes were not assembled into high-quality genomes. Nevertheless, the contigs with extremely high depth may come from repetitive sequences and reduce the qualities of bins they belong to (Fig. 4d-f; C (high) = 81.4X; C (medium) = 140.1X; C (low) = 1636.5X).
The tradeoffs between C R and C F There are tradeoffs between C R and C F in maintaining the same C. Because the product of PCR amplification per partition can generate around 500 Mb short-reads, loading DNA with greater density (deeper C F ) results in more fragments per partition and fewer reads sequenced for each fragment (shallower C R ). For M sim and H sim species in the simulated data, we found that increasing C R is more effective than increasing C F when C is around 10X, and they are comparably effective when C is beyond 30X (Fig. 2a, b, e, and f). As a rule, deep C R is more pressing to reconstruct DNA fragment if C is low.   118.0 kb). These observations suggest that deeper C R would result in more assembled sequences, while deeper C F would help in improving assembly quality.

DNA fragment length and metagenome assembly
DNA long fragment information is critical for linkedread assembly, as it can help in spanning the gaps between contig breaks that are due to genome variations and repetitive sequences. On the other hand, it may lead to the loss of barcode specificity in disentangling short tandem repeats if the fragments are exceedingly long.
In practice, it is difficult to extract very long DNA fragments from metagenomic sample; even on the gentlest DNA extractions, the mean fragment length (μ FL ) is usually at most 10 to 20 kb. Our simulated data of μ FL from 5 to 100 kb showed that the assembly was not sensitive to μ FL . In some special cases, extremely long DNA fragments could improve the assemblies of M sim microbes with high repeat rates. For example, μ 100 FL (μ FL = 100 kb) improved the contigs NG50 (3.17 Mb, Fig. 2k) and NGA50 (2.71 Mb, Fig. 2l) of NC_018014, which was the one with the highest repeat rate (18.3%).

Barcode specificity is important in microbial deconvolution
For human genome sequencing, each partition contains ten fragments (N F/P = 10) on average [14]. N F/P is supposed to be larger (N F/P = 40) for metagenomic sequencing due to the limited fragment size (Wμ FL = 11.15 kb) and relatively small microbial genome size (Table S5). Large N F/P also increases the difficulties in recognizing the fragments that short-reads belong to. The assembly on N F/P1 , the smallest N F/P (N F/P = 10) in simulation, had much better NG50 and NGA50 for most of the H sim and M sim microbes (14 out of 18, the remaining 4 microbes are comparable, Fig. 2 o and p). Small N F/P also failed to assemble L sim microbes (Fig. 2 m and n).

Assembly on Illumina short-reads and PacBio CCS longreads
Illumina short-read sequencing is a mainstream technology for metagenomic sequencing, but its quality for metagenome assembly is unsatisfactory due to the lack of long-range connectivity. We downloaded the shortread data of the mock community from the Sequence Read Archive [15] (the "Methods" section) and performed an assembly. The assembly on linked-reads (total aligned length 52.04 Mb; genomic coverage 77.20%) is much better than that on short-reads (total aligned length 38.13 Mb; genomic coverage 56.69%, NG50 and NGA50, see Figure S3). For human gut microbiome, the assembly from 8.8 Gb short-reads showed a comparable number of bins (53) and total assembly length (145.50 Mb vs. 159.40 Mb for SC R4 , Table 1). However, the short-read assembly generated no bins with high-quality because it had known issue to detect rRNAs and tRNAs [16,17] (Table 1). SC F8 , with the worst N50 in linkedread assembly, was also 4.49 times (115.29 kb vs. 25.69 kb) greater than Illumina short-reads. The average bin contamination rate of 17.39% for the assembly from short-reads was also much worse than linked-reads (Table S6).
In mock community, we further compared linked-reads to PacBio CCS, which have both extreme long (N50 = 9.08Kb) and highly accurate (> 99% base accuracy) reads. The total aligned length and genomic coverage were comparable between CCS reads (54.04 Mb and 78.68%) and MSC 1 (52.02 Mb and 77.18%), but CCS reads improved the contig continuity substantially ( Figure S4).

Comparison to human genome parameter statistics
10x linked-read sequencing was originally developed for human genome assembly, so we compared the parameter distributions between human genome and human gut microbiome. Because no reference genome was available for human gut microbiome, we collected the sequences of all the non-redundant high-quality bins from SC F-and SC R-datasets as "pseudo" reference genomes and reconstructed 15,994,284 long fragments (> 2 kb). For human gut microbiome, C R was comparable (C R 0.30X vs. 0.32X, Table S5), and C F was 6.26 times larger than human genome (NA24385, C F 595.85X vs. 95.20X, Table S5); also, the DNA fragments were obviously much shorter (μ FL 7.91 kb vs. 28.06 kb; Wμ FL 11.15 kb vs. 44.53 kb, Table S5, Figure S5 and S6).

Discussion
Human microbiota provide rich information to understand microbial activities impacting human health and disease. Projects such as HMP (Human Microbiome Project) [12] and MetaHIT (Metagenomics of the Human Intestinal Tract) [18] have been proposed to collect microbiomes from diverse places of human body and aimed to understand their compositions and functions. De novo metagenome assembly on short-reads is commonly used to assemble microbial genomes from a mixture of culture-free microbes. Although it has been widely applied to assemble thousands of bacterial genomes [19,20], there are four difficulties that remain: (1) assembly for low-abundance microbes; (2) repetitive sequences assembly such as 16S, 23S rRNA; (3) assembly of regions with genetic variation; (4) strain level assembly based on haplotype phasing. Besides metaSPAdes used in the current study, IDBA-UD [2] and MEGAHIT [3] were also tested and achieved comparable results with metaSPAdes. They all showed much worse assembly than linked-reads (Table S6).
Long-read sequencing has the potential to assemble more complete genomes and is believed to dominate the field in the future. However, linked-reads are still worth to be considered as a transitional technology. First, both PacBio and Oxford Nanopore are several times more costly than 10x linked-reads (especially for library preparation). Second, high base error rate of long-reads lacks strength for haplotype phasing and strain level assembly. Third, clinical samples benefit from the small amount of input DNA required by linked-read sequencing. A previous study also observed some high-quality bins generated by linked-reads missed in long-reads assembly [7].
In this study, we comprehensively investigated the four parameters of linked-read sequencing on metagenome assembly, which could be fine-tuned in either library preparation or short-read sequencing. Read depth C and microbial abundance are the two most important parameters to determine genome coverage and the number of bins annotated as draft genomes. Low-abundance microbes were almost impossible to be assembled by any of the technologies; the assemblies of medium-abundance microbes were substantially improved by deep C, and they were fairly stable for high-abundance ones.
According to our observation, C should be chosen from 120X to 400X to optimize the assembly quality. There is a tradeoff between C F and C R , where deep C F can generate more high-quality bins and C R controls total assembly length. Large μ FL enables DNA fragments spanning distant contigs, but it is unnecessary to produce extremely long fragments for microbial genomes. The repetitive sequences spread in microbial genomes are usually short (e.g.,16S:~1.5 kb, 23S:~2.9 kb), which could be resolved by assembling the co-barcoded reads with small N F/P .
Athena-meta includes four steps: (1) generate "seed" contigs using short-reads without barcodes; (2) link contigs into scaffold graph using aligned paired-end reads; (3) local assembly by recruiting co-barcoded reads that spanning both "seed" contigs; (4) pool and assemble the locally assembled sequences and "seed" contigs. We can link and interpret our observations with the corresponding strategies in Athena-meta. C is critical to construct "seed" contigs, as high-quality seed contigs are the prerequisite for local assembly using co-barcoded reads. C R and C F impact reconstruction of long DNA fragments, and the probability of two distant contigs spanned by the same fragment, respectively. Small N F/P can reduce off-target reads and make local-assembly more efficiently. Our study revealed that the probable best practice in using linked-reads for metagenome assembly is to merge the linked-reads from multiple libraries, where each has sufficient C R but a smaller amount of input DNA.

Methods
Simulate linked-reads for microbes with uneven abundance LRTK-SIM [11] was initially built for human diploid assembly by simulating 10x linked-reads. In this study, we extended it to allow genomes with uneven depth to reflect different microbial abundance ( Figure  S7). We downloaded the reference genomes (denoted as M) of 23 bacterial and 3 archaeal strains from MBARC-26 [21] and categorized them into L sim (Molarity < 10 −15 ), M sim (10 −15 < Molarity < 10 −14 ) and H sim (Molarity > 10 −14 ) ( Table S1). The molarity was normalized to sum to 1 as microbial relative abundance ( P 26 i¼1 A i ¼ 1), and C F for microbe i (C Fi ) was calculated as C Fi = C F × A i × 26 (C F was predefined). The total fragment length for microbe i (M i ) was C Fi × L i , where L i was genome size of M i . The estimated input nucleotides were calculated as P M i¼1 A i Â L i Â 26 . We simulated a wide range of C F (from 28X to 333X), C R (from 0.064X to 0.77X), μ FL (from 5 to 100 kb), and N F/P (from 10 to 160) to investigate their impact on metagenome assembly (Table S1).

DNA extraction, library preparation, and sequencing
For mock community, DNA from ATCC 20 strain staggered mix genomic material (ATCC-MSA1003) was extracted without size-selection. For human gut microbiome from stool sample, we extracted the DNA using Qiagen QiAaMP Stool Mini Kit and removed the DNA fragments below 5 kb. After that, the molecular weight of isolated DNA was assayed by pulsed-field electrophoresis. For 10x Chromium library preparation, 1 ng of isolated high molecular weight DNA was denatured according to the manufacturer recommendations, added to the reaction master mix and mixed with gel bead and emulsification oil to generate droplets within a Chromium Genome chip. The rest part of library preparation was done following the manufacturer protocol (Chromium Genome v1, PN-120229).
The two libraries were sequenced by Illumina XTen with 2 × 150 bps paired-end reads, respectively. The DNA of human gut microbiome was also prepared for standard Illumina XTen short-read sequencing.

DNA long fragment reconstruction and linked-read subsampling
Long Ranger v2.2.1 [22] was used to correct barcode base errors, calculate PCR duplication rate, and perform barcode-aware linked-read alignment. BWA-MEM v0.7.17 [23] was adopted to align short-reads and linked-reads without barcodes. Long DNA fragments were reconstructed according to the mapping coordinates of cobarcoded short-reads. The linked-reads were sorted by barcode first and then by their mapping coordinates. Long DNA fragments were reconstructed by greedy extension and terminated if the nearest co-barcoded read was > 50 kb away. Each fragment must include at least two cobarcoded read pairs and have a minimum length of 2 kb.

Conclusion
In this study, we comprehensively investigated four parameters of linked-read sequencing on metagenome assembly and compared with Illumina short-reads and PacBio CCS reads. Our study revealed that the probable best practice in using linked-reads for metagenome assembly is to merge the linked-reads from multiple libraries, where each has sufficient C R but a smaller amount of input DNA.