
Table 2 Number of sequences at each stage of the RefSeq benchmarking workflow

From: Gauge your phage: benchmarking of bacteriophage identification tools in metagenomic sequencing data

| Dataset | Chromosome (n) | Plasmid (n) | Phage (n) | Chromosome (bp) | Plasmid (bp) | Phage (bp) |
|---|---|---|---|---|---|---|
| 1. Post-2020 | 7400 | 9960 | 1849 | 30,034,515,475 | 1,009,997,498 | 128,660,045 |
| 2. Post-2020 dereplicated | 3546 | 4453 | 901 | 14,845,391,115 | 480,796,703 | 62,387,054 |
| 3. Subsampled | 253 | 309 | 901 | 1,011,740,231 | 28,782,822 | 62,387,054 |
| 4. Prophage removed | 2307 (2088) | 400 (91) | 901 | 965,959,872 | 27,323,488 | 62,387,054 |
| 5. pVOG removed | 2065 | 313 | 901 | 848,361,160 | 24,433,971 | 62,387,054 |
| 6. Fragmented artificial contigs | 104,003 | 2754 | 6664 | 830,889,456 | 21,783,942 | 53,426,665 |

Columns labelled (n) give the number of sequences at each step; columns labelled (bp) give the number of base pairs at each step. Steps are numbered as follows:
  1. Sequences downloaded from RefSeq that were deposited between 1 January 2020 and 12 August 2021
  2. Sequences from (1) that were then dereplicated with RefSeq sequences deposited before 1 January 2020 and the training sets for DeepVirFinder, Seeker, VIBRANT, VirFinder, and VirSorter2
  3. Host sequences (chromosomes and plasmids) from (2), subsampled by a factor of 14.3
  4. Host sequences from (3) after prophage removal using Phigaro and PhageBoost. The number in parentheses is the number of prophages removed.
  5. Host sequences from (4) after removing sequences in which ≥ 30% of open reading frames have hits to the pVOG database
  6. All sequences from (5), randomly and uniformly fragmented to sizes between 1 and 15 kbp for use in the benchmarking study
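
The fragmentation in step 6 can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `fragment_sequence`, the left-to-right cutting scheme, and the choice to discard a trailing remainder shorter than 1 kbp are all assumptions made for illustration, since the table notes state only that fragment sizes were drawn randomly and uniformly between 1 and 15 kbp.

```python
import random


def fragment_sequence(seq, min_len=1_000, max_len=15_000, rng=None):
    """Cut `seq` left to right into artificial contigs.

    Each fragment length is drawn uniformly from [min_len, max_len].
    Assumption (not stated in the source): a trailing remainder shorter
    than `min_len` is discarded rather than merged into the last fragment.
    """
    rng = rng or random.Random()
    fragments = []
    pos = 0
    # Keep cutting while at least one minimum-length fragment still fits.
    while len(seq) - pos >= min_len:
        size = rng.randint(min_len, max_len)  # inclusive on both ends
        size = min(size, len(seq) - pos)      # last piece may be capped by the sequence end
        fragments.append(seq[pos:pos + size])
        pos += size
    return fragments


if __name__ == "__main__":
    # Hypothetical 100 kbp toy sequence; a fixed seed makes the run repeatable.
    toy = "ACGT" * 25_000
    contigs = fragment_sequence(toy, rng=random.Random(0))
    print(len(contigs), "fragments,", sum(len(c) for c in contigs), "bp retained")
```

Under this scheme every fragment is between 1 and 15 kbp long, and fewer than 1000 bp are lost per input sequence. Other remainder policies, such as appending the tail to the final fragment, would be equally consistent with the description in the table notes.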