Estimate of the sequenced proportion of the global prokaryotic genome

Background: Sequencing prokaryotic genomes has revolutionized our understanding of the many roles played by microorganisms. However, the cell and taxon proportions of genome-sequenced bacteria or archaea on earth remain unknown. This study aimed to explore this basic question using large-scale alignment between the sequences released by the Earth Microbiome Project and 155,810 prokaryotic genomes from public databases.
Results: Our results showed that the median proportions of the genome-sequenced cells and taxa (at 100% identities in the 16S-V4 region) in different biomes reached 38.1% (16.4–86.3%) and 18.8% (9.1–52.6%), respectively. The sequenced proportions of the prokaryotic genomes in biomes were significantly negatively correlated with the alpha diversity indices, and the proportions sequenced in host-associated biomes were significantly higher than those in free-living biomes. Because the sequenced genomes are concentrated in a set of cosmopolitan OTUs that are found in multiple samples and preferentially sequenced, only 2.1% of the global prokaryotic taxa are represented by sequenced genomes. Most of the biomes were occupied by a few predominant taxa with a high relative abundance and much higher genome-sequenced proportions than numerous rare taxa.
Conclusions: These results reveal the current situation of prokaryotic genome sequencing for earth biomes, provide a basis for more reasonable and efficient exploration of prokaryotic genomes, and promote our understanding of microbial ecological functions.


Background
Prokaryotes are generally assumed to be the oldest existing form of life on earth and the primary engines of global biogeochemical processes; they are found in almost all ecosystems [1,2]. Genome sequencing provides a blueprint for the evolutionary and functional diversities of prokaryotes and improves our understanding of how they interact with one another, their hosts, and their surroundings [3][4][5]. However, what proportion of the bacterial and archaeal cells or taxa on earth have had their genomes sequenced? This basic and seemingly simple question has never been answered.
Since the first bacterial genome was completely sequenced in 1995, more than 200,000 bacterial and archaeal complete or draft genomes have been uploaded to public databases as a result of the development of sequencing technology and the decrease in costs [6,7]. Meanwhile, due to improvements in sequencing throughput and computational techniques, cultivation-independent recovery of genomes from metagenomes further promotes prokaryotic genome mining [8][9][10]. Interestingly, compared to the exponential accumulation of genomic data, the latest estimate of global prokaryotic operational taxonomic units (OTUs, 16S-V4 regions at 97% sequence identities) is only 0.8-1.6 million, far less than the trillions previously predicted [11,12]. It is necessary to globally evaluate the proportion of sequenced prokaryotic genomes in environments.
The Earth Microbiome Project (EMP) was founded in 2010 to sample and explore the Earth's microbial communities at an unprecedented scale [13][14][15]. In this study, we conducted a large-scale sequence alignment between the data released by the EMP and the sequenced bacterial or archaeal genomes in the public database. From these data, we evaluated the present situation of prokaryotic genome sequencing in the earth biomes for the first time.
The genome-sequenced proportion in the prokaryotic biome was closely related to habitat (Fig. 1 and Supplementary dataset 2). Microbial environments are divided into different environment types by the EMP. The EMP ontology (EMPO level 1) classifies microbial environments as free living and host associated, with further subdivision into 17 environment types (EMPO level 3) [13]. We found that the genome-sequenced proportions in host-associated biomes were significantly higher than those in free-living biomes. For the host-associated prokaryotic biomes (5161 samples), the median B cell (100%) was as high as 68.3% (16.9–95.2%), and the median B OTU (100%) was 40.7% (11.8–69.4%). However, for the free-living prokaryotic biomes (4839 samples), the median B cell (100%) was only 29.1% (16.2–52.0%), and the median B OTU (100%) was 13.0% (7.7–24.8%). In detail, the median B cell (100%) in plant corpus, animal corpus, and animal secretions exceeded 95%, and the median B OTU (100%) exceeded 66.7%. Comparatively, the median B cell (100%) values for plant surface, sediment (non-saline), and hypersaline samples were all less than 10%, and the median B OTU (100%) values for sediment (non-saline) and sediment (saline) samples were less than 5% (Fig. 1). For closely related strains, B cell and B OTU also showed similar variabilities among different habitats (Supplementary Figs. S1, S2). Despite significant differences, the genome-sequenced proportions were high in most of the prokaryotic biomes.
Furthermore, we found that the genome-sequenced proportion in the prokaryotic biome was significantly negatively correlated with its alpha diversity indices (Supplementary Fig. S3). For both cells and taxa, the prokaryotic biomes with low alpha diversity indices (observed OTUs, Shannon index, Chao1 index, and Faith's PD value) tended to have a higher degree of genome sequencing. For example, the Pearson correlation coefficients of B cell (100%) and B OTU (100%) with Shannon indices were −0.62 (p < 0.01) and −0.67 (p < 0.01), respectively.
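As an illustration of the correlation analysis described above, the following minimal sketch computes a Shannon index per sample and its Pearson correlation with a per-sample genome-sequenced proportion. The function names and the toy data are hypothetical and are not taken from the study; the natural logarithm is assumed for the Shannon index.

```python
import math

def shannon_index(counts):
    """Shannon index H' = -sum(p_i * ln p_i) over OTUs with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: (OTU read counts, genome-sequenced cell proportion)
toy_samples = [
    ([90, 5, 5], 0.90),         # low diversity, mostly sequenced
    ([50, 30, 20], 0.55),
    ([25, 25, 25, 25], 0.30),   # high diversity, poorly sequenced
]
shannon_values = [shannon_index(counts) for counts, _ in toy_samples]
b_cell_values = [b for _, b in toy_samples]
r = pearson_r(shannon_values, b_cell_values)
print(f"Pearson r = {r:.2f}")
```

In this toy example the coefficient is negative, which is the direction of the relationship reported in the text; the magnitude has no bearing on the published values.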

Low genome-sequenced proportions of global prokaryotic taxa
A total of 262,011 OTUs were obtained from 10,000 EMP samples through a meta-analysis. We defined the genome-sequenced proportion of all taxa (at 100%, > 98.6%, or > 97% identities in the 16S-V4 region) as P OTU and found that the P OTU (100%) of the 10,000 samples was only 2.1% (Supplementary dataset 3). The P OTU (98.6%) and P OTU (97%) values were 6.8% and 12.2%, respectively, and both were also much lower than the corresponding B cell and B OTU medians. Furthermore, we found that 75.8% of OTUs were present in two or more biome samples. The P OTU (100%) value was 0.6% for the OTUs that appeared in only a single sample (401 of 63,459 OTUs), 1.2% for those in 2 to 10 samples (1641 of 134,119 OTUs), 5.4% for those in more than 10 samples (3478 of 64,433 OTUs), 16.2% for those in more than 100 samples (1431 of 8810 OTUs), and 72.5% for those in more than 1000 samples (108 of 149 OTUs) (Fig. 2a). Notably, many prokaryotic taxa could exist in diverse environment types; approximately 21.7% of prokaryotic taxa could exist in two types of environments, and 20.2% of OTUs could exist in three or more types of environments. We found that the taxon genome-sequenced proportion also increased with its distribution extent in different environment types. The P OTU (100%) was only 0.6% for prokaryotic OTUs that existed in only one type of environment (932 of 152,229 OTUs), 14.5% for OTUs in five or more types of environments (2645 of 18,230 OTUs), 43.6% for 10 or more types of environments (904 of 2074 OTUs), and 74.6% for 14 or more types of environments (287 of 385 OTUs) (Fig. 2b). A higher genome-sequenced proportion of prokaryotic cosmopolitan OTUs led to a lower P OTU than the corresponding B OTU (Fig. 2c).
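The per-bin proportions reported above can be re-derived from the OTU counts given in the text. The short sketch below does this arithmetic; the counts are those stated in the paragraph, while the variable and function names are ours. It also checks that the three disjoint bins (1 sample, 2 to 10 samples, more than 10 samples) add up to the 262,011 OTUs and reproduce the overall value of 2.1%.

```python
def sequenced_percentage(n_sequenced, n_total):
    """Percentage of OTUs in a bin that match a sequenced genome."""
    return 100.0 * n_sequenced / n_total

# (genome-sequenced OTUs, all OTUs) per occupancy bin, as reported in the text
reported = {
    "1 sample": (401, 63_459),
    "2-10 samples": (1_641, 134_119),
    ">10 samples": (3_478, 64_433),
    ">100 samples": (1_431, 8_810),   # subset of ">10 samples"
    ">1000 samples": (108, 149),      # subset of ">100 samples"
}
for label, (n_seq, n_all) in reported.items():
    print(f"{label}: {sequenced_percentage(n_seq, n_all):.1f}%")

# The first three bins are disjoint and together cover every OTU.
disjoint = ["1 sample", "2-10 samples", ">10 samples"]
total_sequenced = sum(reported[k][0] for k in disjoint)
total_otus = sum(reported[k][1] for k in disjoint)
overall = sequenced_percentage(total_sequenced, total_otus)
print(f"all OTUs: {total_sequenced} of {total_otus} = {overall:.1f}%")
```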
Because an OTU was likely to appear in multiple samples, we evaluated the effects of sample quantity on P OTU by random sampling. Our results demonstrated that the P OTU (100%) displayed an exponential decay trend (R² = 0.992) and eventually stabilized at 2.13% ± 0.03% as the number of samples increased (Fig. 3a). Similarly, the P OTU (98.6%) and P OTU (97%) values also decreased with increasing sample size and stabilized at approximately 6.8% and 12.2%, respectively (Supplementary Fig. S4). The estimated P OTU values based on 10,000 EMP samples were therefore close to the genome-sequenced proportions in all global prokaryotic taxa. We also evaluated the changes in P OTU as the number of sequenced genomes increased from 2010 to 2019. The results showed that the P OTU (100%) increased exponentially (R² = 0.998) by sixfold over the decade. However, it was estimated that it would take at least 25 years for the P OTU (100%) to reach 95%. With the increase in sequenced genomes, the P OTU (100%) value showed an allometric increase (R² = 0.989), and we determined that the 95% P OTU (100%) value required more than 10⁹ sequenced genomes (Supplementary Fig. S5). In addition, the P OTU also differed significantly between environments. The P OTU (100%) value based on the total host-associated samples was 4.6% whereas the P OTU (100%) value for all the free-living samples was only 2.1%. The P OTU (100%) values for the animal corpus and plant corpus environments were 28.3% and 23.7%, respectively, whereas the P OTU (100%) values for sediment (non-saline), soil (non-saline), and water (non-saline) environments were only 2.3%, 2.9%, and 2.9%, respectively. P OTU (98.6%) and P OTU (97%) also showed similar patterns (Fig. 3b). Thus, despite the rapid accumulation of prokaryotic genomic information, the genome-sequenced proportion of the global prokaryotic taxa was still fairly low.
Fig. 1 [...] OTUs share 100% identities with the sequenced genomes. Based on the analysis of 10,000 EMP samples, each gray point represents a single sample. For the box plots, the middle line indicates the median, the box represents the 25th–75th percentiles, and the error bar indicates the 10th–90th percentiles of observations. Environment types were classified by EMPO; red represents host associated and green represents free living
The majority of the biomes were occupied by a few predominant taxa with high relative abundances
Our results showed that the top 1% of the prokaryotic taxa (sorted by their percentage of 16S rRNA sequences) accounted for 72.9% of the global prokaryotic biomes (Fig. 4a and Supplementary Fig. S6). These top 1% of taxa always had a high abundance in different environment types (Fig. 4b), which was similar to a recent report on global soil dominant bacteria [19]. By contrast, the rare taxa with low abundance (the total number of sequences < 10) accounted for 59.8% of the total prokaryotic taxa but only 1.2% of the global prokaryotic cells (Supplementary Fig. S7). We found that the number of samples affected the observed proportion of rare taxa to global taxa; as the number of samples increased, the ratio value increased gradually (Supplementary Fig. S8). Notably, the genome-sequenced proportion of the top 1% of prokaryotic taxa reached 38.0% whereas that of the 59.8% of prokaryotic taxa with a low abundance was only 0.6% (Fig. 4c and Supplementary Fig. S6). The genome-sequenced proportions of the top 1% of prokaryotic taxa from different environment types exceeded 12% (Fig. 4b). We further selected 1325 highly abundant and widely distributed OTUs using the following criteria: presence in at least 9 environment types and at least 100 samples, and an abundance reaching the top 1% in at least 1 type of environment (Supplementary dataset 3). These predominant taxa accounted for only 0.5% of the total OTUs but contributed to 50.3% of the global prokaryotic biomes.
The genome-sequenced proportion was fairly high in these dominant taxa, and the P OTU (100%), P OTU (98.6%), and P OTU (97%) values were 48.2%, 61.7%, and 71.3%, respectively (Supplementary Fig. S9). The majority of biomes were occupied by a few predominant taxa with high genome-sequenced proportions.
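The dominance analysis in this subsection ranks OTUs by their share of sequences and then asks two questions about the top 1%: what fraction of all sequences they hold, and what fraction of them match a sequenced genome. A minimal sketch of that calculation follows; the function name, the data layout, and the toy numbers are hypothetical and chosen only to illustrate the idea.

```python
def top_fraction_summary(otu_counts, sequenced_otus, fraction=0.01):
    """Return (share of all reads, genome-sequenced share) for the top OTUs.

    otu_counts: dict mapping OTU id -> total number of reads
    sequenced_otus: set of OTU ids that match a sequenced genome
    fraction: top fraction of OTUs to keep after ranking by read count
    """
    ranked = sorted(otu_counts, key=otu_counts.get, reverse=True)
    n_top = max(1, int(len(ranked) * fraction))
    top = ranked[:n_top]
    total_reads = sum(otu_counts.values())
    read_share = sum(otu_counts[otu] for otu in top) / total_reads
    sequenced_share = sum(1 for otu in top if otu in sequenced_otus) / n_top
    return read_share, sequenced_share

# Hypothetical toy community: 2 dominant OTUs and 198 rare ones
toy_counts = {"otu0": 5000, "otu1": 3000}
toy_counts.update({f"otu{i}": 10 for i in range(2, 200)})
toy_sequenced = {"otu0"}

read_share, sequenced_share = top_fraction_summary(toy_counts, toy_sequenced)
print(f"top 1% of OTUs hold {read_share:.1%} of reads; "
      f"{sequenced_share:.0%} of them are genome-sequenced")
```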
Culturability altered genome-sequenced preferences among prokaryotes but not environments
We estimated the P OTU values of prokaryotes at different taxonomic levels (Supplementary dataset 4), which showed that the P OTU values were obviously different among different taxa, and the P OTU value of the same taxon also differed significantly among different environment types (Supplementary Figs. S10, S11, S12, S13, S14 and S15). For example, of the 11 phyla with OTU numbers greater than 1%, the highest P OTU (100%) value was 5.7% for Actinobacteria, and the lowest P OTU (100%) value was 0.04% for Parcubacteria; the difference between them spanned more than 100-fold (Supplementary Fig. S10).
Fig. 2 High genome-sequenced proportion of prokaryotic cosmopolitan taxa. a OTUs that can exist in one or more samples. b OTUs that can exist in one or more environment types. The gray column represents the proportion of OTUs that can exist in one or more samples (environments), and the red column represents the genome-sequenced proportion of OTUs. c Lower P OTU than B OTU is caused by a high genome-sequenced proportion of cosmopolitan taxa
Due to improvements in sequencing throughput and computational techniques, cultivation-independent recovery of genomes from metagenomic data has rapidly developed. In total, 7903 bacterial and archaeal metagenome-assembled genomes (MAGs) were recovered from massive metagenomic data and were considered to derive from uncultivated strains [8]. We assessed the effect of strain culturability on the current genomic sequencing preferences using these MAGs and 155,810 cultured genomes (Supplementary dataset 5). The results showed that the genome-sequenced proportion of prokaryotes increased by 0.1% after combining these MAGs. Across environment types, the P OTU (100%) based on MAGs was highly positively correlated with that based on RefSeq (r = 0.91, p < 0.01) (Supplementary Fig. S16). Thus, similar to the RefSeq genomes, the MAGs also showed environmental differences, and the culturability of strains was not the main factor leading to these differences. For the 11 phyla with an OTU number proportion greater than 1%, there was no significant correlation between the P OTU (100%) based on the MAGs and that based on RefSeq (p > 0.05) (Supplementary Fig. S16). This indicated that the recovered MAGs also differed in genome-sequenced proportion among prokaryotic taxa, but that their taxonomic preferences differed from those of the RefSeq genomes.

Discussion
The genome is the basic resource for understanding the physiology, ecology, and evolution of prokaryotes. More than 200,000 bacterial and archaeal genomes are now available from over two decades of development [3,6]. These genomes provide important insights into, for example, the role of microorganisms in industrial processes and the mechanisms of pathogenic microorganisms. In this study, we assessed the genome-sequenced proportion of global prokaryotes. We found that the median proportions of the genome-sequenced prokaryotic cells and taxa (at 100% identities in the 16S-V4 region) in global biomes were 38.1% (16.4–86.3%) and 18.8% (9.1–52.6%), respectively. After closely related strains were combined, the B cell (97%) reached 50% in 61.9% of the samples, and the B OTU (97%) reached 50% in 38.4% of the samples. In addition, the median B cell (97%) and B OTU (100%) values in host-associated biomes were 85.6% (43.2–98.0%) and 62.8% (9.8–82.3%), respectively, which were significantly higher than those in free-living biomes. Thus, the genetic information of a specific prokaryotic biome may have been reported to a considerable degree.
Fig. 3 Genome-sequenced proportion of prokaryotic taxa from global or different environment types. a As the number of samples increases, the P OTU (100%) shows an exponential declining trend and finally stabilizes at 2.1%. A random selection of 1000, 2000…, 9000 samples was performed 10 times for each group to calculate the mean value and standard deviation. b Significant difference of P OTU among environment types. The red point is P OTU (100%), the blue point is P OTU (98.6%), and the orange point is P OTU (97%)
However, compared to prokaryotic biomes, the genome-sequenced proportion of global prokaryotic OTUs was fairly low. Our results suggest that only 2.1% of the global prokaryotic taxa (at 100% identities in the 16S-V4 region) have been sequenced. More than 75% of prokaryotic OTUs could exist in multiple biomes; the more types of environments in which prokaryotic OTUs can survive, the higher the genome-sequenced proportion could be. Prokaryotic biomes are usually composed of a few predominant taxa with a high abundance and many rare taxa with a low abundance [20,21]. We found that 0.5% of predominant OTUs occupied 50.3% of prokaryotic cell abundance with a high genome-sequenced proportion (48.2%); however, the 60% of rare OTUs only accounted for 1.2% of the global prokaryotic cells with a low genome-sequenced proportion (0.6%). A large number of rare taxa are considered to be critical components of the earth's ecosystem and contain a large pool of functional genes [21,22]. Therefore, from this perspective, our current understanding of global prokaryotic genomic information remains very limited due to the large number of genome-unsequenced rare taxa, and the exploration of this huge genetic resource is just beginning.
Fig. 4 High genome-sequenced proportion of prokaryotic taxa with high abundance. a The top 1% of the prokaryotic taxa account for 72.9% of the global prokaryotic biomes. b The top 1% of the prokaryotic taxa from different environment types accounted for more than 40% with a genome-sequenced proportion greater than 10%. The gray column represents the cellular proportion of the top 1% of the taxa, and the red column represents the P OTU (100%). c High genome-sequenced proportion of the top 1%. The red line is P OTU (100%), the blue line is P OTU (98.6%), and the orange line is P OTU (97%)
Predominant taxa are considered priority targets for isolation, cultivation, and genome sequencing [19]. We identified 1325 predominant OTUs with a wide distribution, high abundance, and adaptability to a variety of environmental types, more than half of which had not been genome-sequenced. In particular, some predominant taxa have received less attention in specific environmental types. For example, the top 1% taxa of abundance in plant surfaces and animal surfaces accounted for 81.0% and 79.2% of the global prokaryotic biomes whereas the genome-sequenced proportions of the taxa were only 13.7% and 79.0%, respectively. The P OTU (100%) of plant surfaces (leaf or kelp surface biofilms) was ranked 8th, but its median B cell was ranked last, given the lack of understanding of its predominant taxa.
Currently, most of the prokaryotic sequenced genomes (RefSeq genomes) are from pure cultures, while MAGs are not limited by culturability [8,9,23]. We found similar genome-sequenced differences among different environment types between RefSeq genomes and MAGs, which indicated that the current imbalance of prokaryotic genome sequencing in different environments was more likely due to differences in researchers' attention than to prokaryotic culturability. The significant differences in genome-sequenced proportions among taxa between RefSeq genomes and MAGs suggested that the sequencing preferences caused by culturability did not carry over to MAGs; however, MAGs also had taxon-level sequencing preferences of their own.
The paradigm that only 1% of prokaryotes are culturable has had a profound impact on microbial ecology but has recently been debated [24][25][26]. Since the RefSeq genomes are mainly from culturable taxa, and a significant proportion of culturable taxa have not been sequenced, we estimate that the culturability rate of global prokaryotic taxa (> 97% identities) would be higher than the genome-sequenced proportion of 12.2%. Similar to the higher genome-sequenced proportion of the highly abundant predominant taxa, predominant taxa should also have a much higher culturability rate than rare taxa; thus, the culturability rate of prokaryotic cells will be much higher than that of taxa. Consequently, our data indicate that the paradigm that only 1% of prokaryotes are culturable is out of date, both for cells and taxa.

Conclusions
This study performed an in-depth analysis of the prokaryotic genome-sequenced proportion in the EMP and comprehensively showed the global-scale extent of genome sequencing for various environment types and different species. Most of the biomes were occupied by a few widespread predominant taxa. Given the high genome-sequenced proportion of predominant taxa, the genetic information of most prokaryotic biomes has been revealed to a high degree. However, due to the large number of rare taxa with unknown genomes, our current understanding of the global prokaryotic genome information remains limited. These results will be helpful for more reasonable and efficient explorations of prokaryotic genomes and will accelerate the comprehensive understanding of microbial ecological functions in different environments.

Data collection from EMP and RefSeq
The Earth Microbiome Project (EMP) was founded in 2010 to sample the Earth's microbial communities at an unprecedented scale to advance our understanding of the organizing biogeographic principles that govern microbial community structure on Earth [13][14][15]. A total of 262,011 OTUs, together with their abundance and nucleic acid sequence information, were collected from the website (ftp://ftp.microbio.me/emp/release1); these had been obtained and shared by the EMP from 10,000 samples using the Deblur software [27]. We relied on the chimera filtering performed by the EMP. The NCBI's reference sequence (RefSeq) database is a curated non-redundant collection of sequences representing whole or frame genomes [28]. We obtained all of the 155,810 bacterial or archaeal genomes collected by the database before July 2019. In addition, 7903 metagenome-assembled genomes (MAGs) [8] (1539 of which contained the 16S rRNA gene), recovered from > 1500 public metagenomes using MetaBAT [29], were also collected to represent uncultivated bacteria and archaea.

Sequence alignment and analysis
Alignment between the EMP OTUs and 155,810 or 7903 genomes was performed using BLASTn (E value < 1e-5) [30]. To assess the adequacy of the OTUs, we analyzed all the samples by increasing the number of samples from 1000 to 10,000 randomly. The genome-sequenced proportions of cells and taxa (at 100%, > 98.6%, or > 97% identities in the 16S-V4 region) in a specific prokaryotic biome were defined as B cell and B OTU, respectively. The genome-sequenced proportion of taxa (at 100%, > 98.6%, or > 97% identities in the 16S-V4 region) from subgroup or global biomes was defined as the P OTU. The 100% identity represents the most rigorous and accurate match, while 98.6% and 97% identities are the new and traditional criteria for species definitions, respectively [16][17][18]. Briefly, B cell represents the ratio of the genome-sequenced sequences in a single sample, B OTU represents the ratio of the genome-sequenced OTUs in a single sample, and P OTU represents the ratio of the genome-sequenced OTUs in multiple samples.
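The three definitions above can be made concrete with a small sketch. It assumes, hypothetically, that the alignment step has already been reduced to one best percent identity per OTU and that abundances are stored as per-sample read counts; the data layout, function names, and toy values are ours and are not the study's pipeline. A seeded random subsample is included to mirror the random increase in sample number described above.

```python
import random

def is_sequenced(identity, threshold):
    """100% requires an exact match; lower thresholds are strict (> threshold)."""
    return identity >= 100.0 if threshold >= 100.0 else identity > threshold

def b_cell(counts, best_identity, threshold=100.0):
    """Share of reads in one sample that belong to genome-sequenced OTUs."""
    hit = sum(c for otu, c in counts.items()
              if is_sequenced(best_identity.get(otu, 0.0), threshold))
    return hit / sum(counts.values())

def b_otu(counts, best_identity, threshold=100.0):
    """Share of OTUs present in one sample that are genome-sequenced."""
    present = [otu for otu, c in counts.items() if c > 0]
    hit = sum(is_sequenced(best_identity.get(otu, 0.0), threshold) for otu in present)
    return hit / len(present)

def p_otu(samples, best_identity, threshold=100.0):
    """Share of genome-sequenced OTUs among all OTUs pooled over samples."""
    pooled = {otu for counts in samples.values() for otu, c in counts.items() if c > 0}
    hit = sum(is_sequenced(best_identity.get(otu, 0.0), threshold) for otu in pooled)
    return hit / len(pooled)

def subsampled_p_otu(samples, best_identity, n_samples, threshold=100.0, seed=0):
    """P OTU computed on a seeded random subset of the samples."""
    chosen = random.Random(seed).sample(sorted(samples), n_samples)
    return p_otu({s: samples[s] for s in chosen}, best_identity, threshold)

# Hypothetical toy data: OTU "A" is cosmopolitan and genome-sequenced.
samples = {
    "s1": {"A": 80, "B": 15, "C": 5},
    "s2": {"A": 60, "D": 30, "E": 10},
}
best_identity = {"A": 100.0, "B": 98.8, "C": 90.0, "D": 97.5}  # "E" has no hit

print(b_cell(samples["s1"], best_identity))   # reads of A over all reads in s1
print(b_otu(samples["s1"], best_identity))    # 1 of 3 OTUs in s1
print(p_otu(samples, best_identity))          # 1 of 5 pooled OTUs
print(p_otu(samples, best_identity, 97.0))    # A, B, D of 5 pooled OTUs
```

In this toy example the pooled P OTU (1 of 5) is lower than the per-sample B OTU (1 of 3 in each sample) because the only sequenced OTU is shared by both samples, which is the same effect the text attributes to cosmopolitan OTUs.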

Taxonomic analysis of EMP OTU
The taxonomy of each OTU was analyzed by the Ribosomal Database Project (RDP) Classifier [31] at a 70% confidence threshold. The EMP ontology (EMPO) classified 17 microbial environments (level 3) as free living or host associated (level 1) and saline or non-saline (if free living) or animal or plant (if host associated) (level 2) [13]. Based on the taxonomic results and the EMPO (level 3) for each OTU, we calculated the composition and relative abundance of different levels of taxonomy (phylum, class, order, family, and genus) in different environments.
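A minimal sketch of the final aggregation step is given below. It assumes, hypothetically, that each OTU already carries a phylum assignment with a classifier confidence and that each sample carries an EMPO level 3 label; assignments below the 70% threshold are grouped as "unclassified". Pooling reads over all samples of an environment, rather than averaging per-sample proportions, is our assumption, and all names and values are illustrative.

```python
from collections import defaultdict

def phylum_relative_abundance(samples, sample_env, otu_taxonomy, min_confidence=0.7):
    """Relative abundance of each phylum within each environment type.

    samples: dict sample id -> {OTU id: read count}
    sample_env: dict sample id -> environment label (e.g. EMPO level 3)
    otu_taxonomy: dict OTU id -> (phylum, classifier confidence in [0, 1])
    """
    totals = defaultdict(lambda: defaultdict(int))
    for sample_id, counts in samples.items():
        env = sample_env[sample_id]
        for otu, count in counts.items():
            phylum, confidence = otu_taxonomy.get(otu, ("unclassified", 0.0))
            if confidence < min_confidence:
                phylum = "unclassified"
            totals[env][phylum] += count
    return {
        env: {phylum: n / sum(by_phylum.values()) for phylum, n in by_phylum.items()}
        for env, by_phylum in totals.items()
    }

# Hypothetical toy data
toy_samples = {"s1": {"A": 70, "B": 30}, "s2": {"A": 50, "C": 50}}
toy_env = {"s1": "Soil (non-saline)", "s2": "Soil (non-saline)"}
toy_taxonomy = {
    "A": ("Proteobacteria", 0.95),
    "B": ("Actinobacteria", 0.90),
    "C": ("Firmicutes", 0.50),   # below the threshold, counted as unclassified
}
composition = phylum_relative_abundance(toy_samples, toy_env, toy_taxonomy)
print(composition["Soil (non-saline)"])
```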