The allometry of cellular DNA and ribosomal gene content among microbes and its use for the assessment of microbiome community structure

Background The determination of taxon-specific composition of microbiomes by combining high-throughput sequencing of ribosomal genes with phyloinformatic analyses has become routine in microbiology and allied sciences. Systematic biases to this approach based on the demonstrable variability of ribosomal operon copy number per genome were recognized early. The more recent realization that polyploidy is probably the norm, rather than the exception, among microbes from all domains of life, points to an even larger source bias. Results We found that the number of 16S or 18S RNA genes per cell, a combined result of the number of RNA gene loci per genome and ploidy level, follows an allometric power law of cell volume with an exponent of 2/3 across 6 orders of magnitude in small subunit copy number per cell and 9 orders of magnitude in cell size. This stands in contrast to cell DNA content, which follows a power law with an exponent of ¾. Conclusion In practical terms, that relationship allows for a single, simple correction for variations in both copy number per genome and ploidy level in ribosomal gene analyses of taxa-specific abundance. In biological terms, it points to the uniqueness of ribosomal gene content among microbial properties that scale with size. Video Abstract Supplementary Information The online version contains supplementary material available at 10.1186/s40168-021-01111-z.


Background
The rRNA gene approach to microbiome analyses, either based on amplicon or metagenomic sequencing, relies on the tacit assumption that the counts of this marker gene translate into a robust measure or proxy for microbial abundance. However, this assumption is often violated. Sources of error in gene abundance determination can come from analytical procedures such as DNA extraction, PCR amplification, and sequencing itself [1]. But likely as important, systematic biases can be caused by the varying abundance of ribosomal genes in the genomes of microbes [2]. The concern is evident in the dedicated databases that document the variability in ribosomal gene copy number per genome (R g ) among microbes [3]. Interestingly, R g seems to correlate with a microbe's life history traits, where fast growth is associated with higher values [4][5][6]. There is also evidence for a certain degree of conservation in R g within bacterial phylogenetic clades [7]. On this basis, bioinformatic tools have been developed to automatically correct ribosomal gene surveys for R g [8]. The phylogenetic conservation of R g , however, seems only conspicuous among closely related microbes [9] and can explain only some 10% of its variability in complex, diverse communities [10]. In some eukaryotes like Saccharomyces cerevisae, R g is unstable and can vary widely among strains or individuals [11]. Importantly, Open Access *Correspondence: ferran@asu.edu Center for Fundamental and Applied Microbiomics and School of Life Sciences, Arizona State University, Tempe, USA such corrections would only lead us to a description of community composition in terms of relative abundance of taxon-specific genome copies. But more useful metrics in microbiome community composition analyses are either cell number [7] (i.e., individuals) or biomass contributions by each taxon. Given the close to 9 orders of magnitude spanned by microbial cell biomass, it can be argued that taxon-specific biomass rather than cell number would be a better descriptor of a taxon's contribution to a community. However, there are still instances where number of cells would be preferred (for example, to gauge dispersal potential, culturability, or susceptibility to deleterious agents like predators or toxicants). In any case, to translate genome numbers to cell numbers, one needs to take into account the level of ploidy, P, the number of copies of the genome present in a cell, where the number of ribosomal operons per cell, R c , is the product PR g . Surprisingly, P is not typically taken into account, perhaps under the assumption that most microbes, like Escherichia coli, are monoploid [12,13]. And yet, in bacteria and archaea, P varies far more than R g [12,14], and most species examined are oligo-or polyploid, with some containing in excess of 200 genomes copies per cell [15]. If one includes unicellular eukaryotes, the variation can be 4 orders of magnitude [16]. Clearly, ploidy constitutes a very important source of bias for community counts in itself [17], affecting estimates from both amplicon sequencing and shot-gun metagenomics. The variable nature of P could potentially either compound or diminish the effect of R g variability in determining a cell's R c , as it is not known whether P and R g correlate or vary independently among species; a high P could be associated with low R g , and vice-versa. Studies on marine protists intended to estimate biomass from 18S counts have shown that R c correlated linearly with cell volume (V c ) [18] or cell length [19] when plotted on double log scales, indicating an R c dependence on size.
Here, we posited that perhaps there is constancy among microbes in the need for ribosomal gene content in relation to their cell biomass. In other words, microbial species would be under selection to contain a sufficient but not excessive R c to support the production of their typical cell biomass, B c , so that R c would be proportional to B c . Assuming cell density to be invariant (around 1.008 g ml −1 ) [20], R c would also be proportional to cell volume (V c ).

Dataset
Values for all parameters were gathered or derived from the literature. In place of B c , we used cellular volume, V c , assuming cellular density to be constant (around 1.008 g ml −1 [20]). Cell volumes were either taken from reported direct determinations or derived from literature photomicrographs assuming simple formulae for a variety of fitting three-dimensional shapes (i.e., sphere, cylinder) or combinations thereof as given in Table S1 (see Additional file 1). When a range of volume values was available, we used the average. For R g , we used values given in rrnDB [3] for the same species or strain. If they were not available, we used literature values or determined it by examination of the strain's publicly available genome through BLAST. Ploidy was either taken directly from reported values or estimated if cellular DNA content and genome size were known. If P was variable within a species or strain, we used the average level of the range given. R c values were then derived as the product of P and R g , although for many protists, R c was taken directly from experimentally determined values. The annotated input data are gathered in Table S1 (see Additional file 1). The limiting factor to the size of the database was the availability of P determinations, which are quite uncommon. In all, we could analyze 107 cases.

Statistics
Power fits of data were run in Excel as linear regressions of the ln-transformed data pairs using a least-squares model. Statistics are given in Table S2 (see Additional file 2). To test the significance of exponent differences in two separate datasets, we used T tests for the slopes of the linear fits.

Estimation of taxon-specific cell numbers and biovolumes from 16S rRNA counts
In a dataset of rRNA gene taxon-specific frequencies, F r , assigned to i taxa whose cell volumes, V c (i), are known, one can directly estimate R c (i) from Eq. 1 (see the "Results" section). The relative contribution to number of cells by taxon i, F c (i), is computed as: And the relative contribution to biovolume, F v (i), as: If a determination of the absolute abundance of the total copies of the ribosomal gene for all taxa considered in the sample of origin, R s , is available (from qPCR, for example, in units of copies per mass, volume or surface sampled), then absolute taxon-specific assignments R(i) can be obtained as the product F r (i)Rs(i). From R(i), one can derive absolute values for cells C(i) and biovolume V(i) attributable to each taxon: absolute number of cells or biovolume (in µm 3 ), respectively, of the entire set of taxa under consideration. An alternative to using V c (i), if those are not exactly known, is to assign rough discrete size ranges to taxa, and to use mean V c (and R c ) values of the range's maximum and minimum. We found it advisable to set variable-width size ranges in such a way that within-range variation in resulting R c values was kept moderate. We used the following cell diameter ranges (in µm): 0.2-0.3, 0.3-0.4, 0.4-0.6, 0.6-0.9, 0.9-1.2, 1.2-1.5, 1.5-2.1, 2.1-2.9, 2.9-4.1, 4.1-5.8, 5.8-8.2, 8.2-11.6, and 11.6-16.4. This set provides within-range variation in R c of less than 8% in all cases, which is smaller than the uncertainty of our estimates for the normalization constant in Eq. 1 of the "Results" section.

Results
Traits that span orders of magnitude are best evaluated as double logarithmic plots, which can be analyzed by power function fits. In this approach, the hypothesis of proportionality between V c and R c we posed should have resulted in a power function fit with an exponent close to unity. Our analysis (Fig. 1) readily dispelled that contention. The fit instead revealed that R c follows well (R 2 = 0.86) a power function of V c with an exponent significantly lower than unity, and indistinguishable from 2/3 (0.66 ± 0.03; ± SE) across nine orders of magnitude in cell volume. For volumes expressed in µm 3 , where 9.58 ± 1.21 is the estimated normalization constant. One could envision that the scaling relationship may have been artifactually distorted at the low range of R c , since it cannot physiologically take values < 1. But a reanalysis of the dataset excluding data pairs with R c ≤ 2 did not change the fit significantly in exponent or normalization constant (see Additional file 2). We also tested the hypothesis that exponents for a fit of data pairs from eukaryotes (exponent = 0.72 ± 0.05) vs. prokaryotes (0.62 ± 0.05) could be different, but this did not find strong statistical support in a T test comparison (p = 0.20).
Equation 1 can be rewritten as a function of linear cell dimensions using a spherical-equivalent cell diameter, Thus, R c scales generally not with the volume but with the surface area of a microbial cell, which for the purpose of this study means that the bias associated with ribosomal gene counts will be size-dependent regardless of our choice of abundance estimator. Ribosomal counts  Table S2, Additional file 2). Data points belonging to eukaryotes are in orange, those for archaea in yellow, and bacteria in green. For three species, we plotted datasets to highlight intraspecies variability: Synechococcus elongatus (light blue symbols) [28], Colozoum pelagicum (light purple) [19], and Sphaerozoum fuscum [19] (light yellow) will overestimate large-celled microbes over smallcelled ones if one is interested in number of cells, the bias increasing with the square of linear cell size (Eq. 2), a prediction that finds experimental support for specific cases in the literature [21]. In terms of biomass, ribosomal counts will underestimate the contribution of large microorganisms, the bias increasing with the 2/3 power of cellular biovolume (Eq. 1). Whichever the desired measure of abundance, however, the explicit relationship in Fig. 1 provides a means for bias correction in tallies of ribosomal genes, as long as cell biovolume is known from ancillary data for the taxa detected in the microbiome of interest. The correction requires knowledge of neither P nor R g .
A procedural explanation is given under the "Methods" section, and we provide an example application in Fig. 2 using a dataset of phototrophic bacteria from endolithic microbiomes within intertidal hard carbonate rocks [22], responsible for their micritization and bioerosion [23], and useful here because typical cell volumes could be assigned to all taxa. The differential outcomes are obvious: 16S rRNA counts of large-celled cyanobacterial genera severely underestimate their contribution to biomass but overestimate their contribution in terms of number of cells (see for example, Hyella sp.). The opposite is true for alphaproteobacterial phototrophs (see for example Rhodomicrobium sp.), most of which are small-celled [24]. The distortion is less intense for the Chloroflexi, with intermediate cell size (see Roseiflexus castenholzii, for example).
We have presented the issue of bias having in mind relative abundance tallies of microbiome members, but proportional tallies have methodological constraints in themselves, because the individual proportions must add up to 1, and thus the relative abundances of taxa are necessarily not independent of each other. There is clear evidence of severely diverging analytical outcomes when both relative and absolute abundance are compared in the same datasets [25,26]. Commonly, relative proportions or taxa-specific ribosomal copies are converted to absolute abundances with parallel quantification of rRNA gene copies by qPCR, either total copies in the community analyzed or those of particular taxa [16]. We note here that, in view of our results, the latter would require allometric correction, whereas the former would not (as done in the dataset presented in Fig. 2) and is thus a Fig. 2 Estimation of microbial community structure based on experimental ribosomal counts (central column), estimated cell number (left column) and estimated biovolume (right column) in a single, exemplary dataset using allometric corrections based on Eq. 1. The dataset is from Roush et al. [22] and includes the subset of taxonomically assignable phototrophic bacteria from an endolithic microbiome on coastal marine carbonate rocks. Only three exemplary phototrophs are labeled, but full, taxonomically explicit distributional data are in Table S3 (see Additional file 3). For ease of comparison, results are graphically presented as relative frequencies, but absolute scales of areal abundance are indicated on the arrow to the right preferable approach. However, we also note that the total number of ribosomal gene copies in a sample is not a good absolute measure of the combined microbial biomass or number of cells present for comparisons among samples, as it will be dependent on their inherent cellsize distribution. Hence, comparisons among samples will only be meaningful if carried out after conversion to biomass or cell numbers, unless the microbial composition of the samples is unchanged.

Discussion
The procedure outlined here requires knowledge of morphological metadata in addition to sequencing counts for each taxon. Unfortunately, cell volume data are not readily available for many taxa, at least in a compiled format, and requires intensive literature searches. In its absence, and as an approximation, using a few discrete cell-size classes instead of exact values yields useful corrected distributions (see Figure S1 in Additional file 4). Yet, an effort to bring microbial size data into a consolidated platform would be desirable in that it would enable the processing of large datasets in an automated, more manageable way.
An additional factor to take into account is the substantial data spread around the fit leading to Eq. 1, which can limit the precision of the correction. An expanded dataset should improve predictive accuracy and perhaps even precision, but some inherent limitations are also at play. P can vary in a single strain with cell cycle [15] and growth conditions [27]. We have included the range of intraspecies variability on the V c /R c space in Fig. 1 for the cases of a single strain of Synechococcus elongatus, and of single cells from natural populations of Sphaerozoum fuscum and Colozoum pelagicum. They suggest that a significant proportion of spread can be attributed to biological intraspecies variability, tempering the prospects for improvement with eventually extended datasets. Studies on Synechococcus elongatus [28][29][30] and Saccharomyces cerevisae [31] point to a regulatory interdependency of P with cell size, indicating that natural variations in ploidy may be met by commensurate variation in volume, making this much less of a problem. Additionally, because the data used here were arrived at through several approaches, a dedicated survey based on more consistent analytical procedures may result in tighter fits. Finally, part of the variability detected may have been due to neglecting contributions of organelle ribosomal genes in protists. This is expected to be negligible for largecelled eukaryotes, but perhaps not so much for the smallest of them, in which organelles take up a larger portion of their cell volume. Indeed, some of these pico-eukaryotes contribute disproportionately (by defect in Rc) to the regression's sum of squares and may have contributed to the somewhat higher exponent in the eukaryote-only fit (Additional file 2). In support of this contention, a reanalysis excluding eukaryotes with V c < 20 µm 3 yields an exponent (0.66 ± 0.06; R 2 = 0.68), more in line with that of Eq. 1, showing no trace of statistical difference (T test, p = 0.68) with that of the prokaryote-only fit (Additional file 2). While the dataset does not allow us to differentiate between bacteria and archaea because of the low number of archaeal cases in it, given the substantial biological differentiation between bacteria and archaea, it may be an interesting future exercise.
The preceding discussion on uncertainty in the correction approach should not be taken as grounds for inaction, given that the range of variation in V c far exceeds that traceable to deviations from the fit, not only among microbes at large, but also in specific settings, and the spectra of microbial size distribution in microbiomes seems to be dynamic. For example, the range of V c of typical bacterioplankton (excluding phototrophs) in seawater spans 3 orders of magnitude and its spectrum can be modified significantly by factors like grazing [32]. Considering photosynthetic plankton would likely add another 4-5 orders of magnitude in V c , and the size spectrum of this group is also affected by environmental parameters [33]. In the human gut microbiome, our initial assessments show that microbiome typical bacteria span over at least 4 orders of magnitude in volume.
Beyond the pragmatic uses for community composition corrections, we see it as unlikely that the apparent scaling relationship with cell surface area has no biological meaning. It is tempting to speculate that R c scales with size to maintain an increasing protein content need. Indeed, protein content scales as a function of cell volume with a similar exponent of 0.70 ± 0.06 (R 2 = 0.87; 95% CI 0.64-0.75) [34]. Because the CI of the exponent for protein content per cell and that for R c in the fit of Fig. 1 [0.72 and 0.61; see (Additional file 2)] overlap, the possibility of a connection to cellular need for proteins cannot be rejected solely on this basis. Indeed, in Synechococcus elongatus in the laboratory, protein content and P (hence also R c ) strongly co-vary with cell volume [30].
Alternatively, and perhaps more trivially, the scaling relationship of R c with V c may simply be a reflection of the size scaling of DNA content per cell. In other words, ribosomal genes would follow the trends of DNA content as a whole, just like any other universal gene would. The allometric relationship between DNA content and cell size, however, has not been addressed in the literature or has been addressed incorrectly by neglecting ploidy [34,35]. We know that genome size scales among bacteria with reported exponents between 0.21 (R 2 = 0.60) [34] and 0.35 (R 2 = 0.45) [36]. In our dataset, which includes eukaryotes, it does so with an exponent of 0.18 (R 2 = 0.34; see Additional file 5). Even when these coefficients of correlation are rather poor, genome size clearly increases much more weakly with V c than does R c . Again, this does not take into account P variations to yield actual DNA content per cell; it is the size of one copy of the genome. A portion of our dataset can be used to explore the scaling of DNA content per cell for prokaryotes (n = 60). To this subset, we can add the measured or slightly derived values reported by Shuter et al. [35] (n = 39), excluding those that relied on assumptions of monoploidy. This combined set yields a power scaling fit with R 2 = 0.89 and estimated exponent of ¾ (0.75 ± 0.03; Fig. 3).
The difference in scaling exponent between genome size and cell DNA content (0.18-035 vs. 0.75) gauges the importance of P. In fact, in our dataset, P seems to scale with V c as a power law with an exponent of 0.54 (R 2 = 0.69; Additional file 6). This is consistent with the fact that the product of genome size and ploidy yields the cell DNA content, as the exponents of the multipliers (0.18 and 0.54, respectively) roughly add up to the estimated exponent of the product (0.75). That the exponents for DNA content per cell (3/4) and R c (2/3) are significantly different (T test, p = 0.02), speaks for respective mechanistic drivers that are fundamentally decoupled. In fact, most known allometric laws found in nature scale with exponents that are simple multiples of 1/4 [37]. It would seem that ribosomal genes are, in that sense, unique.

Conclusions
The results presented here uncover surprising basic rules on the composition of microbes, rules that ties them all together, and that far from being self-evident, pose an intellectual challenge to elucidate. In practical terms, this discovery also provides a rather simple approach to deal with biases affecting the use of current omics methodologies for the assessment of microbiome composition, which, given their extensive use in many areas of microbiology and allied sciences, has a large potential for applicability.

Abbreviations
R g : Ribosomal gene copy number per genome; R c : Ribosomal gene copy number per cell; R s : Total copies of the ribosomal gene for all taxa considered in a sample; V c : Cell volume; B c : Cell biomass; D c : Cell diameter; P: Ploidy.