Hill-based dissimilarity indices and null models for analysis of microbial community assembly

Background High-throughput amplicon sequencing of marker genes, such as the 16S rRNA gene in Bacteria and Archaea, provides a wealth of information about the composition of microbial communities. To quantify differences between samples and draw conclusions about factors affecting community assembly, dissimilarity indices are typically used. However, results are subject to several biases, and data interpretation can be challenging. The Jaccard and Bray-Curtis indices, which are often used to quantify taxonomic dissimilarity, are not necessarily the most logical choices. Instead, we argue that Hill-based indices, which make it possible to systematically investigate the impact of relative abundance on dissimilarity, should be used for robust analysis of data. In combination with a null model, mechanisms of microbial community assembly can be analyzed. Here, we also introduce a new software, qdiv, which enables rapid calculations of Hill-based dissimilarity indices in combination with null models. Results Using amplicon sequencing data from two experimental systems, aerobic granular sludge (AGS) reactors and microbial fuel cells (MFC), we show that the choice of dissimilarity index can have considerable impact on results and conclusions. High dissimilarity between replicates because of random sampling effects make incidence-based indices less suited for identifying differences between groups of samples. Determining a consensus table based on count tables generated with different bioinformatic pipelines reduced the number of low-abundant, potentially spurious amplicon sequence variants (ASVs) in the data sets, which led to lower dissimilarity between replicates. Analysis with a combination of Hill-based indices and a null model allowed us to show that different ecological mechanisms acted on different fractions of the microbial communities in the experimental systems. Conclusions Hill-based indices provide a rational framework for analysis of dissimilarity between microbial community samples. In combination with a null model, the effects of deterministic and stochastic community assembly factors on taxa of different relative abundances can be systematically investigated. Calculations of Hill-based dissimilarity indices in combination with a null model can be done in qdiv, which is freely available as a Python package (https://github.com/omvatten/qdiv). In qdiv, a consensus table can also be determined from several count tables generated with different bioinformatic pipelines. Video Abstract

= ∑ /( ) (Eq. S1.3a, if q≠1) = − ∑ • ( ) (Eq. S1.3b, if q=1) D is the Hill diversity number, q is the diversity order, S is the total number of species, and pi is the relative abundance of the i th species in the sample.
Eq. S1.4a-b show how the Hill numbers for a pair of samples are partitioned into alpha (α), beta (β), and gamma (γ) components. •ln ,2 (Eq. S1.4b, if q=1) q Dβ is the beta diversity; q Dγ is gamma diversity, i.e. the Hill number for the two pooled samples; q Dα is a mean of the alpha diversity in the two samples (see explanation below); and pi,1 and pi,2 refer to the relative abundances of the i th species in sample 1 and 2, respectively.
Jost [3] showed how diversity could be decomposed into α, β, and γ components for multiple samples. In Eq. S1.4a-b, we only show the equations for a pair of samples; however, the equations are derived from equations 9, 13, and 14 in reference [3]. The α component, which is a mean α-diversity of the two samples, is well described by Chao and Chiu [4] (see equation 5b in reference [4]). They call it "mean effective numbers of species-by-community combinations … per community" [4]. An example of how the calculation is done is shown in Table S1.1. The γ-diversity is simply the diversity of the pooled community (Eq. S1.5).
The "mean effective numbers of species-by-community combinations … per community" [4] is given by Eq. S1.6. The "species-by-community combination" here refers to Table S1.1 containing 3 species and 2 samples. The relative abundance for each cell is calculated by dividing the value in the cell by the sum of all values (2 in this case). Then the effective number (diversity) for this species-by-community combination is calculated. The mean effective number is then obtained by dividing this number by 2, since we have two samples in the table. Eq. S1.6 is equivalent to the mean α-diversity in the denominator of Eq. S1.4a. This is shown in Eq. S1.7a-c below. In Eq. S1.7a, (1/2) q is factored out from each term. In Eq. S1.7b, the (1/2) being outside the parenthesis is moved into the parenthesis. In Eq. S1.7c, the equation is simplified and shown to be equivalent to the denominator of Eq. S1.4a.
Beta diversity ( q Dβ) can be converted into a dissimilarity index constrained between 0 and 1. Jost [5] showed how β-diversity could be converted to a local overlap measure (dissimilarity subtracted from one). Local means that it takes the perspective of one sample. If two samples each have three species, and only one species is detected in both samples (Table S1.2), the local overlap is 1/3 and the corresponding dissimilarity index is 1-1/3=2/3. Written as a dissimilarity index and calculated for any diversity order q, this index is obtained by Eq. S1.8a-b. This set of indices are also referred to as Sørensen-type indices since at diversity order 0, it is identical to the Sørensen dissimilarity index [4,5].
Beta diversity ( q Dβ) can also be converted to a regional type of overlap measures. Let us again consider the case with two samples where each has three species and only one species is detected in both (Table S1.2). In total, there are five different species in the two samples, and one is found in both samples. Here, the regional overlap is 1/5 and the corresponding dissimilarity is 1-1/5=4/5. Eq. S1.9a-b calculate the regional dissimilarity indices for all diversity orders. This set of indices are also referred to as Jaccard-type indices since at diversity order 0, it is identical to the Jaccard dissimilarity index [4].
Table S1.2. Hypothetical count table to show the difference between local and regional perspective for Hillbased dissimilarity indices (see also Appendix S1 in Chao and Chiu [4]).
Species Sample 1 Sample 2 The local dissimilarity value represents the effective average proportion of OTUs/ASVs in one sample not shared with the other sample. Regional means that the dissimilarity value represents the effective proportion of all OTUs/ASVs that are not present in both samples. In the main paper, we use the local dissimilarity indices, which we call q d. It should be noted that diversity and dissimilarity indices based on Hill numbers also can take phylogenetic-or functional distances between individual species into account [6] using either a phylogenetic tree [7] or a distance matrix as input [8].
Text S1.3: Null model. With most dissimilarity indices, it is difficult to distinguish between dissimilarity caused by compositional turnover and dissimilarity caused by differences in alpha diversity between two samples [9]. Null models provide a solution to this problem. By random assembly of two samples with pre-defined numbers of species from a regional species pool, a null expectation for the dissimilarity value between the two samples can be generated. If the random assembly process is repeated many times, a distribution for the null expectation is obtained. In this study, we build on the null model by Raup and Crick [10], who used incidence-based data tables. Expressed as a dissimilarity index, their method can be described using Eq. S1.10.

= . •
(Eq. S1.10) RC is the Raup-Crick dissimilarity index, NSSexp>SSobs is the number of randomizations in which the shared species between randomly assembled samples are greater than between observed samples, NSSexp=SSobs is the number of randomizations in which the shared species are equal to the observed samples, and NTOT is the total number of randomizations.
Chase et al. [9] modified the index by normalizing it to a value between -1 and 1 and Stegen et al. [11] extended it to Bray-Curtis dissimilarities. Here, we extend it even further, to all possible Hill-based dissimilarity indices (Eq. S1.11).

= . •
(Eq. S1.11) q RC is the Raup-Crick index for any Hill-based dissimilarity of order q, N[qdexp<qdobs] is the number of randomizations in which the dissimilarity between the randomly assembled samples is less than between the observed samples, N[qdexp=qdobs] is the number of randomizations in which the dissimilarities are equal.
With Eq. S1.11, the index is constrained between 0 and 1, where 0 means a lower observed dissimilarity than the null expectation and 1 means a higher observed dissimilarity. Using a significance level (α) of 0.05 [e.g. as in 11], a q RC value > 0.975 would indicate that the microbial community composition of two samples are statistically more dissimilar to each other than expected by chance and a q RC < 0.025 would indicate the two samples are more similar to each other than expected by chance.
Text S1.4: Bray-Curtis can be equivalent to 1 d under special conditions. When the following conditions hold for two samples, the 1 d dissimilarity is equivalent to the Bray-Curtis dissimilarity. 1. The two samples have the exact same relative abundance distribution.
2. An OTU/ASV detected in both samples has the exact same relative abundance in both samples. Table S1.3 shows an example when these conditions hold true. Both sample A and B have two OTUs/ASVs with relative abundance of 0.30, two with 0.15, and two with 0.05, which means they have the same relative abundance distribution and condition 1 holds true. We can also see that OTUs/ASVs that are present in both samples (#1, #3, and #4) have the exact same relative abundance in both samples, which means condition 2 holds true. (Eq. S1.12a) pi,A is the relative abundance of OTU/ASV i in sample A that is shared with sample B. We could just as well use the OTUs/ASVs in sample B here, since the relative abundance of the shared OTUs/ASVs are the same in both samples. Eq. S1.12a is equal to: = ∑ , (Eq. S1.12b) Pj,A is the relative abundance of OTU/ASV j in sample A that is not shared with sample B. Eq. S1.12a and S1.12b are equivalent since: ∑ , + ∑ , = 1 (Eq. S1.12c) Now, let us have a look at the 1 d dissimilarity index. From equations S1.4b and S1.5b, we deduce: Thus, Eq. S1.12b and S1.13e show that 1 d is equivalent to the Bray-Curtis dissimilarity index under the special conditions described above.