Dynamic interaction network inference from longitudinal microbiome data

Background Several studies have focused on the microbiota living in environmental niches including human body sites. In many of these studies, researchers collect longitudinal data with the goal of understanding not only just the composition of the microbiome but also the interactions between the different taxa. However, analysis of such data is challenging and very few methods have been developed to reconstruct dynamic models from time series microbiome data. Results Here, we present a computational pipeline that enables the integration of data across individuals for the reconstruction of such models. Our pipeline starts by aligning the data collected for all individuals. The aligned profiles are then used to learn a dynamic Bayesian network which represents causal relationships between taxa and clinical variables. Testing our methods on three longitudinal microbiome data sets we show that our pipeline improve upon prior methods developed for this task. We also discuss the biological insights provided by the models which include several known and novel interactions. The extended CGBayesNets package is freely available under the MIT Open Source license agreement. The source code and documentation can be downloaded from https://github.com/jlugomar/longitudinal_microbiome_analysis_public. Conclusions We propose a computational pipeline for analyzing longitudinal microbiome data. Our results provide evidence that microbiome alignments coupled with dynamic Bayesian networks improve predictive performance over previous methods and enhance our ability to infer biological relationships within the microbiome and between taxa and clinical factors. Electronic supplementary material The online version of this article (10.1186/s40168-019-0660-3) contains supplementary material, which is available to authorized users.

tween these entities. We evaluate our model by using multiple data sets 48 comprised of the microbiota living in human body parts including gastroin-49 testinal tract, urogenital tract and oral cavity. We show that models for 50 these systems can accurately predict changes in taxa and that they greatly 51 improve upon models constructed by prior methods. Finally, we characterize 52 the biological relationships in the reconstructed microbial communities and 53 discuss known and novel interactions discovered by these models.  To avoid overfitting we removed any sample that had less than nine measured 127 time points, and estimated a cubic B-spline from the observed abundance 128 profile for all taxa in remaining samples using splrep and BSpline from the 129 Python function scipy.interpolate. Additional file 3: Figure S2 shows the 130 original and cubic spline of a representative microbial taxa from a randomly 131 selected individual sample across each data set. 132 Aligning microbial taxon 133 To discuss the alignment algorithm we first assume that a reference sam- 134 ple, to which all other samples would be aligned, is available. We next discuss 135 how to chose such reference. 136 Formally, let s j r (t) be the spline curve for microbial taxa j at time t ∈ [t min , t max ] in the reference time-series sample r, where t min and t max denote the starting and ending time points of s j r , respectively. Similarly, let s j i (t ) be the spline for individual i in the set of samples to be warped for taxa j at time t ∈ [t min , t max ]. Next, analogously to Bar-Joseph et al.
[13], the alignment error for microbial taxa j between s j r and s j i is defined as where α = max{t min , f −1 i (t min )} and β = min{t max , f −1 i (t max )} correspond to the starting and ending time points of the alignment. Observe that by smoothing the curves, it is possible to estimate the values at any intermediate time point in the alignment interval [α, β]. Finally, we define the microbiome alignment error for a microbial taxon of interest S between individual samples r and i as follows E M (r, i) = j∈S e j (r, i).
Given a reference r and microbial taxon S, the alignment algorithm task 137 is to find parameters a and b that minimize E M for each individual sample i 138 in the data set subject to the constraints: a > 0, α < β and (β−α) (tmax−t min ) ≥ .

139
The latter constraint enforces that the overlap between aligned interval [α, β]   DBNs [23,24] where P denotes a set of conditional probability distributions over discrete variables ∆, F denotes a set of linear Gaussian conditional densities over continuous variables Ψ, and Pa G (X) denotes the set of parents for variable X in G. Since we are dealing with both, continuous and discrete nodes in the DBN, in our method, continuous variables (i.e., microbial taxa compositions) are modeled using a Gaussian with the mean set based on a regression model over the set of continuous parents as follows where u 1 , · · · , u k are continuous parents of y; λ 0 is the intercept; λ 1 , · · · , λ k are the corresponding regression coefficients for u 1 , · · · , u k ; and σ 2 is the standard deviation. We point out that if y has discrete parents then we need to compute coefficients L = {λ i } k i=0 and standard deviation σ 2 for each discrete parents configuration. For example, the conditional linear Gaussian density function for variable T 4 ti+1 in Fig. 1e denoted as f (T 4 ti+1 | T 4 ti , C 3 ti , T 2 ti+1 ) is modeled by where λ 1 , λ 2 , λ 3 and σ 2 are the DBN model parameters. In general, given 232 a longitudinal data set D and known structure G, we can directly infer the 233 parameters Θ by maximizing the likelihood of the data given our regression 234 model.

235
Learning DBN structure Learning the DBN structure can be expressed as finding the optimal structure and parameters where P (D | Θ, G) is the likelihood of the data given the model. Intuitively, 236 the likelihood increases as the number of valid parents Pa G (·) increases, 237 thus, making it challenging to infer the most accurate model for data set 238 D. Therefore, the goal is to effectively search over possible structures while using a function that penalizes overly complicated structures and protects 240 from overfitting.

241
Here, we maximize P (D, Θ | G) for a given structure G using maximum likelihood estimation (MLE) coupled with BIC score instead of Bayesian Dirichlet equivalent sample-size uniform (BDeu) metric used in CGBayesNets. The BDeu score requires prior knowledge (i.e., equivalent sample size priors) which are typically arbitrarily set to 1; however, multiple studies have shown the sensitivity of BDeu to these parameters [27,28], as well as the use of improper prior distributions [29]. Alternatively, BIC score does not depend on the prior over the parameters, thus, an ideal approach for scenarios where prior information is not available or difficult to obtain. Next, in order to maximize the full log-likelihood term we implemented a greedy hill-climbing algorithm. We initialize the structure by first connecting each taxa node at the previous time point (for example T 1 ti in Fig. 1e) to the corresponding taxa node at the next time point (T 1 ti+1 in Fig. 1e). We call this setting the baseline model since it ignores dependencies between taxa's and only tries to infer taxa levels based on its levels in the previous time points. Next, we added nodes as parents of a specific node via intra or inter edges depending on which valid edge (i.e., no cycles) leads to the largest increase of the log-likelihood function beyond the global penalty incurred by adding the parameters as measured by the BIC 3 score approximation inspecting the regression coefficients λ 1 , λ 2 , λ 3 immediately suggests whether 265 the impact is positive or negative. In this example, the regression coefficients Also the regression coefficients corresponding to edge thickness were normalized as follows: Let y be a microbial taxa node with continuous taxa parents u 1 , · · · , u k modeled by where λ 1 , · · · , λ k are the corresponding regression coefficients for u 1 , · · · , u k as previously described in this section. The normalized regression coefficients whereū i is the mean abundance of taxa u i across all samples. learning DBNs for microbiome and clinical data. We start by estimating a 293 cubic spline from the observed abundance profile of each taxa (Fig. 1b). Next, 294 we determine an alignment which allows us to directly compare temporal data 295 across individuals (Fig. 1c), as well as filter out abnormal and noisy samples 296 (Fig. 1d). Finally, we use the aligned data to learn causal dynamic models 297 that provide information about interactions between taxa, their impact, and 298 the impact of clinical variables on taxa levels over time (Fig. 1e-f). 299 We applied our methods to study longitudinal data sets from three human indeed be used to infer differences in rates between individuals.

328
Resulting dynamic Bayesian network models 329 We next applied the full pipeline to learn DBNs from the three micro-  predictions on the three data sets are summarized in Fig. 4(a-c) When analyzing large cohorts of microbiome data, it is important to 441 implement a strategy to remove outliers as these can affect our ability to 442 generalize from the collected data. As discussed in Methods, we can use our DBNs following the filtering we obtain even better models. Additional file 7: 454 Figure S5 compares the average MAE results of our proposed DBN model 455 between the unfiltered and filtered samples for the gut and vaginal data sets.

456
As can be seen, a large performance improvement is observed for the gut data 457 while a slight improvement is observed for the vaginal data when removing 458 the outliers. These results suggest that even though the method uses less 459 data to learn the models, the models that it does learn are more accurate. Uncovering biological relationships 487 We next discuss in more detail the learned DBN models. Day of life to Gammaproteobacteria and Clostridia (Fig. 3b). Moreover, the    . Nodes size is proportional to indegree whereas taxa nodes transparency indicates mean abundance. Additionally, dotted lines denote intra edges (i.e., directed links between nodes in same time slice) whereas solid lines denote inter edges (i.e., directed links between nodes in different time slices). Edge color indicates positive (green) or negative (red) temporal influence and edge transparency indicates strength of bootstrap support. Edge thickness indicates statistical influence of regression coefficient as described in Network visualization. a -Learned DBN for the aligned infant gut microbiome data at a sampling rate of 3 days and maxParents = 3. b -Learned DBN for the aligned vaginal microbiome data at a sampling rate of 3 days and maxParents = 3.  Figure 4: Comparison of average predictive accuracy between methods on the filtered data sets. Figure shows the average MAE of our proposed DBN models against a baseline method and previously published approaches as a function of sampling rates where d denotes day(s). Additionally, each method is run on the unaligned and aligned data sets. a -Performance results for infant gut microbiome data. b -Performance results for vaginal microbiome data. c -Performance results for oral cavity microbiome data. d -Performance results for each data set for a sampling rate (sr ) that most closely resembles the originally measured time points.