- Microbiome Announcement
- Open Access
Meta-analysis of human genome-microbiome association studies: the MiBioGen consortium initiative
Microbiome volume 6, Article number: 101 (2018)
In recent years, human microbiota, especially gut microbiota, have emerged as an important yet complex trait influencing human metabolism, immunology, and diseases. Many studies are investigating the forces underlying the observed variation, including the human genetic variants that shape human microbiota. Several preliminary genome-wide association studies (GWAS) have been completed, but more are necessary to achieve a fuller picture.
Here, we announce the MiBioGen consortium initiative, which has assembled 18 population-level cohorts and some 19,000 participants. Its aim is to generate new knowledge for the rapidly developing field of microbiota research. Each cohort has surveyed the gut microbiome via 16S rRNA sequencing and genotyped their participants with full-genome SNP arrays. We have standardized the analytical pipelines for both the microbiota phenotypes and genotypes, and all the data have been processed using identical approaches. Our analysis of microbiome composition shows that we can reduce the potential artifacts introduced by technical differences in generating microbiota data. We are now in the process of benchmarking the association tests and performing meta-analyses of genome-wide associations. All pipeline and summary statistics results will be shared using public data repositories.
We present the largest consortium to date devoted to microbiota-GWAS. We have adapted our analytical pipelines to suit multi-cohort analyses and expect to gain insight into host-microbiota cross-talk at the genome-wide level. And, as an open consortium, we invite more cohorts to join us (by contacting one of the corresponding authors) and to follow the analytical pipeline we have developed.
Our understanding of the microbial communities populating the human body (human microbiota) has progressed tremendously in recent years, catalyzed by the use of next-generation sequencing techniques that overcome the limitations of anaerobic cultivation. Much effort has been devoted to understanding the taxonomic and functional diversity of the microbiota and their encoded collective gene pool, the microbiome, with most research activity focusing on the microbes in our gastrointestinal tract [2, 3]. Much of the research has centered on elucidating links between microbes and various diseases, for instance, obesity, inflammatory bowel disease, and diabetes. This has included several studies that went beyond association to demonstrate causal roles of the gut microbiome in disease development.
A deeper knowledge of the microbial ecosystem, and of the factors that shape its structure, is essential to a better understanding of human biology. Cross-sectional studies carried out in several population-based cohorts have identified the major environmental factors (nutrition, medication, and diet) influencing the composition and functional capacities of the human microbiome [6, 7]. Yet these studies also showed that a large proportion of microbial diversity remained unexplained after considering the environmental influences, thereby raising questions on the role of host genetics.
Given the complex interplay between the microbiome and host physiology, host genetics, as well as genetic interactions with environmental factors, are expected to shape the composition of the microbial community to some degree. Proof-of-principle genome-wide screens (e.g., quantitative trait loci (QTL) studies) have been carried out in model organisms such as the mouse, while the majority of published studies on humans have used a candidate gene approach to cope with sample size limitations. Recently, analyses of twin cohorts have demonstrated a genetic contribution to variation in the relative proportions of specific members of the microbiota; for example, investigations in 1126 twins identified associations to 28 loci, including genetic variants in LCT.
Bonder et al., Turpin et al., and Wang et al. then simultaneously reported GWAS results from three independent cohorts, each revealing glimpses into the genetic landscape underlying the gut microbiota structure [12,13,14]. Together, these GWAS have identified some 100 genome-wide significant loci associated with community structure, taxon abundance, and gut microbiome biodiversity. However, similar to initial GWAS efforts in many other complex traits, there was little overlap seen in the three sets of summary statistics (Fig. 1). SLIT3 was the only gene to pass a standard genome-wide significance threshold of 5 × 10⁻⁸ in the TwinsUK and Bonder et al. studies [11, 12], but the two reported single nucleotide polymorphisms (SNPs) within this gene are not proxies of each other, nor do they correlate to the same bacteria or pathway. Although overlap in the associated genetic variants was limited to the LCT locus, associations to various C-type lectin genes were observed by both Bonder et al. and Wang et al. [11, 12, 14]. These discordances emphasize the need to increase the number of samples in the discovery setting to improve statistical power and to reduce the probability of false-positive associations. Multi-cohort analysis will also overcome limitations imposed by population stratification as well as technical artifacts, including the differences in model choice.
We have therefore established the MiBioGen consortium to study the influence of human genetics on gut microbiota. This collaborative effort currently comprises 18 cohorts worldwide and new members will join us after completing their data collection. We aim to develop a uniform pipeline to allow maximum harmonization across the microbiome data and to use GWAS meta-analyses to provide a fuller picture of human gene-microbiome associations. Furthermore, since all the cohorts have been well phenotyped, their data will aid future investigations into other research questions.
MiBioGen initiative and cohort descriptions
Most of the 18 studies participating in the consortium are prospective cohort studies in countries in Europe, Asia, and North America (Table 1). Besides genetics and microbiome data, the cohorts have also been deeply phenotyped, covering multiple individual outcomes (e.g., anthropometric, metabolic, disease-related). These cohorts also incorporate a wide age spectrum, including both children and adults. The number of individuals per cohort study ranges from 139 to 2482, with a total of 19,790 individuals (18,965 after quality control (QC)). In terms of both sample size and geographic distribution, the MiBioGen consortium is, to our knowledge, the most comprehensive effort for investigating associations between host genetics and the microbiome on a population scale.
As we have multiple phenotypes in addition to microbiome and host genotypes available, we can assess the putative effect of the gut microbiome on human health. Several of the cohorts were set up to investigate certain phenotypes and/or diseases, for instance, GEM (healthy relatives of patients with Crohn’s disease), or FoCus (a nutritional intervention study). As a basis for epidemiological studies, various metadata were collected by the different cohorts including anthropometric measures, blood chemistry, dietary pattern, intestinal permeability, and lifestyle. These factors have been shown to influence microbiota composition [6, 7, 14]. All these metadata and phenotypes provide opportunities for assessing the biological significance of gene-microbiome associations, and for gaining insights into gene-environment interactions and the interactions among host genotype, microbiome, and disease.
To provide a platform for robust and reliable results and also to simplify study participation in MiBioGen, we have standardized all the procedures and protocols that participating cohorts need to follow. The MiBioGen data processing pipeline comprises four steps: (1) microbiome data processing, (2) genotype data processing, (3) genome-wide association analyses, and (4) meta-analyses.
Microbiome data processing
The microbiome data included in our consortium was mainly generated using an Illumina sequencing platform (MiSeq or HiSeq). The most frequently sequenced hyper-variable region of the 16S rRNA gene was V4 (eight cohorts, n = 8472), although five cohorts sequenced the V3-V4 region (n = 5719), and another four sequenced the V1-V2 region (n = 4774). We assessed the compatibility of the datasets obtained from sequencing different regions by comparing technical replicates of ten samples (three replicates each) generated from different hyper-variable regions. This analysis showed that the influence of technical differences in microbiome profiles is less than the inter-individual differences (Additional file 1). Nevertheless, including different hyper-variable regions requires compatible methods of 16S rRNA gene-amplicon data processing, and it is no longer feasible to use “open” (de novo) operational taxonomic unit (OTU) picking protocols. Further analysis of technical replicates using closed-reference OTU picking showed that the clustering results also have large technical artifacts (Additional file 1). In contrast, the between-replicate similarity at the genus and higher taxonomic levels showed reasonable concordance (Additional file 1). As a result, we implemented a 16S data processing pipeline comprising a naive Bayesian classifier from the Ribosomal Database Project and the most recent, full, SILVA database (release 128). We only analyzed taxonomic results at the genus and higher taxonomic levels.
As well as a standard taxonomy binning procedure, all the additional steps have been standardized across the consortium, including downsampling to 10,000 reads with a fixed seed to allow for replicability, transformation procedures, corrections for covariates, and the thresholds set for bacterial taxa to be included in the analysis (any taxon should be present in more than 10% of the cohort’s samples). This filtering effectively reduces the total number of tests and also makes cross-validation and meta-analysis feasible among all the participating cohorts. 16S data processing is currently being performed in all the cohorts and shows a high level of congruence: the core-measurable microbiome (CMM), defined as the list of bacterial taxa present in more than 10% of the samples in a cohort, is stable across the participating cohorts and accounts for around 80% of each cohort’s microbiome composition.
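The downsampling and prevalence-filtering steps described above can be sketched as follows. This is a minimal illustration rather than the consortium's actual code: the function names, the seed value, and the decision to drop samples with fewer than 10,000 reads are assumptions made for the example; only the depth of 10,000 reads, the use of a fixed seed, and the 10% prevalence threshold come from the text.

```python
import random
from collections import Counter

def rarefy(counts, depth=10_000, seed=42):
    """Subsample one sample's taxon counts to `depth` reads without replacement.

    `counts` maps taxon name -> read count. The seed value is hypothetical;
    the point is only that a fixed seed makes the draw reproducible.
    """
    reads = [taxon for taxon, n in counts.items() for _ in range(n)]
    if len(reads) < depth:
        return None  # assumption: samples that are too shallow are dropped
    rng = random.Random(seed)
    return dict(Counter(rng.sample(reads, depth)))

def prevalent_taxa(samples, min_prevalence=0.10):
    """Return the taxa present in more than `min_prevalence` of the samples."""
    n_samples = len(samples)
    present = Counter(
        taxon for sample in samples for taxon, c in sample.items() if c > 0
    )
    return {t for t, k in present.items() if k / n_samples > min_prevalence}

if __name__ == "__main__":
    sample = {"Bacteroides": 15_000, "Prevotella": 5_000}
    rarefied = rarefy(sample)
    print(sum(rarefied.values()))        # total reads after downsampling
    print(rarefied == rarefy(sample))    # same seed, same result
```

Applying `prevalent_taxa` to the rarefied samples of one cohort yields that cohort's list of testable taxa, which corresponds to the core-measurable microbiome as defined above.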
Genotype data processing
Individual genome-wide genotype data was generated by the different cohort studies using different genotyping platforms and arrays (Table 1). In order to utilize the genome-wide data and remove artifacts resulting from the different platforms, we imputed missing genotypes to extend the resolution on a genome-wide level. We standardized the imputation procedure for each cohort, including the pre-imputation quality control, reference imputation panel, imputation server and software, as well as the post-imputation filtering to include SNPs in the analyses.
Quality control performed prior to imputation was carried out by each cohort independently according to our general recommendations. Imputation was performed on a freely available Michigan server (https://imputationserver.sph.umich.edu/index.html) that uses a two-step approach: phasing with the Eagle v2.3 algorithm, followed by imputation with Minimac. For our consortium, the data was imputed to the Haplotype Reference Consortium (HRC 1.1) reference panels. Imputed SNPs were admitted to the association analyses only after filtering on minor allele frequency (5%), posterior imputation quality (0.4, applied per sample), and variant imputation quality (0.5, applied per SNP). After imputation, each study yielded around 39.1 million SNPs, with 4 to 6 million variants passing post-imputation QC.
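As an illustration of how the three post-imputation thresholds might be applied, the sketch below filters variants and masks uncertain genotype calls. The record layout, the field and function names, and the use of inclusive comparisons are assumptions for the example; in particular, reading the per-sample threshold of 0.4 as a minimum genotype posterior probability is one plausible interpretation, not something the text specifies.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

MAF_MIN = 0.05        # minor allele frequency threshold quoted in the text
VARIANT_Q_MIN = 0.5   # per-SNP imputation quality threshold quoted in the text
POSTERIOR_MIN = 0.4   # per-sample posterior threshold quoted in the text

@dataclass
class ImputedVariant:
    """Hypothetical summary record for one imputed SNP in one cohort."""
    snp_id: str
    maf: float      # minor allele frequency in the cohort
    quality: float  # variant-level imputation quality score

def passes_variant_qc(v: ImputedVariant) -> bool:
    """Keep a SNP only if it is common enough and well imputed."""
    return v.maf >= MAF_MIN and v.quality >= VARIANT_Q_MIN

def best_guess_genotype(posteriors: Sequence[float]) -> Optional[int]:
    """Return the most probable genotype (0, 1, or 2 alternate alleles),
    or None when no genotype reaches the posterior threshold."""
    best = max(range(len(posteriors)), key=lambda g: posteriors[g])
    return best if posteriors[best] >= POSTERIOR_MIN else None

if __name__ == "__main__":
    variants = [
        ImputedVariant("rs_common_good", maf=0.12, quality=0.93),
        ImputedVariant("rs_rare", maf=0.01, quality=0.95),
        ImputedVariant("rs_poorly_imputed", maf=0.20, quality=0.31),
    ]
    print([v.snp_id for v in variants if passes_variant_qc(v)])
    print(best_guess_genotype([0.05, 0.90, 0.05]))
```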
Genome-wide association analysis
Previous microbiome GWAS have used different statistical methods to test association of genetic variants with gut microbiome taxa [9,10,11,12], and these might contribute to some of the differences in observed associations. We are therefore developing a uniform analytical pipeline to be implemented by all the studies participating in our consortium; it uses flexible statistical approaches to cope with the non-normality and high dispersion inherent to microbiome data. Several layers of microbiome representations are considered as traits in GWAS: general diversity metrics (alpha- and beta-diversity), series of binomial traits of bacterial presence, and quantitative traits of bacterial relative abundance. At the moment, we are using multiple cohorts for benchmarking, to fine-tune our algorithm and to reduce inter-cohort and technical differences.
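The trait layers listed above can be derived from a single sample's taxon counts roughly as shown below. This is a sketch under stated assumptions: the text does not name a specific alpha-diversity index or abundance transformation, so the Shannon index, the natural-log transform, and the treatment of absent taxa as missing values are illustrative choices, and all function names are hypothetical. Beta-diversity is omitted because it is defined between pairs of samples.

```python
import math

def shannon_index(counts):
    """Alpha diversity of one sample as the Shannon index (natural log)."""
    total = sum(counts.values())
    props = [c / total for c in counts.values() if c > 0]
    return -sum(p * math.log(p) for p in props)

def presence(counts, taxon):
    """Binary trait: 1 if the taxon was observed in the sample, else 0."""
    return int(counts.get(taxon, 0) > 0)

def log_relative_abundance(counts, taxon):
    """Quantitative trait: log relative abundance of the taxon.

    Returns None when the taxon is absent, so that zero counts are handled
    by the presence trait instead (an assumption made for this sketch).
    """
    c = counts.get(taxon, 0)
    if c == 0:
        return None
    return math.log(c / sum(counts.values()))

if __name__ == "__main__":
    sample = {"Bifidobacterium": 50, "Bacteroides": 50}
    print(round(shannon_index(sample), 4))
    print(presence(sample, "Prevotella"))
    print(log_relative_abundance(sample, "Bifidobacterium"))
```

Splitting each taxon into a presence trait and an abundance trait among carriers is one common way of coping with the many zeros in microbiome count data.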
Given the substantial (10-fold) increase in sample size, spread across 18 cohorts, we expect to be able to identify new genomic loci that affect individual bacteria as well as microbiome composition in general. Based on the effect size (0.147 × SD, using a genome-wide threshold of 5 × 10⁻⁸) in some 1800 individuals, this consortium can theoretically provide 80% power to detect effects larger than 0.045 × SD. Our full pipeline can be found and followed at https://github.com/alexa-kur/miQTL_cookbook. We will also publish summary statistical results from each cohort, as well as the full meta-study results, both on GitHub and as supplementary files in our future publications.
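The power figures quoted above can be approximately reproduced with a standard normal approximation for a single-variant association test. The sketch below is not taken from the consortium's pipeline: it assumes that the effect is expressed in SD units per SD of genotype (so that its square is the fraction of trait variance explained), uses a two-sided test, and ignores the negligible contribution of the opposite tail. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def min_detectable_effect(n, alpha=5e-8, power=0.80):
    """Smallest standardized effect detectable with the given power.

    Normal approximation: effect ~ (z_{alpha/2} + z_{power}) / sqrt(n).
    """
    std_normal = NormalDist()
    z_alpha = -std_normal.inv_cdf(alpha / 2)  # critical value, two-sided test
    z_power = std_normal.inv_cdf(power)       # quantile for the target power
    return (z_alpha + z_power) / sqrt(n)

if __name__ == "__main__":
    # Roughly 0.148 for n = 1800, close to the 0.147 quoted in the text
    print(round(min_detectable_effect(1800), 3))
    # Roughly 0.046 for the 18,965 individuals remaining after QC
    print(round(min_detectable_effect(18965), 3))
```

Because the detectable effect scales with 1/sqrt(n), a 10-fold increase in sample size lowers it by a factor of about 3.2, which is consistent with the drop from 0.147 to roughly 0.045 × SD.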
Conclusions and future directions
The MiBioGen consortium’s large-scale meta-analysis of 18 cohorts drawn from different populations will permit us to explore the genetic architecture of the gut microbiome. In addition to classic association studies, we will adopt more sophisticated approaches to gain a better understanding of the role of the gut microbiome as a mediator between genetic predisposition and human health/disease. For example, we will explore the association of individual risk scores to common diseases, based on published GWAS results and individual microbiome composition.
We will also explore human gene-environment interactions with respect to gut microbiome composition. Such interactions have been observed for the LCT non-functional variant and for dairy intake in relation to the abundance of Bifidobacteria [10, 19]. Comprehensive studies have explored the independent effects of environmental and genetic forces on the gut microbiome [6, 7, 12,13,14], and we will investigate a number of gene-environment interactions of interest, including gene-diet, using the combined genetic data and extensive environmental metadata. Certain gene-environment interactions can also be examined in those cohorts that collected stool samples at multiple time points. We appreciate that it will be difficult to determine causality, but we will probably be able to identify a series of environment-gene-microbiome triangles, for instance, those involving age, gender, medication usage, or body mass index. Our results will lead to hypotheses on the links underlying microbiome-related physiological processes. We would therefore encourage any cohorts with an interest in analyzing host-microbiota associations in their own data to join the MiBioGen consortium and to contribute to more overall insights into the intricacies of host genomes’ role in shaping the gut microbiota.
Finally, the additional phenotypes available in each cohort will provide a unique opportunity for quantifying the contribution of the gut microbiome to different phenotypes. For example, GWAS analyses have already been focused on metabolic traits and diseases in different cohorts, and much more cross-checking can be carried out using the EBI GWAS Catalog. The overlap in significant loci will reveal intrinsic relationships between the microbiome, genetics, and diseases, thereby adding to our knowledge of the molecular basis of these pathologies. Recently developed strategies, such as linkage disequilibrium score regression and polygenic risk scores, as well as downstream pathway enrichment analyses, will help translate genetic associations into real biological insights into the host-microbiome interaction. Our consortium will thus not only contribute to fundamental knowledge on the gut microbiome but also lead on to clinical and therapeutic efforts in treating diseases.
EBI: European Bioinformatics Institute
GWAS: Genome-wide association studies
HRC: Haplotype Reference Consortium
QTL: Quantitative trait loci
Qin J, et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature. 2012;490:55–60.
Yatsunenko T, et al. Human gut microbiome viewed across age and geography. Nature. 2012;486:222–7.
Turnbaugh PJ, et al. The human microbiome project. Nature. 2007;449:804–10.
Sommer F, et al. The resilience of the intestinal microbiota influences health and disease. Nat Rev Microbiol. 2017;15:630–8.
The Human Microbiome Project Consortium. A framework for human microbiome research. Nature. 2012;486:215–21.
Zhernakova A, et al. Population-based metagenomics analysis reveals markers for gut microbiome composition and diversity. Science. 2016;352:565–9.
Falony G, et al. Population-level analysis of gut microbiome variation. Science. 2016;352:560–4.
Org E, et al. Genetic and environmental control of host-gut microbiota interactions. Genome Res. 2015;25:1558–69.
Benson AK, et al. Individuality in gut microbiota composition is a complex polygenic trait shaped by multiple environmental and host genetic factors. Proc Natl Acad Sci USA. 2010;107:18933–8.
Goodrich JK, et al. Human genetics shape the gut microbiome. Cell. 2014;159:789–99.
Goodrich JK, et al. Genetic determinants of the gut microbiome in UK twins. Cell Host Microbe. 2016;19:731–43.
Bonder MJ, et al. The effect of host genetics on the gut microbiome. Nat Genet. 2016;48:1407–12.
Turpin W, et al. Association of host genome with intestinal microbial composition in a large healthy cohort. Nat Genet. 2016;48:1413–7.
Wang J, et al. Genome-wide association analysis identifies variation in vitamin D receptor and other host factors influencing the gut microbiota. Nat Genet. 2016;48:1396–406.
Kurilshikov A, et al. Host genetics and gut microbiome: challenges and perspectives. Trends Immunol. 2017;38:633–47.
Wang Q, et al. Naive Bayesian classifier for rapid assignment of rRNA sequences into the new bacterial taxonomy. Appl Environ Microbiol. 2007;73:5261–7.
Das S, et al. Next-generation genotype imputation service and methods. Nat Genet. 2016;48:1284–7.
Dudbridge F. Power and predictive accuracy of polygenic risk scores. PLoS Genet. 2013;9(3):e1003348.
Goodrich JK, et al. Cross-species comparisons of host genetic associations with the microbiome. Science. 2016;352:29–32.
Bulik-Sullivan BK, et al. LD score regression distinguishes confounding from polygenicity in genome-wide association studies. Nat Genet. 2015;47:291–5.
We thank Jackie Senior for editing the manuscript. Further acknowledgement of each cohort can be found in Additional file 2.
Full list of MiBioGen consortium participants
Tarun Ahluwalia 1, Elad Barkan 2,3, Larbi Bedrani 4, Jordana Bell 5, Hans Bisgaard 1, Michael Boehnke 6, Marc Jan Bonder 7,8, Klaus Bønnelykke 1, Dorret I. Boomsma 9, Kenneth Croitoru 10, Gareth E. Davies 11, Eco de Geus 9, Frauke Degenhardt 12, Mauro D’Amato 13, Erik A. Ehli 11, Osvaldo Espin-Garcia 14,15, Casey T. Finnicum 11, Myriam Fornage 16, Andre Franke 12, Lude Franke 7, Fabian Frost 17, Jingyuan Fu 7,18, Femke-A. Heinsen 12, Georg Homuth 19, David Hughes 20,21, Richard IJzerman 22, Matthew A Jackson 5, Leon Eyrich Jessen 1, Daisy Jonkers 23, Tim Kacprowski 19, Han-Na Kim 24, Hyung-Lae Kim 24, Robert Kraaij 25, Alex Kurilshikov 7, Markku Laakso 26, Lenore Launer 27, Markus M. Lerch 17, Kreete Lüll 28, Aldons J. Lusis 29, Massimo Mangino 5, Julia Mayerle 17,30, Hamdi Mbarek 9, Maria Carolina Medina 25,31,32, Katie Meyer 33, Karen L. Mohlke 34, Elin Org 28, Andrew Paterson 35,36,37, Haydeh Payami 38, Djawad Radjabzadeh 25, Jeroen Raes 39,40, Daphna Rothschild 2,3, Malte Rühlemann 12, Serena Sanna 7, Eran Segal 2,3, Shiraz Shah 1, Michelle Smith 4,10, Tim Spector 5, Claire Steves 5, Jakob Stokholm 1, Joanna W. Szopinska 41, Jonathan Thorsen 1, Nicolas Timpson 20,21, Williams Turpin 4,10, André G. Uitterlinden 25,42, Alejandro Arias Vasquez 41, Henry Völzke 44, Urmo Vosa 7, Zachary Wallen 38, Jun Wang 39,40, Frank Ulrich Weiss 17, Omer Weissbrod 2,3, Cisca Wijmenga 7,45, Gonneke Willemsen 9, Wei Xu 35,46, Yeojun Yun 24, Alexandra Zhernakova 7
1 COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
2 Department of Computer Science and Applied Mathematics, Weizmann Institute of Science, Rehovot, Israel
3 Department of Molecular Cell Biology, Weizmann Institute of Science, Rehovot, Israel
4 Division of Gastroenterology, Department of Medicine, University of Toronto, Toronto, Ontario, Canada
5 Department of Twin Research and Genetic Epidemiology, King’s College London, London, UK
6 Department of Biostatistics and Center for Statistical Genetics, University of Michigan, MI, USA
7 University of Groningen, University Medical Center Groningen, Department of Genetics, Groningen, The Netherlands
8 European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, UK
9 Department of Biological Psychology, Amsterdam Public Health Research Institute, VU Amsterdam, Amsterdam, The Netherlands
10 Zane Cohen Centre for Digestive Diseases, Mount Sinai Hospital, Toronto, Ontario, Canada
11 Avera Institute for Human Genetics, Avera McKennan Hospital & University Health Center, Sioux Falls, SD, USA
12 Institute of Clinical Molecular Biology, Christian Albrechts University of Kiel, Kiel, Germany
13 Unit of Clinical Epidemiology, Department of Medicine Solna, Karolinska Institutet, Stockholm, Sweden
14 Lunenfeld-Tanenbaum Research Institute, Sinai Health System, Toronto, Ontario, Canada
15 Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
16 Health Science Center at Houston, University of Texas, Houston, TX, USA
17 Department of Medicine A, University Medicine Greifswald, Greifswald, Germany
18 University of Groningen, University Medical Center Groningen, Department of Pediatrics, Groningen, The Netherlands
19 Department of Functional Genomics, Interfaculty Institute for Genetics and Functional Genomics, University Medicine Greifswald, Germany
20 MRC Integrative Epidemiology Unit at University of Bristol, Bristol, UK
21 Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
22 Department of Internal Medicine, Diabetes Centre, VU University Medical Centre, Amsterdam, The Netherlands
23 Division of Gastroenterology-Hepatology, Department of Internal Medicine, NUTRIM School of Nutrition and Translational Research in Metabolism, Maastricht University Medical Center, Maastricht, The Netherlands
24 Department of Biochemistry, School of Medicine, Ewha Womans University, Seoul, South Korea
25 Department of Internal Medicine, Erasmus MC, Rotterdam, The Netherlands
26 Institute of Clinical Medicine, Internal Medicine, University of Eastern Finland and Kuopio, University Hospital, Kuopio, Finland
27 Laboratory of Epidemiology and Population Science, National Institute on Aging, National Institutes of Health, Bethesda, MD, USA
28 Institute of Genomics, University of Tartu, Estonia
29 Department of Medicine, Department of Human Genetics, Molecular Biology Institute, Department of Microbiology, Immunology and Molecular Genetics, University of California, CA, USA
30 Department of Medicine 2, University Hospital, Ludwig-Maximilians-University, Munich, Germany
31 The Generation R Study Group, Erasmus MC, 3000 CA Rotterdam, The Netherlands
32 Department of Epidemiology, Erasmus MC, 3000 CA Rotterdam, The Netherlands
33 Department of Nutrition, Nutrition Research Institute, University of North Carolina at Chapel Hill, Kannapolis, NC, USA
34 Department of Genetics, University of North Carolina at Chapel Hill, NC, USA
35 Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
36 Division of Epidemiology, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
37 Genetics and Genome Biology, The Hospital for Sick Children Research Institute, The Hospital for Sick Children, Toronto, Ontario, Canada
38 Departments of Neurology and Genetics, University of Alabama at Birmingham, Birmingham, AL, USA
39 Department of Microbiology and Immunology, Rega Institute. KU Leuven – University of Leuven, Leuven, Belgium
40 VIB Center for Microbiology, Leuven, Belgium
41 Department of Psychiatry, Radboudumc, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands
42 Department of Epidemiology, Erasmus MC, Rotterdam, The Netherlands
43 Department of Medicine 2, University Hospital, Ludwig-Maximilians-University, Munich, Germany
44 Institute for Community Medicine, Greifswald University Hospital, Greifswald, Germany
45 K.G. Jebsen Coeliac Disease Research Centre, Department of Immunology, University of Oslo, Norway
46 Department of Biostatistics, Princess Margaret Cancer Centre, Toronto, Ontario, Canada
Funding and other related information for each cohort can be found in Additional file 2.
Availability of data and materials
Data availability is determined by each cohort, according to the agreements with their participants, as well as their local regulations and institute requirements.
Ethics approval and consent to participate
Ethical approval and consent to participate were acquired by each cohort, according to their local regulations and institute requirements.
Consent for publication
Consent for publication was acquired by each cohort, according to their local regulations and institute requirements.
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.