Gut metagenomes of type 2 diabetic patients have characteristic single-nucleotide polymorphism distribution in Bacteroides coprocola
© The Author(s). 2017
Received: 25 July 2016
Accepted: 10 January 2017
Published: 1 February 2017
Gut microbes play a critical role in human health and disease, and researchers have begun to characterize their genomes, the so-called gut metagenome. Thus far, metagenomics studies have focused on genus- or species-level composition and microbial gene sets, while strain-level composition and single-nucleotide polymorphism (SNP) have been overlooked. The gut metagenomes of type 2 diabetes (T2D) patients have been found to be enriched with butyrate-producing bacteria and sulfate reduction functions. However, it is not known whether the gut metagenomes of T2D patients have characteristic strain patterns or SNP distributions.
We downloaded public gut metagenome datasets from 170 T2D patients and 174 healthy controls and performed a systematic comparative analysis of their metagenome SNPs. We found that Bacteroides coprocola, whose relative abundance did not differ between the groups, had a characteristic distribution of SNPs in the T2D patient group. We identified 65 genes, all in B. coprocola, that had remarkably different enrichment of SNPs. The first and sixth ranked genes encode glycosyl hydrolases (GenBank accession EDU99824.1 and EDV02301.1). Interestingly, alpha-glucosidase, which is also a glycosyl hydrolase located in the intestine, is an important drug target of T2D. These results suggest that different strains of B. coprocola may have different roles in human gut and a specific set of B. coprocola strains are correlated with T2D.
KeywordsType 2 diabetes Metagenome Bacteria Phylogenetic analysis SNP enrichment
Human gut microbiota are critical to human health and have been related to various disease conditions, such as obesity , diabetes [2–4], cirrhosis of the liver , inflammatory bowel disease , atopic dermatitis , and pulmonary inflammation . Next-generation sequencing (NGS) and bioinformatics technologies provide access to the genetic information of the entire microbiome and thus enable systematic investigation of its composition and functional genetics.
A number of metagenomics studies have compared the relative abundance of bacterial species or pathway enrichment between patients and healthy controls (HCs). However, little work has been done to elucidate strain-level variations or single-nucleotide polymorphisms (SNPs) in the metagenome. Because even slight nucleotide variations can alter the pathogenic behavior and antibiotic resistance of bacteria [9–11], analyses of genomic variations (i.e., SNPs, insertions, and deletions) and structural variation of the metagenome are important for understanding microbiome biology.
Regarding strain-level variation in the microbial metagenome, Schloissnig and colleagues proposed a workflow for analyzing metagenomics datasets at the strain level and described the genomic variation landscape of human gut microbial genomes . Subsequently, Greenblum et al. detected extensive variations in strain-level copy-numbers in the human gut microbiome  and Zhu et al. reported considerable differences (based on gene deletions) in the gene content of strains within the same species in the human gut . Tropism and persistence of different oral Neisseria strains have also been studied using metagenomics sequencing . Although genomic variation of microbiomes is well documented by these studies, association of strain-level metagenomics findings with human diseases has been limited.
Type 2 diabetes (T2D) is a complex metabolic disorder afflicting hundreds of millions of people worldwide . Because T2D is related to diet and digestion, the role of gut microbiota in T2D initiation and progression is of great interest. Previous large-scale T2D-related metagenomics research has shown that the proportion of phylum Firmicutes and class Clostridia cells in the microbiome is significantly reduced in T2D  and that the gut microbiota of T2D patients tends to have fewer butyrate-producing bacteria, such as Roseburia intestinalis and Faecalibacterium prausnitzii .
To the best of our knowledge, no prior study has resolved the association between T2D and the gut microbiome at a strain or SNP level. Here, we utilized a public NGS dataset resource and performed a comparative study examining the SNPs of gut metagenomes in T2D patients relative to HCs.
T2D and control dataset
Raw NGS datasets of DNA obtained from fecal samples of 170 T2D patients and 174 HCs in China  were downloaded from the NCBI Sequence Read Archive (total data, 1.2 terabases; average sample size, 3.5 gigabases; accession numbers SRA045646 and SRA050230). To reduce SNP false positives, we trimmed abnormal bases and filtered low-quality reads as described in detail previously . For each base (A, T, G and C), a mean number of base calls (f) across all sites and a standard deviation (SD) were calculated. Starting from the first site at 5′ end, the site was trimmed if the base call number for a base was beyond f ± 2 × SD. The trimming at 5′ end was terminated until encountering a site with all base call numbers within the range. Next, Trimmomatic  was used to remove adapters and trim low-quality bases (<Q20) at the 3′ end (parameters: -phred33 ILLUMINACLIP:adapters.fa:1:0:7 TRAILING:20 SLIDINGWINDOW:5:10 MINLEN:45 AVGQUAL:20). A total of 1.04 terabases of data remained after quality control procedures were completed. Clinical information and other characteristics of the 344 individuals included in the analysis were obtained from  and .
Bacterial reference genome determination
We first determined a set of bacterial reference genomes for samples included in this study. Metaphlan2, which is based on approximately 17,000 reference genomes , was employed to profile the bacterial species in each sample. Genomes of species that were identified by metaphlan2 in at least four samples were included in the reference set for the alignment analyses. The genome sequences of all of these species were downloaded from the NCBI assembly database and are listed in Additional file 1: Table S1.
Selecting species and genes with sufficient supporting reads
Clean reads were aligned to the whole reference set with Burrows-Wheeler Aligner-maximal exact match (BWA-MEM)  in default settings, and only unique alignments were outputted. Species with sufficient supporting reads were selected with a cutoff of ≥40% of the reference genome being covered by a ≥10× depth in at least 20 samples in both the T2D and the control groups. For genes, the cutoff was ≥80% gene sites with ≥10× depth in at least 10 samples in each group. Only the species and genes that met these criteria were subjected to subsequent SNP analyses.
SNP and intra-sample variation calling and filtering
Two tools, BCFtools  and VarScan2 , were applied to identify SNPs of the metagenome. The alignment of duplicates by BWA-MEM was first marked and filtered in Picard . Then, SAMtools  was used to generate “mpileup” files from the SAM-formatted alignment files. The mpileup files were employed as input files for both BCFtools and VarScan2. The parameters used for BCFtools and VarScan2 were “-vmO z –V indels” and “pileup2snp min-coverage 10, p value 0.05, min-avg-qual 15,” respectively. SNPs detected by both the tools were selected. SNPs were further filtered with the requirements of a ≥0.5 mutated allele (relative to that in the reference genome) frequency, ≥4× supporting reads for the variant, and strand bias of sequencing bases less than tenfold in both BCFtools and VarScan2. Namely, when genome sites with heterozygosity were found in a sample, the major allele was used for the SNP analysis.
To address the heterozygosity or intra-sample variations, mutated allele frequencies (MuAFs) of variant sites were obtained for analysis under polyclonal scenario. The MuAFs of sites were defined as the mean values of outputs by BCFtools and VarScan2, which were highly correlated (R 2 = 0.997). To examine whether intra-sample variations affect the result, SNPs with a >0.8 MuAF were selected for a parallel SNP analysis.
Annotation of genes and SNPs
Gene ontology (GO) annotations of each genome were downloaded from Uniprot . SNPs were annotated in SnpEff  with the -eff parameter. The genome annotation files used by SnpEff were obtained from GenBank.
Phylogenetic tree construction based on whole-genome level SNPs of B. coprocola
Based on the reference genome of B. coprocola (reference strain DSM 17136, GenBank accession GCA_000154845.1), genome regions with >20% samples not having valid coverage (≥10× depth) were discarded. If ≤20% of the samples had invalid coverage in a region, the bases in that region in those samples were labeled as “N.” Then, the nucleotides at SNP sites from the samples were extracted to generate a pileup file. Phylogenetic trees were constructed based on the whole-genome level aligned SNPs by using randomized axelerated maximum likelihood (RAxML) v8.2.9 (100 bootstrap replicates), with GTR model of nucleotide substitution, γ-distributed rates among sites, and Felsenstein correction for ascertainment bias [27, 28]. The parameters for RAxML were “-#100 –m ASC_GTRGAMMA –f a.” Rooting was undertaken by using RAxML with “-f I” option. Trees were drawn with the R package ggtree .
Phylogenetic tree construction of genes
The nucleotide sequences of genes were obtained by aligning reads to corresponding full-length genes of the reference genome. For a given gene, gene regions were discarded if with >20% samples not having valid coverage (≥10× depth) and the coverages of full-length genes are listed in Additional file 1: Table S4. If ≤20% of the samples had invalid coverage in a region, the bases in that region in those samples were labeled as “N.” Then, the full-length genes, except the discarded regions, were aligned. The phylogenetic trees of genes were constructed by using RAxML v8.2.9 with settings as these for whole-genome level, but without ascertainment correction. The two clusters are separated at the root of the phylogenetic tree.
Clustering of samples by mutated allele frequencies of sites
For a given samples, the sites of bacterial genome and gene with a >0.2 MuAFs were selected. For all samples, we obtained a matrix with the column denoting site positions and the rows denoting samples. If there was no variation or MuAFs <0.2, the MuAFs were set to 0. Then, hierarchical clustering and affinity propagation (AP) clustering  were applied to the matrix. The hclust function of stat package in R v3.3.0 was used for hierarchical clustering with parameters “method = complete,” and the output tree was drawn by ggtree. For AP clustering, APcluster in R v3.3.0 was employed with parameters “K = 2.” The clusters generated by AP clustering were visualized by using Cytoscape v3.5.0 . For each node, the top five edges connecting the nodes with the largest similarity are shown.
Relative abundance of species and genome/gene densities were compared between the T2D group and control group with the Mann-Whitney test. Fisher’s exact test was performed to test for bias of SNP sites and sample enrichment inferred from the phylogenetic tree and clusters by AP clustering. Hypergeometric test was applied for one-tailed test for enrichment of biased SNP sites at gene level. The q value with Storey and Tibshirani’s method (R package qvalue v1.43.0) was applied for multiple testing correction  to identify species and genes with significantly biased SNP distribution.
Based on gut metagenomics data from 344 individuals (170 T2D patients and 174 healthy controls), we identified a total of 356 bacterial species (Additional file 1: Table S1; Additional file 2: Figure S3) and their relative abundances. Consistent with previous reports, we found that, relative to the HC group, the T2D group had lower proportions of phylum Firmicutes, class Clostridia, and butyrate-producing bacteria (Additional file 2: Figure S1) [2, 3]. Relative abundance did not differ between the groups for 270 of the 356 species (75.8%) analyzed (Mann-Whitney test, p > 0.05; Additional file 1: Table S2).
We selected 20 bacterial species with sufficiently supporting NGS reads in a sufficient number of samples (see “Methods” section, Additional file 1: Table S3) for analysis of SNP distribution. Based on the reference genomes of these 20 species, a total of 5.94 million SNPs were identified, of which 99.65% were bi-allelic and 0.35% were tri-allelic. The distributions within these 20 species of the normalized bi-allelic SNP densities calculated for genome regions with valid coverage (>10×) in the T2D and HC groups are reported in Additional file 2: Figure S2. The SNP density distribution differed significantly between the T2D and HC groups (Mann-Whitney test, p = 0.0083, q = 0.0258) for only one of the 20 species, namely, Bacteroides coprocola (reference strain DSM 17136, Genbank accession GCA_000154845.1). However, the mean relative abundance of B. coprocola for the T2D group (9.10 ± 7.09%) was similar to that for the control group (8.91 ± 7.16%; Mann-Whitney test, p = 0.9646) in the samples with sufficient reads. Moreover, B. coprocola was prevalent and identified in 31.98% (110/344) of all the samples, ranked the top 24th among the 356 species.
Examination of the SNP distributions of the protein-coding genes of the selected 20 gut bacteria revealed that 51,579 genes in the 20 microbe species had valid coverage with sufficient prevalence (see “Methods” section). Among them, we identified 1300 genes (2.52%) with significantly differentiated SNP densities between T2D and control samples (Mann-Whitney test, q < 0.05; Additional file 1: Table S4). All but one of these genes were found in B. coprocola, whose reference genome contains 4291 protein-coding genes. The one gene (EFQ08025.1) not in B. coprocola was from Faecalibacterium cf. prausnitzii. With the reference strain of B. coprocola DSM 17136, we observed that generally there are more SNPs in the HC group compared to those in the T2D group at the gene level (1268/1300, Additional file 1: Table S4).
The present analysis of a Chinese metagenomics dataset revealed that the gut microbiota of T2D patients and HC individuals had different SNP distributions. The gut microbe species B. coprocola, which had a similar relative abundance between the T2D and HC groups, exhibited biased SNP distribution at both the genome and the gene level. Our phylogenetic analysis yielded 49 B. coprocola genes that had characteristic SNP distribution in T2D patients. Two of these genes (EDU99824.1 ranked first and EDV02301.1 ranked sixth) encode glycosyl hydrolases. Glycosyl hydrolases of bacteria play vital roles in degradation of cellulose and starch and hence generate sugars. Interestingly, alpha-glucosidase, which is also a glycosyl hydrolase located in the brush border of the small intestine, is an important drug target of T2D. Therefore, it is possible that EDU99824.1 and EDV02301.1 in T2D-related strain and control strain may have different glycosyl hydrolase properties, which is worthwhile to be investigated in the future.
B. coprocola was previously reported to have high SNP density in the gut microbiota . But the correlation between its SNPs and diseases had not been investigated. Our results indicate that a specific set of B. coprocola strains may be associated with T2D and further suggest that strain-level bacterial colonization of the gut and the potential restorative influence of probiotic supplements should be investigated in T2D therapeutic research.
As is shown, intra-sample variations affect very slightly on our results. However, assuming one strain type per sample may not be general, and intra-sample variations should not be overlooked in strain-level analysis of metagenomics.
National Center for Biotechnology Information
Type 2 diabetes
We thank Dongsheng Zhao and Xiaolei Wang for their help in computing resources.
YC, ZL, SH, JZ, JW, SH, and XY were supported by the China National High Technology Research and Development Program (2014AA020604, X. Ying). XB and MN were supported by the National Natural Science Foundation of China (No. U1435222, X. Bo).
Availability of data and materials
The metagenomics sequencing data analyzed during this study are included in this published article  and the raw Illumina read data are available in the NCBI Sequence Read Archive under accession numbers SRA045646 and SRA050230 (http://www.ncbi.nlm.nih.gov/sra/). Other data supporting the conclusions of this article is included within the article and its additional files.
YC participated in the design of the study, carried out the analysis, and drafted the manuscript. ZL helped to construct the analysis framework. SH helped to perform the statistical analysis. JZ helped to design and plot the figures. JW downloaded the raw data and performed the quality control. NS and XB helped to interpret the results. MN offered guidance on the bioinformatics methods, helped to construct the analysis framework, and drafted and revised the manuscript. XY conceived the study, participated in its design, coordinated and helped to draft the manuscript, and revised the manuscript. All authors read and approved the final version.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
The metagenomics sequencing data in this study were generated by this published article . According to the paper, fecal samples were obtained from the volunteers after signing an informed consent form and the sampling procedure was approved by the Ethical Committee for Clinical Research from the Peking University Shenzhen Hospital, Shenzhen Second People’s Hospital, and Medical Research Center of Guangdong General Hospital.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Turnbaugh PJ, Ley RE, Mahowald MA, Magrini V, Mardis ER, Gordon JI. An obesity-associated gut microbiome with increased capacity for energy harvest. Nature. 2006;444(7122):1027–31. doi:10.1038/nature05414.View ArticleGoogle Scholar
- Larsen N, Vogensen FK, van den Berg FWJ, Nielsen DS, Andreasen AS, Pedersen BK, et al. Gut microbiota in human adults with type 2 diabetes differs from non-diabetic adults. PLoS One. 2010;5:e9085. doi:10.1371/journal.pone.0009085.View ArticlePubMed CentralGoogle Scholar
- Qin J, Li Y, Cai Z, Li S, Zhu J, Zhang F, et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature. 2012;490:55–60. doi:10.1038/nature11450.View ArticleGoogle Scholar
- Karlsson FH, Tremaroli V, Nookaew I, Bergström G, Behre CJ, Fagerberg B, et al. Gut metagenome in European women with normal, impaired and diabetic glucose control. Nature. 2013;498:99–103. doi:10.1038/nature12198.View ArticleGoogle Scholar
- Qin N, Yang F, Li A, Prifti E, Chen Y, Shao L, et al. Alterations of the human gut microbiome in liver cirrhosis. Nature. 2014;513:59–64. doi:10.1038/nature13568.View ArticleGoogle Scholar
- Walters WA, Xu Z, Knight R. Meta-analyses of human gut microbes associated with obesity and IBD. FEBS Lett. 2014;588:4223–33. doi:10.1016/j.febslet.2014.09.039.View ArticlePubMed CentralGoogle Scholar
- Rather IA, Bajpai VK, KUMAR S, Lim J, Paek WK, Park Y-H. Probiotics and atopic dermatitis: an overview. Front Microbiol. 2016;7. doi:10.3389/fmicb.2016.00507
- Segal LN, Clemente JC, Tsay J-CJ, Koralov SB, Keller BC, Wu BG et al. Enrichment of the lung microbiome with oral taxa is associated with lung inflammation of a Th17 phenotype. Nature Microbiol. 2016:16031. doi:10.1038/nmicrobiol.2016.31
- Bagel S, Hüllen V, Wiedemann B, Heisig P. Impact of gyrA and parC mutations on quinolone resistance, doubling time, and supercoiling degree of Escherichia coli. Antimicrob Agents Chemother. 1999;43:868–75.PubMed CentralGoogle Scholar
- Sokurenko EV, Chesnokova V, Dykhuizen DE, Ofek I, Wu XR, Krogfelt KA, et al. Pathogenic adaptation of Escherichia coli by natural variation of the FimH adhesin. Proc Natl Acad Sci U S A. 1998;95(15):8922–6.View ArticlePubMed CentralGoogle Scholar
- Morowitz MJ, Denef VJ, Costello EK, Thomas BC, Poroyko V, Relman DA, et al. Strain-resolved community genomic analysis of gut microbial colonization in a premature infant. Proc Natl Acad Sci U S A. 2011;108(3):1128–33. doi:10.1073/pnas.1010992108.View ArticleGoogle Scholar
- Schloissnig S, Arumugam M, Sunagawa S, Mitreva M, Tap J, Zhu A, et al. Genomic variation landscape of the human gut microbiome. Nature. 2013;493:45–50. doi:10.1038/nature11711.View ArticleGoogle Scholar
- Greenblum S, Carr R, Borenstein E. Extensive strain-level copy-number variation across human gut microbiome species. Cell. 2015;160(4):583–94. doi:10.1016/j.cell.2014.12.038.View ArticlePubMed CentralGoogle Scholar
- Zhu A, Sunagawa S, Mende DR, Bork P. Inter-individual differences in the gene content of human gut bacterial species. Genome Biol. 2015;16:82. doi:10.1186/s13059-015-0646-9.View ArticlePubMed CentralGoogle Scholar
- Donati C, Zolfo M, Albanese D, Tin Truong D, Asnicar F, Iebba V et al. Uncovering oral Neisseria tropism and persistence using metagenomic sequencing. Nature Microbiol. 2016:16070. doi:10.1038/nmicrobiol.2016.70
- Noble D, Mathur R, Dent T, Meads C, Greenhalgh T. Risk models and scores for type 2 diabetes: systematic review. BMJ. 2011;343:d7163. doi:10.1136/bmj.d7163.View ArticlePubMed CentralGoogle Scholar
- Bolger AM, Lohse M, Usadel B. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics. 2014;30(15):2114–20. doi:10.1093/bioinformatics/btu170.View ArticlePubMed CentralGoogle Scholar
- Mitchell A, Bucchini F, Cochrane G, Denise H, ten Hoopen P, Fraser M, et al. EBI metagenomics in 2016—an expanding and evolving resource for the analysis and archiving of metagenomic data. Nucleic Acids Res. 2016;44(D1):D595–603. doi:10.1093/nar/gkv1195.View ArticleGoogle Scholar
- Truong DT, Franzosa EA, Tickle TL, Scholz M, Weingart G, Pasolli E, et al. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nat Methods. 2015;12:902–3. doi:10.1038/nmeth.3589.View ArticleGoogle Scholar
- Li H. Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv:13033997 [q-bio]. 2013.
- Storey JD, Tibshirani R. Statistical significance for genomewide studies. Proc Natl Acad Sci U S A. 2003;100(16):9440–5. doi:10.1073/pnas.1530509100.View ArticlePubMed CentralGoogle Scholar
- Koboldt DC, Zhang Q, Larson DE, Shen D, McLellan MD, Lin L, et al. VarScan 2: somatic mutation and copy number alteration discovery in cancer by exome sequencing. Genome Res. 2012;22(3):568–76. doi:10.1101/gr.129684.111.View ArticlePubMed CentralGoogle Scholar
- Picard. http://broadinstitute.github.io/picard/. Accessed 20 Jan 2016.
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, et al. The sequence alignment/map format and SAMtools. Bioinformatics (Oxford, England). 2009;25:2078–9. doi:10.1093/bioinformatics/btp352.View ArticleGoogle Scholar
- UniProt C. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43(Database issue):D204–12. doi:10.1093/nar/gku989.Google Scholar
- Cingolani P, Platts A, le Wang L, Coon M, Nguyen T, Wang L, et al. A program for annotating and predicting the effects of single nucleotide polymorphisms, SnpEff: SNPs in the genome of Drosophila melanogaster strain w1118; iso-2; iso-3. Fly (Austin). 2012;6(2):80–92. doi:10.4161/fly.19695.View ArticleGoogle Scholar
- Stamatakis A. RAxML version 8: a tool for phylogenetic analysis and post-analysis of large phylogenies. Bioinformatics. 2014;30(9):1312–3. doi:10.1093/bioinformatics/btu033.View ArticlePubMed CentralGoogle Scholar
- Kuhner MK, Beerli P, Yamato J, Felsenstein J. Usefulness of single nucleotide polymorphism data for estimating population parameters. Genetics. 2000;156(1):439–47.PubMed CentralGoogle Scholar
- Yu G, Smith DK, Zhu H, Guan Y, Lam TT-Y. ggtree: an r package for visualization and annotation of phylogenetic trees with their covariates and other associated data. Methods Ecol Evol. 2016:n/a-n/a. doi:10.1111/2041-210X.12628.
- Bodenhofer U, Kothmeier A, Hochreiter S. APCluster: an R package for affinity propagation clustering. Bioinformatics. 2011;27(17):2463–4. doi:10.1093/bioinformatics/btr406.View ArticleGoogle Scholar
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003;13(11):2498–504. doi:10.1101/gr.1239303.View ArticlePubMed CentralGoogle Scholar