Perturbed human sub-networks by Fusobacterium nucleatum candidate virulence proteins

Background
Fusobacterium nucleatum is a gram-negative anaerobic species residing in the oral cavity and implicated in several inflammatory processes in the human body. Although F. nucleatum abundance is increased in inflammatory bowel disease subjects and is prevalent in colorectal cancer patients, the causal role of the bacterium in gastrointestinal disorders and the mechanistic details of host cell functions subversion are not fully understood.

Results
We devised a computational strategy to identify putative secreted F. nucleatum proteins (FusoSecretome) and to infer their interactions with human proteins based on the presence of host molecular mimicry elements. FusoSecretome proteins share similar features with known bacterial virulence factors thereby highlighting their pathogenic potential. We show that they interact with human proteins that participate in infection-related cellular processes and localize in established cellular districts of the host-pathogen interface. Our network-based analysis identified 31 functional modules in the human interactome preferentially targeted by 138 FusoSecretome proteins, among which we selected 26 as main candidate virulence proteins, representing both putative and known virulence proteins. Finally, six of the preferentially targeted functional modules are implicated in the onset and progression of inflammatory bowel diseases and colorectal cancer.

Conclusions
Overall, our computational analysis identified candidate virulence proteins potentially involved in the F. nucleatum-human cross-talk in the context of gastrointestinal diseases.

Electronic supplementary material
The online version of this article (doi:10.1186/s40168-017-0307-1) contains supplementary material, which is available to authorized users.


Gut-disease gene set enrichment using a reduced statistical background
The selection of a large proteome background for the analyses on the list of inferred interactors addresses the general question "Do the interactors participate in cellular processes relevant for the microbe-host interaction?". However, the repertoire of domain-domain and SLiM-domain interaction templates is far from complete. Indeed, the human proteins for which we could infer an interaction based on the templates we used (from 2013) amount to around 56% (11'284 proteins) of the protein-coding genes present in UniprotKB (i.e., 20'254 genes, February 2013). Therefore, we have also assessed the gene set and functional enrichments using this reduced background.
The gene set enrichments still hold with the reduced background except when we consider bacterial targets alone (Table S11). Regarding the enrichment of functional categories (BP, CC, KEGG and Reactome), a substantial fraction of the enriched annotations listed in Table S6A is also significantly over-represented when using the reduced background: 45% of the GO Biological Processes, 53% of the GO Cellular Components, 71% of the KEGG maps and 65% of the Reactome pathways (Table S6B).
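To make the role of the statistical background concrete, the following minimal sketch computes a one-sided Fisher's exact test as the upper tail of the hypergeometric distribution and evaluates it against both background sizes quoted above. Only the two background sizes come from the text. The function name and all other counts are hypothetical placeholders, and for simplicity the number of annotated proteins is kept identical in both backgrounds, which would not hold in a real analysis.

```python
from math import comb


def enrichment_p_value(k, n, K, N):
    """One-sided Fisher's exact test P(X >= k) via the hypergeometric tail.

    N: proteins in the statistical background
    K: background proteins carrying the annotation
    n: inferred interactors (all assumed to lie in the background)
    k: interactors carrying the annotation
    """
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


# Background sizes quoted in the text; every other count is a made-up placeholder.
FULL_BACKGROUND = 20254
REDUCED_BACKGROUND = 11284

# Hypothetical term: 60 of 900 interactors annotated, 400 annotated proteins overall.
p_full = enrichment_p_value(60, 900, 400, FULL_BACKGROUND)
p_reduced = enrichment_p_value(60, 900, 400, REDUCED_BACKGROUND)
print(f"full background:    P = {p_full:.3g}")
print(f"reduced background: P = {p_reduced:.3g}")
```

With identical counts, shrinking the background raises the expected overlap and therefore the P-value, which is why some annotations enriched against the full proteome may no longer pass the threshold against the reduced one.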

Robustness assessment of the network-based analysis
We first verified whether the same number of OCG network modules enriched in inferred interactors (i.e., 31 modules) could be obtained by shuffling proteins in the human interactome while maintaining the network structure. We performed 1'000 permutations and, as expected, we found no modules enriched in inferred interactors in any of the permuted networks (empirical P-value < 0.001). We further assessed the robustness of the interactome-based analysis to possible biases due to the presence of highly studied proteins. For this purpose, we chose an "unbiased network" built from interaction data generated by systematic yeast two-hybrid screens, hereafter named the CCSB network, which we downloaded from the human interactome portal (http://interactome.dfci.harvard.edu/H_sapiens/) and which we already used in [1]. It should be noted, however, that the yeast two-hybrid technique has some inherent limitations [2,3]. For instance, proteins with a strong localization signal, such as secreted proteins, seldom enter the nucleus or require specific conditions for proper folding. This usually leads to an under-representation of secreted and membrane proteins in networks built solely from yeast two-hybrid screens. Since our interactome building pipeline gathered data from specialized databases (e.g., MatrixDB and InnateDB) that chiefly store interactions between extracellular and/or membrane proteins, we sought to quantify the proportion of such proteins in our interactome and in the CCSB network. To do so, we downloaded from MatrixDB (http://matrixdb.univ-lyon1.fr/) three overlapping protein lists: (i) membrane proteins, (ii) secreted proteins and (iii) proteins annotated as components of the extracellular matrix (ECM). Indeed, our network includes a higher proportion of ECM (2.5% compared to 1.4%) and secreted proteins (12.2% compared to 7.5%), whereas the proportion of membrane proteins in the two networks is similar (36% compared to 31%).
Among the inferred interactors, 59 are ECM proteins, 294 are annotated as secreted and 321 are associated with a membrane (union: 443 proteins, 47% of the interactors). Of the 59 ECM interactors, 41 are present in our interactome (69%), but only 5 are in the CCSB network (8%). Similarly, 192 of the 294 secreted interactors are found in our interactome (65%), but only 29 are in the CCSB network (10%). Likewise, our interactome contains more interactors annotated as membrane proteins (34%) compared to the CCSB network (12%).
Nevertheless, we repeated the network-based analyses presented in the manuscript on the CCSB network.
Only 193 inferred interactors are in the CCSB network (~21% compared to ~70% in our human interactome). Both the average degree and the average betweenness of inferred interactors are slightly higher than those of non-interactors, but the differences are not significant (degree: 8.6 vs 6.7, P-value=0.64; betweenness: 0.0012 vs 0.0006, P-value=0.88; two-sided Mann-Whitney U test). There is no significant enrichment of interactors participating in 2 or more network modules (i.e., multifunctional proteins). However, among the multifunctional inferred human interactors we found an enrichment of extreme multifunctional proteins (7 proteins, 5.0-fold, P-value=7.0 × 10^-4, one-sided Fisher's exact test). We next assessed the over-representation of inferred interactors in network modules. None of the modules was significantly enriched. Moreover, none of the modules was enriched in gut disease genes (i.e., CRC mutated, FusoExpr, CD, IBD and UC). This discrepancy between the two analyses suggests that the higher coverage of ECM, secreted and membrane proteins in our interactome provides a better biological context for studying the molecular interplay at the microbe-host interface. In addition, these results underline the methodological specificity of our approach, since none of the modules in the CCSB network is found enriched in interactors, whereas a lack of specificity could have yielded some.
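The label-shuffling check described at the start of this section can be sketched as follows. This is an illustrative toy and not the pipeline used in the study: the enrichment threshold, the absence of multiple-testing correction, the add-one estimator for the empirical P-value and all data are assumptions, and every function name is hypothetical.

```python
import random
from math import comb


def enrichment_p_value(k, n, K, N):
    """Hypergeometric upper tail P(X >= k), i.e. a one-sided Fisher's exact test."""
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / comb(N, n)


def count_enriched_modules(modules, interactors, n_proteins, alpha=0.05):
    """Number of modules over-represented in interactors (uncorrected, assumed alpha)."""
    enriched = 0
    for members in modules:
        k = len(members & interactors)
        if enrichment_p_value(k, len(members), len(interactors), n_proteins) < alpha:
            enriched += 1
    return enriched


def empirical_p_value(hits, n_perm):
    """Add-one estimator (an assumption), so the P-value is never exactly zero."""
    return (hits + 1) / (n_perm + 1)


def permutation_test(modules, proteins, interactors, n_perm=1000, seed=0):
    rng = random.Random(seed)
    observed = count_enriched_modules(modules, interactors, len(proteins))
    hits = 0
    for _ in range(n_perm):
        # Shuffle protein labels over the nodes; edges and modules stay untouched.
        # This is equivalent to moving the interactor labels to random nodes.
        shuffled = proteins[:]
        rng.shuffle(shuffled)
        relabel = dict(zip(proteins, shuffled))
        permuted = {relabel[p] for p in interactors}
        if count_enriched_modules(modules, permuted, len(proteins)) >= observed:
            hits += 1
    return observed, empirical_p_value(hits, n_perm)


# Toy data (hypothetical): 200 proteins, three modules, interactors packed into the first.
proteins = list(range(200))
modules = [set(range(0, 10)), set(range(10, 20)), set(range(20, 30))]
interactors = set(range(0, 10))

observed, p_emp = permutation_test(modules, proteins, interactors, n_perm=200)
print(observed, p_emp)
```

Under this estimator, zero hits in 1'000 permutations give 1/1001, which is consistent with reporting an empirical P-value below 0.001.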

Comparison with other bacterial strains
We downloaded the proteomes (UniprotKB, March 2017) and predicted the putative secretomes of 7 additional bacterial strains (see Table S12) using the SignalP and SecretomeP algorithms (Table S13). We assessed the disorder propensity of peptide-triggered secreted proteins (i.e., predicted by SignalP only) using five disorder predictors, as we did for the F. nucleatum subsp. nucleatum strain (hereafter ATCC 25586). In all the additional strains, we observed a significantly higher disorder content in secreted proteins compared to non-secreted proteins with at least three predictors out of five (see Figures S2-8). This result reinforces our previous observation that secreted proteins are more disordered than non-secreted proteins. We next compared the list of enriched Pfam domains found in the secretomes of the 7 additional strains to that of the ATCC 25586 secretome. We indeed observed that more than half of the ATCC 25586 enriched domains are also enriched in the other Fusobacteria secretomes (from 54% to 82%), whereas 45% of them are found enriched in the K-12 secretome (Table S14). Subsequently, we detected the presence of mimicry elements in the predicted secretomes that can mediate interactions with host proteins. The vast majority of "host-like" interacting domains in the ATCC 25586 strain are also found in the other Fusobacteria and, to a lesser extent, in K-12 (Table S15). Overall, we detected 150 host-like domains: 58 are present in at least two strains, and 20 of these are present in all the strains. Regarding mimicry SLiMs, we also observed that a high fraction of the ELM motifs found in ATCC 25586 are also found in the other strains (from 80% to 96%, Table S15). Intriguingly, the highest fraction of ATCC 25586 mimicry motifs is found in the K-12 secretome.
This can be explained by the fact that the K-12 proteome is in the list of 694 proteomes of commensal and pathogenic bacteria that we used to assess the conservation of mimicry motifs found in the ATCC 25586 secretome, whereas the other Fusobacteria are not included. Overall, we detected 107 mimicry SLiMs, 88 of which are detected in at least two strains (48 in all 8 strains).
Starting from the list of detected mimicry domains and SLiMs, we inferred the interactions with human proteins. Sixty to sixty-six percent of the ATCC 25586 inferred interactors are found among the interactors inferred for the other Fusobacteria strains, and 55% are also found among the K-12 interactors (Table S16). In total, we inferred interactions with 3537 human proteins, 1884 of which interact with a protein of at least two strains. Moreover, 822 human proteins are targeted by all the strains, whereas 697 are specifically targeted by proteins belonging to at least two Fusobacteria. It is interesting to note that the number of common inferred interactors between ATCC 25586 and the other Fusobacteria strains is much higher than between ATCC 25586 and K-12, suggesting that, besides a relatively small core of common interactors, these strains may have evolved specific strategies to invade and maintain themselves within the host cell. We observed a similar pattern when comparing the enrichment of functional categories (Table S17).
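The counts reported above (targets of at least two strains, of all strains, and of Fusobacteria only) amount to simple set bookkeeping, sketched below on toy data. The strain labels, protein identifiers and function name are hypothetical, and "specifically targeted" is read here as "targeted by at least two ingroup strains and not by the outgroup", which is an assumption about the intended definition.

```python
from collections import Counter


def summarize_shared_targets(interactors_by_strain, outgroup):
    """Count human proteins by how many strains are inferred to target them."""
    n_strains = len(interactors_by_strain)
    counts = Counter(p for targets in interactors_by_strain.values() for p in targets)

    ingroup_counts = Counter(
        p
        for strain, targets in interactors_by_strain.items()
        if strain != outgroup
        for p in targets
    )
    outgroup_targets = interactors_by_strain[outgroup]

    return {
        "any_strain": len(counts),
        "at_least_two_strains": sum(1 for c in counts.values() if c >= 2),
        "all_strains": sum(1 for c in counts.values() if c == n_strains),
        # Assumed reading of "specific": hit by >= 2 ingroup strains, never by the outgroup.
        "ingroup_specific": sum(
            1 for p, c in ingroup_counts.items() if c >= 2 and p not in outgroup_targets
        ),
    }


# Toy input (hypothetical): each strain maps to the set of its inferred human interactors.
toy = {
    "fuso_A": {"P1", "P2", "P3"},
    "fuso_B": {"P1", "P2", "P4"},
    "K-12": {"P1", "P5"},
}
summary = summarize_shared_targets(toy, outgroup="K-12")
print(summary)
```

Storing each strain's interactors as a set guarantees that a protein is counted once per strain, so the per-protein tally equals the number of strains targeting it.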
When mapping inferred interactors on the human interactome and assessing their over-representation in network modules, we identified 73 network modules preferentially targeted by at least one of the 8 strains. Six modules (i.e., modules 9, 74, 78, 165, 216 and 689) are preferentially targeted by all strains (Table S18). They are involved in immune response and cell-cell interactions, and their constituent proteins are localized in the extracellular matrix and at the cell surface. This commonality suggests that these modules and their proteins can be important players at the host-microbe interface for all the 8 strains considered. Four modules are preferentially targeted by Fusobacteria only (i.e., modules 19, 246, 577 and 615). The strain with the highest number of specifically targeted modules is K-12 with 22, followed by ATCC 25586 with 6, vincentii 3_1_27 with 3 and animalis 4_8 with 2. Interestingly, these last two show a low level of active invasion of colonic cells. We finally checked how many of the 26 main candidate virulence proteins (i.e., network module perturbators) have potential orthologs in any of the other strains that have also been classified as perturbators by our approach. Ortholog detection is not an easy task and goes well beyond the scope of this work. However, to identify putative orthologs of the ATCC 25586 perturbators, we first clustered the proteomes of the 8 strains at different levels of sequence identity (i.e., 90%, 50% and 40%) using the CD-HIT algorithm [4], setting the suggested running parameters for each sequence identity threshold. We next looked for the network perturbators of the other strains belonging to the same clusters as the ATCC 25586 perturbators. We found that the fraction of "conserved" perturbators is higher among Fusobacteria and varies, as expected, depending on the level of sequence identity (Table S19). Interestingly, only one ATCC 25586 perturbator is found as such among the 90 identified K-12 perturbators (Table S19).
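The cluster-based search for "conserved" perturbators can be illustrated with a short sketch that reads CD-HIT cluster output (the .clstr text format) and reports, for each ATCC 25586 perturbator, the strains that have a perturbator in the same cluster. FN0579 and FN1426 are gene names taken from the text. The other identifiers, the sequence lengths, the cluster contents and the function names are hypothetical, and the actual post-processing used in the study may differ.

```python
import re
from collections import defaultdict

# Matches the sequence identifier between '>' and '...' on a .clstr member line.
MEMBER_ID = re.compile(r">(.+?)\.\.\.")


def parse_clstr(text):
    """Map each sequence identifier to its cluster number."""
    cluster_of = {}
    current = None
    for line in text.splitlines():
        if line.startswith(">Cluster"):
            current = int(line.split()[1])
        else:
            match = MEMBER_ID.search(line)
            if match:
                cluster_of[match.group(1)] = current
    return cluster_of


def conserved_perturbators(cluster_of, reference, others):
    """For each reference perturbator, list strains with a perturbator in the same cluster."""
    by_cluster = defaultdict(set)
    for strain, perturbators in others.items():
        for protein in perturbators:
            if protein in cluster_of:
                by_cluster[cluster_of[protein]].add(strain)
    return {
        protein: sorted(by_cluster.get(cluster_of.get(protein), set()))
        for protein in reference
    }


# Toy cluster file (hypothetical content in the CD-HIT .clstr layout).
toy_clstr = (
    ">Cluster 0\n"
    "0\t320aa, >FN0579... *\n"
    "1\t318aa, >VIN_0001... at 62.50%\n"
    ">Cluster 1\n"
    "0\t210aa, >FN1426... *\n"
    ">Cluster 2\n"
    "0\t150aa, >PER_0007... *\n"
)
clusters = parse_clstr(toy_clstr)
result = conserved_perturbators(
    clusters,
    reference={"FN0579", "FN1426"},
    others={"vincentii": {"VIN_0001"}, "periodonticum": {"PER_0007"}},
)
print(result)
```

Running the same lookup on the cluster files produced at each identity threshold would show how the set of "conserved" perturbators changes with the stringency of the clustering.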
Among "conserved" perturbators, only a small fraction shows a similar module perturbation profile. For instance, the protein encoded by the ATCC 25586 gene FN0579 and its orthologs in the other Fusobacteria strains always target the same modules (i.e., Modules 74 and 577), as do FN1426/FN1950 and their orthologs, which target Module 246. MORN_2-containing proteins are classified as perturbators in the vincentii and periodonticum strains. The vincentii 3_1_35A2 strain preferentially targets 4 out of the 6 modules perturbed by the ATCC 25586 MORN_2 proteins, whereas all the others target 3 of these modules.

Tables

Table S11. Comparison of the enrichment of pathogen protein targets and gut-disease related gene sets using two different statistical backgrounds.
x = enriched (one-sided Fisher's test, corrected P-value<0.1)

Figure S2. Assessment of the disorder propensity of the F. nucleatum subsp. animalis 4_8 secreted proteins (SignalP prediction).
Figure S5. Assessment of the disorder propensity of the F. nucleatum subsp. vincentii 3_1_36A secreted proteins (SignalP prediction).
Figure S7. Assessment of the disorder propensity of the F. periodonticum 2_1_31 secreted proteins (SignalP prediction).
Figure S8. Assessment of the disorder propensity of the E. coli K-12 secreted proteins (SignalP prediction).