1
Abstract
BACKGROUND Disease gene prediction is a critical and challenging task. Many computational methods have been developed to predict disease genes, which can reduce the money and time spent on experimental validation. Since proteins (products of genes) usually work together to achieve a specific function, biomolecular networks, such as the protein-protein interaction (PPI) network and gene co-expression networks, are widely used to predict disease genes by analyzing the relationships between known disease genes and other genes in the networks. However, existing methods commonly use a universal static PPI network, which ignores the fact that PPIs are dynamic and that PPIs should also differ across patients. RESULTS To address these issues, we develop an ensemble algorithm to predict disease genes from clinical sample-based networks (EdgCSN). The algorithm first constructs a single sample-based network for each case sample of the disease under study. Then, these single sample-based networks are merged into several fused networks based on the clustering results of the samples. After that, logistic models are trained with centrality features extracted from the fused networks, and an ensemble strategy is used to predict the final probability of each gene being disease-associated. EdgCSN is evaluated on breast cancer (BC), thyroid cancer (TC) and Alzheimer's disease (AD) and obtains AUC values of 0.970, 0.971 and 0.966, respectively, which are much better than those of the competing algorithms. Subsequent de novo validations also demonstrate the ability of EdgCSN to predict new disease genes. CONCLUSIONS In this study, we propose EdgCSN, an ensemble learning algorithm for predicting disease genes with models trained on centrality features extracted from clinical sample-based networks. Results of the leave-one-out cross validation show that EdgCSN performs much better than the competing algorithms in predicting BC-associated, TC-associated and AD-associated genes. De novo validations also show that EdgCSN is valuable for identifying new disease genes.
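The last two steps described in this abstract (logistic models trained on centrality features, then an ensemble over the fused networks) can be illustrated with a minimal sketch. It is not the authors' implementation: the toy networks, the gene labels, the choice of degree and closeness as the two centrality features, the gradient-descent fit, and the simple averaging of probabilities are all assumptions made for illustration, and the sample clustering that produces the fused networks is skipped.

```python
import math
from collections import deque


def build_adj(edges):
    """Build an undirected adjacency map from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def closeness(adj, src):
    """Closeness of src: number of reachable nodes over the sum of BFS distances."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


def centrality_features(adj):
    """Two features per gene (an assumed choice): degree and closeness centrality."""
    n = len(adj)
    return {g: [len(adj[g]) / (n - 1), closeness(adj, g)] for g in adj}


def predict(w, b, x):
    """Logistic model output for one feature vector."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Plain stochastic gradient descent on the logistic loss."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            err = predict(w, b, xi) - yi
            w = [wj - lr * err * xj for wj, xj in zip(w, xi)]
            b -= lr * err
    return w, b


def ensemble_scores(networks, labels):
    """Train one model per network and average the predicted probabilities."""
    genes = sorted(labels)
    totals = dict.fromkeys(genes, 0.0)
    for adj in networks:
        feats = centrality_features(adj)
        w, b = fit_logistic([feats[g] for g in genes], [labels[g] for g in genes])
        for g in genes:
            totals[g] += predict(w, b, feats[g])
    return {g: t / len(networks) for g, t in totals.items()}


# Hypothetical toy data: two fused networks over six genes; A and B are known disease genes.
net1 = build_adj([("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"), ("B", "C"), ("B", "F")])
net2 = build_adj([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "E"), ("B", "F")])
labels = {"A": 1, "B": 1, "C": 0, "D": 0, "E": 0, "F": 0}
scores = ensemble_scores([net1, net2], labels)
```

In this toy setting the well-connected genes A and B receive higher averaged scores than peripheral genes such as D and F. A real evaluation would score held-out genes (for example by leave-one-out cross validation, as the abstract describes) rather than the training genes.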
Affiliation(s)
- Ping Luo
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada
- Li-Ping Tian
  - School of Information, Beijing Wuzi University, Beijing, 101149, China
- Bolin Chen
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, 710072, China
- Qianghua Xiao
  - School of Mathematics and Physics, University of South China, HengYang, 421001, China
- Fang-Xiang Wu
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada
  - Department of Computer Science, University of Saskatchewan, Saskatoon, S7N 5C9, Canada
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, 571158, China
  - Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, S7N 5A9, Canada
2
Ata SK, Ou-Yang L, Fang Y, Kwoh CK, Wu M, Li XL. Integrating node embeddings and biological annotations for genes to predict disease-gene associations. BMC Syst Biol 2018; 12:138. [PMID: 30598097 PMCID: PMC6311944 DOI: 10.1186/s12918-018-0662-y] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Indexed: 11/10/2022]
Abstract
BACKGROUND Predicting disease causative genes (or simply, disease genes) has played a critical role in understanding the genetic basis of human diseases and in providing disease treatment guidelines. While various computational methods have been proposed for disease gene prediction, the increasing availability of biological information for genes strongly motivates leveraging these valuable data sources to extract useful information for accurately predicting disease genes. RESULTS We present an integrative framework called N2VKO to predict disease genes. Firstly, we learn node embeddings for genes from a protein-protein interaction (PPI) network by adapting the well-known representation learning method node2vec. Secondly, we combine the learned node embeddings with various biological annotations as a rich feature representation for genes, and subsequently build binary classification models for disease gene prediction. Finally, as the data for disease gene prediction are usually imbalanced (i.e. the number of causative genes for a specific disease is much smaller than the number of its non-causative genes), we address this serious data imbalance issue by applying oversampling techniques to improve the prediction performance. Comprehensive experiments demonstrate that our proposed N2VKO significantly outperforms four state-of-the-art methods for disease gene prediction across seven diseases. CONCLUSIONS In this study, we show that node embeddings learned from PPI networks work well for disease gene prediction, while integrating node embeddings with other biological annotations further improves the performance of classification models. Moreover, oversampling techniques for imbalance correction further enhance the prediction performance. In addition, a literature search of the predicted disease genes also shows the effectiveness of our proposed N2VKO framework for disease gene prediction.
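The imbalance correction mentioned in this abstract can be illustrated with the simplest oversampling variant, random duplication of minority-class samples. The abstract does not say which oversampling technique N2VKO uses, so the function below is a hypothetical stand-in rather than the authors' method, and the feature vectors are toy values.

```python
import random


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen minority-class samples until both classes match in size."""
    rng = random.Random(seed)
    pos = [i for i, label in enumerate(y) if label == 1]
    neg = [i for i, label in enumerate(y) if label == 0]
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    if not minority:
        raise ValueError("both classes need at least one sample")
    # Draw with replacement from the minority class to close the gap.
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    indices = list(range(len(y))) + extra
    return [X[i] for i in indices], [y[i] for i in indices]


# Hypothetical toy data: 2 disease genes versus 8 non-disease genes.
X = [[float(i)] for i in range(10)]
y = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
X_bal, y_bal = random_oversample(X, y)
```

Oversampling should be applied to the training split only; duplicating samples before splitting would leak copies of the same gene into the test set.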
Affiliation(s)
- Sezin Kircali Ata
  - Department of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
- Le Ou-Yang
  - Department of Electronic Engineering, College of Information Engineering, Shenzhen University, China, Singapore, Singapore
- Yuan Fang
  - School of Information Systems, Singapore Management University, Singapore, Singapore
- Chee-Keong Kwoh
  - Department of Computer Science and Engineering, Nanyang Technological University, Singapore, Singapore
- Min Wu
  - Data Analytics Department, Institute for Infocomm Research, Singapore, Singapore
- Xiao-Li Li
  - Data Analytics Department, Institute for Infocomm Research, Singapore, Singapore
3
Nikdelfaz O, Jalili S. Disease genes prediction by HMM based PU-learning using gene expression profiles. J Biomed Inform 2018; 81:102-111. [PMID: 29571901 DOI: 10.1016/j.jbi.2018.03.006] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 08/15/2017] [Revised: 11/22/2017] [Accepted: 03/12/2018] [Indexed: 12/24/2022]
Abstract
Predicting disease candidate genes from the human genome is a crucial part of current biomedical research. According to observations, diseases with the same phenotype have similar biological characteristics, and genes associated with these diseases tend to share common functional properties. Therefore, by applying machine learning methods, new disease genes can be predicted based on known ones. In recent studies, semi-supervised learning methods called Positive-Unlabeled Learning (PU-Learning) have been used for predicting disease candidate genes. In this study, a novel method is introduced to predict disease candidate genes from gene expression profiles by learning hidden Markov models. To evaluate the proposed method, it is applied to a mixed set of 398 disease genes from three disease types and 12001 unlabeled genes. Compared to the other methods in the literature, the experimental results indicate a significant improvement in favor of the proposed method.
Affiliation(s)
- Ozra Nikdelfaz
  - Tarbiat Modares University, Computer Engineering Department, Islamic Republic of Iran
- Saeed Jalili
  - Tarbiat Modares University, Computer Engineering Department, Islamic Republic of Iran
4
Lin L, Yang T, Fang L, Yang J, Yang F, Zhao J. Gene gravity-like algorithm for disease gene prediction based on phenotype-specific network. BMC Syst Biol 2017; 11:121. [PMID: 29212543 PMCID: PMC5718078 DOI: 10.1186/s12918-017-0519-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 05/10/2017] [Accepted: 11/24/2017] [Indexed: 01/24/2023]
Abstract
Background Polygenic diseases are usually caused by the dysfunction of multiple genes. Unravelling such disease genes is crucial to fully understand the genetic landscape of diseases at the molecular level. With the advent of the ‘omic’ data era, network-based methods have prominently boosted disease gene discovery. However, how to make better use of different types of data for the prediction of disease genes remains a challenge. Results In this study, we improved the performance of disease gene prediction by integrating the similarity of disease phenotype, biological function and network topology. First, for each phenotype, a phenotype-specific network was constructed by mapping the phenotype similarity information of the given phenotype onto the protein-protein interaction (PPI) network. Then, we developed a gene gravity-like algorithm to score candidate genes based not only on topological similarity but also on functional similarity. We tested the proposed network and algorithm by conducting leave-one-out and leave-10%-out cross validation and compared them with state-of-the-art algorithms. The results favored the phenotype-specific network as well as the gene gravity-like algorithm. Finally, we tested the predictive capacity of the proposed algorithms on a test gene set derived from the DisGeNET database. In addition, potential disease genes of three polygenic diseases (obesity, prostate cancer and lung cancer) were predicted by the proposed methods. We found that the predicted disease genes are highly consistent with literature and database evidence. Conclusions The good performance of phenotype-specific networks indicates that phenotype similarity information has a positive effect on the prediction of disease genes. The proposed gene gravity-like algorithm outperforms the Random Walk with Restart (RWR) algorithm, indicating the predictive capacity gained by combining topological similarity with functional similarity. Our work provides insight into the discovery of disease genes by fusing multiple similarities of genes and diseases.
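For context on the baseline named in this abstract, the following is a minimal sketch of Random Walk with Restart on an undirected network. It shows the standard iteration (walk to a uniformly chosen neighbor with probability 1 - restart, jump back to the seed genes with probability restart). The toy path graph, the restart value of 0.5, and the function name are assumptions; the gene gravity-like algorithm itself is not specified in the abstract and is not reproduced here.

```python
def random_walk_with_restart(adj, seeds, restart=0.5, tol=1e-10, max_iter=1000):
    """Return the steady-state visiting probability of every node.

    adj maps each node to its neighbors; every node must have at least one neighbor.
    """
    nodes = sorted(adj)
    p0 = {v: (1.0 / len(seeds) if v in seeds else 0.0) for v in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart mass goes back to the seeds.
        nxt = {v: restart * p0[v] for v in nodes}
        # The remaining mass is split evenly among each node's neighbors.
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[v] - p[v]) for v in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical toy network: a path A - B - C - D with A as the only known disease gene.
path = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
rwr = random_walk_with_restart(path, seeds={"A"})
```

Candidate genes are then ranked by their steady-state probability, so genes close to the seed set in the network score highest.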
Affiliation(s)
- Limei Lin
  - Department of Mathematics, Army Logistics University of PLA, Chongqing, China
- Tinghong Yang
  - Department of Mathematics, Army Logistics University of PLA, Chongqing, China
- Ling Fang
  - Department of Mathematics, Army Logistics University of PLA, Chongqing, China
- Jian Yang
  - School of Pharmacy, Second Military Medical University, Shanghai, China
- Fan Yang
  - Department of Mathematics, Army Logistics University of PLA, Chongqing, China
- Jing Zhao
  - Institute of Interdisciplinary Complex Research, Shanghai University of Traditional Chinese Medicine, Shanghai, China
5
Abstract
BACKGROUND Understanding the genetic basis of disease is an important challenge in biology and medicine. The observation that disease-related proteins often interact with one another has motivated numerous network-based approaches for deciphering disease mechanisms. In particular, protein-protein interaction networks were successfully used to illuminate disease modules, i.e., interacting proteins working in concert to drive a disease. The identification of these modules can further our understanding of disease mechanisms. METHODS We devised a global method for the prediction of multiple disease modules simultaneously named GLADIATOR (GLobal Approach for DIsease AssociaTed mOdule Reconstruction). GLADIATOR relies on a gold-standard disease phenotypic similarity to obtain a pan-disease view of the underlying modules. To traverse the search space of potential disease modules, we applied a simulated annealing algorithm aimed at maximizing the correlation between module similarity and the gold-standard phenotypic similarity. Importantly, this optimization is employed over hundreds of diseases simultaneously. RESULTS GLADIATOR's predicted modules highly agree with current knowledge about disease-related proteins. Furthermore, the modules exhibit high coherence with respect to functional annotations and are highly enriched with known curated pathways, outperforming previous methods. Examination of the predicted proteins shared by similar diseases demonstrates the diverse role of these proteins in mediating related processes across similar diseases. Last, we provide a detailed analysis of the suggested molecular mechanism predicted by GLADIATOR for hyperinsulinism, suggesting novel proteins involved in its pathology. CONCLUSIONS GLADIATOR predicts disease modules by integrating knowledge of disease-related proteins and phenotypes across multiple diseases. 
The predicted modules are functionally coherent and are more in line with current biological knowledge compared to modules obtained using previous disease-centric methods. The source code for GLADIATOR can be downloaded from http://www.cs.tau.ac.il/~roded/GLADIATOR.zip .
Affiliation(s)
- Yael Silberberg
  - Department of Molecular Microbiology and Biotechnology, Tel Aviv University, Tel Aviv, Israel
- Martin Kupiec
  - Department of Molecular Microbiology and Biotechnology, Tel Aviv University, Tel Aviv, Israel
- Roded Sharan
  - The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel
6
Abstract
BACKGROUND In recent years, long non-coding RNAs (lncRNAs) have been the subject of increasing interest. Thanks to many recent functional studies, the existence of a large class of lncRNAs with potential regulatory functions is now widely accepted. Although an increasing number of lncRNAs are being characterized and shown to be involved in many biological processes, the functions of the vast majority of lncRNA genes are still unknown. Therefore, computational methods able to take advantage of the increasing amount of publicly available data to predict lncRNA functions could be very useful. RESULTS Since coding genes are much better annotated than lncRNAs, we attempted to project known functional information regarding proteins onto non-coding genes using the guilt-by-association principle: if a gene shows an expression profile that correlates with those of a set of coding genes involved in a given function, that gene is probably involved in the same function. We computed gene coexpression for 30 human tissues and 9 vertebrates and mined the resulting networks with a methodology inspired by the rank product algorithm used to identify differentially expressed genes. Using different types of reference data, we can predict putative new annotations for thousands of lncRNAs and proteins, ranging from cellular localization to relevance for disease and cancer. CONCLUSIONS New functions of coding genes and lncRNAs can be profitably predicted using tissue-specific coexpression, as well as the expression of orthologous genes in different species. The data are available for download and through a user-friendly web interface at www.funcpred.com .
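The guilt-by-association principle stated in this abstract can be sketched as follows: a candidate gene is scored for a function by how strongly its expression profile correlates with the profiles of genes already annotated with that function. The mean Pearson correlation used here is an assumed, simplified scoring rule; the rank-product-inspired method of the paper is not described in enough detail in the abstract to reproduce, and all expression values are made-up toy numbers.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length profiles (neither may be constant)."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def association_score(candidate, reference_profiles):
    """Mean correlation between a candidate and the genes annotated with a function."""
    return sum(pearson(candidate, r) for r in reference_profiles) / len(reference_profiles)


# Hypothetical expression profiles across five samples.
annotated = [[1, 2, 3, 4, 5], [2, 4, 6, 8, 11]]  # coding genes sharing one function
lnc_up = [1, 3, 2, 5, 6]    # rises with the annotated genes
lnc_down = [5, 4, 3, 2, 1]  # falls as the annotated genes rise
score_up = association_score(lnc_up, annotated)
score_down = association_score(lnc_down, annotated)
```

Under this rule `lnc_up` would be proposed as sharing the function and `lnc_down` would not; repeating the scoring per tissue or per species and combining the ranks is where a rank-product style aggregation would come in.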
Affiliation(s)
- Umberto Perron
  - Department of Molecular Biotechnology and Health Sciences, University of Turin, via Nizza 52, Torino, 10126 Italy
- Paolo Provero
  - Department of Molecular Biotechnology and Health Sciences, University of Turin, via Nizza 52, Torino, 10126 Italy
  - Center for Translational Genomics and Bioinformatics, San Raffaele Scientific Institute, via Olgettina 60, Milan, 20132 Italy
- Ivan Molineris
  - Department of Molecular Biotechnology and Health Sciences, University of Turin, via Nizza 52, Torino, 10126 Italy
7
Abstract
Background Identifying the genes associated with human diseases is crucial for disease diagnosis and drug design. Computational approaches, especially network-based approaches, have recently been developed to identify disease-related genes effectively from existing biomedical networks. Meanwhile, advances in biotechnology enable researchers to produce multi-omics data, enriching our understanding of human diseases and revealing the complex relationships between genes and diseases. However, none of the existing computational approaches is able to integrate the huge amount of omics data into a weighted integrated network and utilize it to enhance disease-related gene discovery. Results We propose a new network-based disease gene prediction method called SLN-SRW (Simplified Laplacian Normalization-Supervised Random Walk) to generate and model the edge weights of a new biomedical network that integrates biomedical data from heterogeneous sources, thereby enhancing disease-related gene discovery. Conclusions The experimental results show that SLN-SRW significantly improves the performance of disease gene prediction on both the real and the synthetic data sets.
8
Abstract
BACKGROUND Parkinson disease (PD) is a severe neurodegenerative disease without curative drugs. The highly complex and heterogeneous disease mechanisms are still unclear. Detecting novel PD-associated genes not only contributes to revealing the disease pathogenesis, but also facilitates the discovery of new drug targets. METHODS We propose a phenome-based gene prediction strategy to identify disease-associated genes for PD. We integrated multiple disease phenotype networks, a gene functional relationship network, and known PD genes to predict novel candidate genes. Then we investigated the translational potential of the predicted genes in drug discovery. RESULTS In a cross validation analysis, the average rank for 15 known PD genes is within the top 0.8%. We also tested the algorithm with an independent validation set of 669 PD-associated genes detected by genome-wide association studies. The top-ranked genes predicted by our approach are enriched for these validation genes. In addition, our approach prioritized the target genes for FDA-approved PD drugs and for drugs that have been tested for PD in clinical trials. Pathway analysis shows that the prioritized drug target genes are closely associated with PD pathogenesis. The result provides empirical evidence that our computational gene prediction approach identifies novel candidate genes for PD, and has the potential to lead to rapid drug discovery.
Affiliation(s)
- Yang Chen
  - Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
- Rong Xu
  - Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA
9
Tran TD, Kwon YK. Hierarchical closeness efficiently predicts disease genes in a directed signaling network. Comput Biol Chem 2014; 53PB:191-7. [PMID: 25462327 DOI: 10.1016/j.compbiolchem.2014.08.023] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 02/25/2014] [Revised: 08/13/2014] [Accepted: 08/25/2014] [Indexed: 11/21/2022]
Abstract
BACKGROUND Many structural centrality measures were proposed to predict putative disease genes on biological networks. Closeness is one of the best-known structural centrality measures, and its effectiveness for disease gene prediction on undirected biological networks has been frequently reported. However, it is not clear whether closeness is effective for disease gene prediction on directed biological networks such as signaling networks. RESULTS In this paper, we first show that closeness does not significantly outperform other well-known centrality measures such as Degree, Betweenness, and PageRank for disease gene prediction on a human signaling network. In addition, we observed that prediction accuracy by the closeness measure was worse than that by a reachability measure, but closeness could efficiently predict disease genes among a set of genes with the same reachability value. Based on this observation, we devised a novel structural measure, hierarchical closeness, by combining reachability and closeness such that all genes are first ranked by the degree of reachability and then the tied genes are further ranked by closeness. We discovered that hierarchical closeness outperforms other structural centrality measures in disease gene prediction. We also found that the set of highly ranked genes in terms of hierarchical closeness is clearly different from that of hub genes with high connectivity. More interestingly, these findings were consistently reproduced in a random Boolean network model. Finally, we found that genes with relatively high hierarchical closeness are significantly likely to encode proteins in the extracellular matrix and receptor proteins in a human signaling network, supporting the fact that half of all modern medicinal drugs target receptor-encoding genes. CONCLUSION Taken together, hierarchical closeness proposed in this study is a novel structural measure to efficiently predict putative disease genes in a directed signaling network.
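The two-level ranking described in this abstract (reachability first, closeness as the tie-breaker) can be sketched on a small directed network. The abstract does not give the exact formulas, so the definitions below are assumptions: reachability is the number of nodes reachable along directed edges, and closeness is that number divided by the sum of shortest-path distances to those nodes. The toy network and node names are hypothetical.

```python
from collections import deque


def bfs_distances(adj, src):
    """Shortest directed path lengths from src to every other reachable node."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    del dist[src]
    return dist


def hierarchical_ranking(adj):
    """Rank nodes by reachability, breaking ties by closeness (both descending)."""
    keys = {}
    for v in adj:
        dist = bfs_distances(adj, v)
        reach = len(dist)
        close = reach / sum(dist.values()) if reach else 0.0
        keys[v] = (reach, close)
    return sorted(adj, key=lambda v: keys[v], reverse=True)


# Hypothetical directed signaling network; Z is a sink with no outgoing edges.
signaling = {"R1": ["X"], "X": ["Y"], "Y": ["Z"], "R2": ["Y", "Z"], "Z": []}
ranking = hierarchical_ranking(signaling)
# → ['R1', 'R2', 'X', 'Y', 'Z']
```

R1 ranks first because it reaches three nodes, even though its closeness (0.5) is the lowest among the non-sink nodes. R2 and X both reach two nodes, and R2 wins the tie because its targets are nearer.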