1
Liu D, Ames C, Khader S, Rapaport F. SciLinker: a large-scale text mining framework for mapping associations among biological entities. Front Artif Intell 2025; 8:1528562. [PMID: 40212086 PMCID: PMC11983328 DOI: 10.3389/frai.2025.1528562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2024] [Accepted: 02/27/2025] [Indexed: 04/13/2025] Open
Abstract
Introduction: The biomedical literature is the go-to source of information regarding relationships between biological entities, including genes, diseases, cell types, and drugs, but the rapid pace of publication makes an exhaustive manual exploration impossible. In order to efficiently explore an up-to-date repository of millions of abstracts, we constructed an efficient and modular natural language processing pipeline and applied it to the entire PubMed abstract corpora.
Methods: We developed SciLinker using open-source libraries and pre-trained named entity recognition models to identify human genes, diseases, cell types and drugs, normalizing these biological entities to the Unified Medical Language System (UMLS). We implemented a scoring schema to quantify the statistical significance of entity co-occurrences and applied a fine-tuned PubMedBERT model for gene-disease relationship extraction.
Results: We identified and analyzed over 30 million association sentences, including more than 11 million gene-disease co-occurrence sentences, revealing more than 1.25 million unique gene-disease associations. We demonstrate SciLinker's ability to extract specific gene-disease relationships using osteoporosis as a case study. We show how such an analysis benefits target identification as clinically validated targets are enriched in SciLinker-derived disease-associated genes. Moreover, this co-occurrence data can be used to construct disease-specific networks, providing insights into significant relationships among biological entities from scientific literature.
Conclusion: SciLinker represents a novel text mining approach that extracts and quantifies associations between biomedical entities through co-occurrence analysis and relationship extraction from PubMed abstracts. Its modular design enables expansion to additional entities and text corpora, making it a versatile tool for transforming unstructured biomedical data into actionable insights for drug discovery.
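The abstract does not say which statistic the scoring schema uses. Purely as an illustration of how the significance of an entity co-occurrence can be quantified, the sketch below applies a one-sided hypergeometric (Fisher-style) test to sentence counts. The function name, the choice of test, and the counts are assumptions made for this example and are not taken from SciLinker.

```python
from math import comb


def cooccurrence_p_value(n_both: int, n_gene: int, n_disease: int, n_total: int) -> float:
    """One-sided hypergeometric p-value for a gene-disease co-occurrence.

    n_total   -- sentences in the corpus (hypothetical unit of counting)
    n_gene    -- sentences mentioning the gene
    n_disease -- sentences mentioning the disease
    n_both    -- sentences mentioning both

    Returns P(X >= n_both) under the assumption that mentions are independent.
    """
    max_overlap = min(n_gene, n_disease)
    # Number of ways to pick the disease sentences from the whole corpus.
    denominator = comb(n_total, n_disease)
    # Add up every overlap at least as large as the observed one.
    tail = sum(
        comb(n_gene, k) * comb(n_total - n_gene, n_disease - k)
        for k in range(n_both, max_overlap + 1)
    )
    return tail / denominator


if __name__ == "__main__":
    # Made-up counts, for illustration only.
    p = cooccurrence_p_value(n_both=10, n_gene=50, n_disease=40, n_total=1000)
    print(f"p-value = {p:.3e}")
```

A smaller p-value means the pair appears together more often than independent mentions would predict. A real pipeline would also need multiple-testing correction across the many entity pairs.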
Affiliation(s)
- Franck Rapaport
- Target, Disease and Systems Biology, Cambridge, MA, United States
2
Johnson R, Li MM, Noori A, Queen O, Zitnik M. Graph Artificial Intelligence in Medicine. Annu Rev Biomed Data Sci 2024; 7:345-368. [PMID: 38749465 PMCID: PMC11344018 DOI: 10.1146/annurev-biodatasci-110723-024625] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/23/2024]
Abstract
In clinical artificial intelligence (AI), graph representation learning, mainly through graph neural networks and graph transformer architectures, stands out for its capability to capture intricate relationships and structures within clinical datasets. With diverse data, from patient records to imaging, graph AI models process data holistically by viewing modalities and entities within them as nodes interconnected by their relationships. Graph AI facilitates model transfer across clinical tasks, enabling models to generalize across patient populations without additional parameters and with minimal to no retraining. However, the importance of human-centered design and model interpretability in clinical decision-making cannot be overstated. Since graph AI models capture information through localized neural transformations defined on relational datasets, they offer both an opportunity and a challenge in elucidating model rationale. Knowledge graphs can enhance interpretability by aligning model-driven insights with medical knowledge. Emerging graph AI models integrate diverse data modalities through pretraining, facilitate interactive feedback loops, and foster human-AI collaboration, paving the way toward clinically meaningful predictions.
Affiliation(s)
- Ruth Johnson
- Berkowitz Family Living Laboratory, Harvard Medical School, Boston, Massachusetts, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Michelle M Li
- Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, Massachusetts, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Ayush Noori
- Department of Computer Science, Harvard John A. Paulson School of Engineering and Applied Sciences, Allston, Massachusetts, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Owen Queen
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
- Marinka Zitnik
- Harvard Data Science Initiative, Cambridge, Massachusetts, USA
- Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Allston, Massachusetts, USA
- Broad Institute of MIT and Harvard, Cambridge, Massachusetts, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, Massachusetts, USA
3
Wang X, Yang K, Jia T, Gu F, Wang C, Xu K, Shu Z, Xia J, Zhu Q, Zhou X. KDGene: knowledge graph completion for disease gene prediction using interactional tensor decomposition. Brief Bioinform 2024; 25:bbae161. [PMID: 38605639 PMCID: PMC11009469 DOI: 10.1093/bib/bbae161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/03/2023] [Revised: 02/20/2024] [Accepted: 03/13/2024] [Indexed: 04/13/2024] Open
Abstract
The accurate identification of disease-associated genes is crucial for understanding the molecular mechanisms underlying various diseases. Most current methods focus on constructing biological networks and utilizing machine learning, particularly deep learning, to identify disease genes. However, these methods overlook complex relations among entities in biological knowledge graphs. Such information has been successfully applied in other areas of life science research, demonstrating its effectiveness. Knowledge graph embedding methods can learn the semantic information of different relations within the knowledge graphs. Nonetheless, the performance of existing representation learning techniques, when applied to domain-specific biological data, remains suboptimal. To solve these problems, we construct a biological knowledge graph centered on diseases and genes, and develop an end-to-end knowledge graph completion framework for disease gene prediction using interactional tensor decomposition, named KDGene. KDGene incorporates an interaction module that bridges entity and relation embeddings within tensor decomposition, aiming to improve the representation of semantically similar concepts in specific domains and enhance the ability to accurately predict disease genes. Experimental results show that KDGene significantly outperforms state-of-the-art algorithms, whether existing disease gene prediction methods or knowledge graph embedding methods for general domains. Moreover, the comprehensive biological analysis of the predicted results further validates KDGene's capability to accurately identify new candidate genes. This work proposes a scalable knowledge graph completion framework to identify disease candidate genes, whose results promise to provide valuable references for further wet-lab experiments. Data and source code are available at https://github.com/2020MEAI/KDGene.
Affiliation(s)
- Kuo Yang
- Corresponding author: Kuo Yang and Xuezhong Zhou, Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China.
- Xuezhong Zhou
- Corresponding author: Kuo Yang and Xuezhong Zhou, Institute of Medical Intelligence, Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer Science & Technology, Beijing Jiaotong University, Beijing 100044, China.
4
Meem TM, Khan U, Mredul MBR, Awal MA, Rahman MH, Khan MS. A Comprehensive Bioinformatics Approach to Identify Molecular Signatures and Key Pathways for the Huntington Disease. Bioinform Biol Insights 2023; 17:11779322231210098. [PMID: 38033382 PMCID: PMC10683407 DOI: 10.1177/11779322231210098] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 01/06/2023] [Accepted: 10/07/2023] [Indexed: 12/02/2023] Open
Abstract
Huntington disease (HD) is a degenerative brain disease caused by the expansion of CAG (cytosine-adenine-guanine) repeats; it is inherited as a dominant trait and progressively worsens over time, posing a threat. Although HD is monogenic, its specific pathophysiology and biomarkers remain unknown, it is difficult to diagnose at an early stage, and identification is limited in accuracy and precision. This study combined bioinformatics analysis and network-based systems biology approaches to discover the biomarkers, pathways, and drug targets related to the molecular mechanism of HD etiology. The gene expression profile data sets GSE64810 and GSE95343 were analyzed to predict molecular markers in HD, and 162 mutual differentially expressed genes (DEGs) were detected. Ten hub genes among them (DUSP1, NKX2-5, GLI1, KLF4, SCNN1B, NPHS1, SGK2, PITX2, S100A4, and MSX1) were identified from the protein-protein interaction (PPI) network, most of which were down-regulated. Following that, transcription factor (TF)-DEG interactions (FOXC1, GATA2, etc), TF-microRNA (miRNA) interactions (hsa-miR-340, hsa-miR-34a, etc), protein-drug interactions, and disorders associated with DEGs were predicted. Furthermore, we used gene set enrichment analysis (GSEA) to highlight relevant gene ontology terms (eg, TF activity, sequence-specific DNA binding) linked to DEGs in HD. Disease interactions revealed the diseases that are linked to HD, and prospective small drug molecules such as cytarabine and arsenite were predicted against HD. This study reveals molecular biomarkers at the RNA and protein levels that may help improve the understanding of molecular mechanisms, early diagnosis, and prospective pharmacologic targets for designing beneficial HD treatment.
Affiliation(s)
- Tahera Mahnaz Meem
- Statistics Discipline, Science, Engineering & Technology School, Khulna University, Khulna, Bangladesh
- Umama Khan
- Biotechnology & Genetic Engineering Discipline, Khulna University, Khulna, Bangladesh
- Md Bazlur Rahman Mredul
- Statistics Discipline, Science, Engineering & Technology School, Khulna University, Khulna, Bangladesh
- Md Abdul Awal
- Electronics and Communication Engineering Discipline, Khulna University, Khulna, Bangladesh
- Md Habibur Rahman
- Department of Computer Science and Engineering, Islamic University, Kushtia, Bangladesh
- Md Salauddin Khan
- Statistics Discipline, Science, Engineering & Technology School, Khulna University, Khulna, Bangladesh
5
Wang Z, Gu Y, Zheng S, Yang L, Li J. MGREL: A multi-graph representation learning-based ensemble learning method for gene-disease association prediction. Comput Biol Med 2023; 155:106642. [PMID: 36805231 DOI: 10.1016/j.compbiomed.2023.106642] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 09/24/2022] [Revised: 01/15/2023] [Accepted: 02/05/2023] [Indexed: 02/12/2023]
Abstract
The identification of gene-disease associations plays an important role in the exploration of pathogenic mechanisms and therapeutic targets. Computational methods have been regarded as an effective way to discover the potential gene-disease associations in recent years. However, most of them ignored the combination of abundant genetic, therapeutic information, and gene-disease network topology. To this end, we re-organized the current gene-disease association benchmark dataset by extracting the newest gene-disease associations from the OMIM database. Then, we developed a multi-graph representation learning-based ensemble model, named MGREL to predict gene-disease associations. MGREL integrated two feature generation channels to extract gene and disease features, including a knowledge extraction channel which learned high-order representations from genetic and therapeutic information, and a graph learning channel which acquired network topological representations through multiple advanced graph representation learning methods. Then, an ensemble learning method with 5 machine learning models was used as the classifier to predict the gene-disease association. Comprehensive experiments have demonstrated the significant performance achieved by MGREL compared to 5 state-of-the-art methods. For the major measurements (AUC = 0.925, AUPR = 0.935), the relative improvements of MGREL compared to the suboptimal methods are 3.24%, and 2.75%, respectively. MGREL also achieved impressive improvements in the challenging tasks of predicting potential associations for unknown genes/diseases. In addition, case studies implied potential applications for MGREL in the discovery of potential therapeutic targets.
Affiliation(s)
- Ziyang Wang
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, 100020, China
- Yaowen Gu
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, 100020, China
- Si Zheng
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, 100020, China
- Institute for Artificial Intelligence, Department of Computer Science and Technology, BNRist, Tsinghua University, Beijing, 100084, China
- Lin Yang
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, 100020, China
- Jiao Li
- Institute of Medical Information (IMI), Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, 100020, China
6
Jagodnik KM, Shvili Y, Bartal A. HetIG-PreDiG: A Heterogeneous Integrated Graph Model for Predicting Human Disease Genes based on gene expression. PLoS One 2023; 18:e0280839. [PMID: 36791052 PMCID: PMC9931161 DOI: 10.1371/journal.pone.0280839] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/12/2022] [Accepted: 01/10/2023] [Indexed: 02/16/2023] Open
Abstract
Graph analytical approaches permit identifying novel genes involved in complex diseases, but are limited by (i) inferring structural network similarity of connected gene nodes, ignoring potentially relevant unconnected nodes; (ii) using homogeneous graphs, missing gene-disease associations' complexity; (iii) relying on disease/gene-phenotype associations' similarities, involving highly incomplete data; (iv) using binary classification, with gene-disease edges as positive training samples, and non-associated gene and disease nodes as negative samples that may include currently unknown disease genes; or (v) reporting predicted novel associations without systematically evaluating their accuracy. Addressing these limitations, we develop the Heterogeneous Integrated Graph for Predicting Disease Genes (HetIG-PreDiG) model that includes gene-gene, gene-disease, and gene-tissue associations. We predict novel disease genes using low-dimensional representation of nodes accounting for network structure, and extending beyond network structure using the developed Gene-Disease Prioritization Score (GDPS) reflecting the degree of gene-disease association via gene co-expression data. For negative training samples, we select non-associated gene and disease nodes with lower GDPS that are less likely to be affiliated. We evaluate the developed model's success in predicting novel disease genes by analyzing the prediction probabilities of gene-disease associations. HetIG-PreDiG successfully predicts (Micro-F1 = 0.95) gene-disease associations, outperforming baseline models, and is validated using published literature, thus advancing our understanding of complex genetic diseases.
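The abstract reports a Micro-F1 score without describing the evaluation setup. As a generic reminder of what the metric measures, the sketch below pools true positives, false positives, and false negatives over all classes before computing F1. The function name and the toy labels are hypothetical and are not taken from HetIG-PreDiG.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP and FN over all classes, then compute F1."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = fp = fn = 0
    # Treat each class in turn as the "positive" class and accumulate counts.
    for label in set(y_true) | set(y_pred):
        for truth, pred in zip(y_true, y_pred):
            if pred == label and truth == label:
                tp += 1
            elif pred == label:
                fp += 1
            elif truth == label:
                fn += 1
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy labels: 1 = associated gene-disease pair, 0 = non-associated pair.
    print(micro_f1([1, 1, 0, 0], [1, 0, 0, 0]))
```

When every sample has exactly one label, as in this toy case, micro-averaged F1 coincides with plain accuracy.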
Affiliation(s)
- Kathleen M. Jagodnik
- The School of Business Administration, Bar-Ilan University, Ramat Gan, Israel
- Department of Psychiatry, Harvard Medical School, Boston, MA, United States of America
- Department of Psychiatry, Massachusetts General Hospital, Boston, MA, United States of America
- Yael Shvili
- Department of Surgery A, Meir Medical Center, Kfar Sava, Israel
- Alon Bartal
- The School of Business Administration, Bar-Ilan University, Ramat Gan, Israel
7
Abdelhalim H, Berber A, Lodi M, Jain R, Nair A, Pappu A, Patel K, Venkat V, Venkatesan C, Wable R, Dinatale M, Fu A, Iyer V, Kalove I, Kleyman M, Koutsoutis J, Menna D, Paliwal M, Patel N, Patel T, Rafique Z, Samadi R, Varadhan R, Bolla S, Vadapalli S, Ahmed Z. Artificial Intelligence, Healthcare, Clinical Genomics, and Pharmacogenomics Approaches in Precision Medicine. Front Genet 2022; 13:929736. [PMID: 35873469 PMCID: PMC9299079 DOI: 10.3389/fgene.2022.929736] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 04/27/2022] [Accepted: 05/25/2022] [Indexed: 12/13/2022] Open
Abstract
Precision medicine has greatly aided in improving health outcomes using earlier diagnosis and better prognosis for chronic diseases. It makes use of clinical data associated with the patient as well as their multi-omics/genomic data to reach a conclusion regarding how a physician should proceed with a specific treatment. Compared to the symptom-driven approach in medicine, precision medicine considers the critical fact that all patients do not react to the same treatment or medication in the same way. When considering the intersection of traditionally distinct arenas of medicine, that is, artificial intelligence, healthcare, clinical genomics, and pharmacogenomics—what ties them together is their impact on the development of precision medicine as a field and how they each contribute to patient-specific, rather than symptom-specific patient outcomes. This study discusses the impact and integration of these different fields in the scope of precision medicine and how they can be used in preventing and predicting acute or chronic diseases. Additionally, this study also discusses the advantages as well as the current challenges associated with artificial intelligence, healthcare, clinical genomics, and pharmacogenomics.
Affiliation(s)
- Habiba Abdelhalim
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Asude Berber
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Mudassir Lodi
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Rihi Jain
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Achuth Nair
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Anirudh Pappu
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Kush Patel
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Vignesh Venkat
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Cynthia Venkatesan
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Raghu Wable
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Matthew Dinatale
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Allyson Fu
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Vikram Iyer
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Ishan Kalove
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Marc Kleyman
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Joseph Koutsoutis
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- David Menna
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Mayank Paliwal
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Nishi Patel
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Thirth Patel
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Zara Rafique
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Rothela Samadi
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Roshan Varadhan
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Shreyas Bolla
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Sreya Vadapalli
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Zeeshan Ahmed
- Rutgers Institute for Health, Health Care Policy and Aging Research, Rutgers University, New Brunswick, NJ, United States
- Department of Medicine, Rutgers Robert Wood Johnson Medical School, Rutgers Biomedical and Health Sciences, New Brunswick, NJ, United States
8
Hossain MA, Al Amin M, Hasan MI, Sohel M, Ahammed MA, Mahmud SH, Rahman MR, Rahman MH. Bioinformatics and system biology approaches to identify molecular pathogenesis of polycystic ovarian syndrome, type 2 diabetes, obesity, and cardiovascular disease that are linked to the progression of female infertility. Informatics in Medicine Unlocked 2022. [DOI: 10.1016/j.imu.2022.100960] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2022] Open
9
Zhang Y, Chen L, Li S. CIPHER-SC: Disease-Gene Association Inference Using Graph Convolution on a Context-Aware Network With Single-Cell Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:819-829. [PMID: 32809944 DOI: 10.1109/tcbb.2020.3017547] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/11/2023]
Abstract
Inference of disease-gene associations helps unravel the pathogenesis of diseases and contributes to their treatment. Although many machine learning-based methods have been developed to predict causative genes, accurate association inference remains challenging. One major reason is the inaccurate feature selection and accumulation of error brought by the commonly used multi-stage training architecture. In addition, the existing methods do not incorporate cell-type-specific information, and thus fail to study gene functions at a higher resolution. Therefore, we introduce single-cell transcriptome data and construct a context-aware network to unbiasedly integrate all data sources. Then we develop a graph convolution-based approach named CIPHER-SC to realize a complete end-to-end learning architecture. Our approach outperforms four state-of-the-art approaches in five-fold cross-validations on three distinct test sets with the best AUC of 0.9501, demonstrating its stable ability either to predict the novel genes or to predict with genetic basis. The ablation study shows that our complete end-to-end design and unbiased data integration boost the performance from 0.8727 to 0.9443 in AUC. The addition of single-cell data further improves the prediction accuracy and enriches our results for cell-type-specific genes. These results confirm the ability of CIPHER-SC to discover reliable disease genes. Our implementation is available at http://github.com/YidingZhang117/CIPHER-SC.
10
Yang K, Zheng Y, Lu K, Chang K, Wang N, Shu Z, Yu J, Liu B, Gao Z, Zhou X. PDGNet: Predicting Disease Genes Using a Deep Neural Network With Multi-View Features. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:575-584. [PMID: 32750864 DOI: 10.1109/tcbb.2020.3002771] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Indexed: 06/11/2023]
Abstract
The knowledge of phenotype-genotype associations is crucial for the understanding of disease mechanisms. Numerous studies have focused on developing efficient and accurate computing approaches to predict disease genes. However, owing to the sparseness and complexity of medical data, developing an efficient deep neural network model to identify disease genes remains a huge challenge. Therefore, we develop a novel deep neural network model that fuses the multi-view features of phenotypes and genotypes to identify disease genes (termed PDGNet). Our model integrated the multi-view features of diseases and genes and leveraged the feedback information of training samples to optimize the parameters of the deep neural network and obtain the deep vector features of diseases and genes. The evaluation experiments on a large data set indicated that PDGNet obtained higher performance than the state-of-the-art method (precision and recall improved by 9.55 and 9.63 percent). The analysis results for the candidate genes indicated that the predicted genes have strong functional homogeneity and dense interactions with known genes. We validated the top predicted genes of Parkinson's disease based on external curated data and published medical literature, which indicated that the candidate genes have a huge potential to guide the selection of causal genes in wet-lab experiments. The source code and the data of PDGNet are available at https://github.com/yangkuoone/PDGNet.
11
Lee PY, Yeoh Y, Omar N, Pung YF, Lim LC, Low TY. Molecular tissue profiling by MALDI imaging: recent progress and applications in cancer research. Crit Rev Clin Lab Sci 2021; 58:513-529. [PMID: 34615421 DOI: 10.1080/10408363.2021.1942781] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Indexed: 12/11/2022]
Abstract
Matrix-assisted laser desorption/ionization (MALDI) imaging is an emergent technology that has been increasingly adopted in cancer research. MALDI imaging is capable of providing global molecular mapping of the abundance and spatial information of biomolecules directly in the tissues without labeling. It enables the characterization of a wide spectrum of analytes, including proteins, peptides, glycans, lipids, drugs, and metabolites and is well suited for both discovery and targeted analysis. An advantage of MALDI imaging is that it maintains tissue integrity, which allows correlation with histological features. It has proven to be a valuable tool for probing tumor heterogeneity and has been increasingly applied to interrogate molecular events associated with cancer. It provides unique insights into both the molecular content and spatial details that are not accessible by other techniques, and it has allowed considerable progress in the field of cancer research. In this review, we first provide an overview of the MALDI imaging workflow and approach. We then highlight some useful applications in various niches of cancer research, followed by a discussion of the challenges, recent developments and future prospect of this technique in the field.
Affiliation(s)
- Pey Yee Lee
- UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Yeelon Yeoh
- UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Nursyazwani Omar
- UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
- Yuh-Fen Pung
- Division of Biomedical Science, University of Nottingham Malaysia, Selangor, Malaysia
- Lay Cheng Lim
- Department of Life Sciences, School of Pharmacy, International Medical University (IMU), Kuala Lumpur, Malaysia
- Teck Yew Low
- UKM Medical Molecular Biology Institute (UMBI), Universiti Kebangsaan Malaysia, Kuala Lumpur, Malaysia
12
Yang H, Ding Y, Tang J, Guo F. Identifying potential association on gene-disease network via dual hypergraph regularized least squares. BMC Genomics 2021; 22:605. [PMID: 34372777 PMCID: PMC8351363 DOI: 10.1186/s12864-021-07864-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/09/2021] [Accepted: 06/29/2021] [Indexed: 12/27/2022] Open
Abstract
BACKGROUND: Identifying potential associations between genes and diseases via biomedical experiments is time-consuming and expensive. Computational technologies based on machine learning models have been widely utilized to explore genetic information related to complex diseases. Importantly, gene-disease association detection can be defined as a link prediction problem in a bipartite network. However, many existing methods do not utilize multiple sources of biological information; additionally, they do not extract higher-order relationships among genes and diseases.
RESULTS: In this study, we propose a novel method called Dual Hypergraph Regularized Least Squares (DHRLS) with Centered Kernel Alignment-based Multiple Kernel Learning (CKA-MKL), in order to detect all potential gene-disease associations. First, we construct multiple kernels based on various biological data sources in the gene and disease spaces, respectively. After that, we use CKA-MKL to obtain the optimal kernels in the two spaces. Specifically, hypergraphs are employed to establish higher-order relationships. Finally, our DHRLS model is solved by the Alternating Least Squares algorithm (ALSA) to predict gene-disease associations.
CONCLUSION: Compared with many outstanding prediction tools, DHRLS achieves the best performance on the gene-disease association network under two types of cross-validation. In a robustness check, our proposed approach also shows excellent prediction performance on six real-world networks. Our research work can effectively discover potential disease-associated genes and provide guidance for the follow-up verification methods of complex diseases.
Affiliation(s)
- Hongpeng Yang
  - School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- Yijie Ding
  - Yangtze Delta Region Institute, University of Electronic Science and Technology of China, Quzhou, China
- Jijun Tang
  - Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
- Fei Guo
  - School of Computer Science and Engineering, Central South University, Changsha, China
|
13
|
Zhang XM, Liang L, Liu L, Tang MJ. Graph Neural Networks and Their Current Applications in Bioinformatics. Front Genet 2021; 12:690049. [PMID: 34394185 PMCID: PMC8360394 DOI: 10.3389/fgene.2021.690049] [Citation(s) in RCA: 78] [Impact Index Per Article: 19.5] [Received: 04/02/2021] [Accepted: 05/28/2021] [Indexed: 12/22/2022] Open
Abstract
Graph neural networks (GNNs), as a branch of deep learning in non-Euclidean space, perform particularly well in various tasks that process graph structure data. With the rapid accumulation of biological network data, GNNs have also become an important tool in bioinformatics. In this research, a systematic survey of GNNs and their advances in bioinformatics is presented from multiple perspectives. We first introduce some commonly used GNN models and their basic principles. Then, three representative tasks are proposed based on the three levels of structural information that can be learned by GNNs: node classification, link prediction, and graph generation. Meanwhile, according to the specific applications for various omics data, we categorize and discuss the related studies in three aspects: disease prediction, drug discovery, and biomedical imaging. Based on the analysis, we provide an outlook on the shortcomings of current studies and point out their developing prospect. Although GNNs have achieved excellent results in many biological tasks at present, they still face challenges in terms of low-quality data processing, methodology, and interpretability and have a long road ahead. We believe that GNNs are potentially an excellent method that solves various biological problems in bioinformatics research.
Affiliation(s)
- Xiao-Meng Zhang
  - School of Information, Yunnan Normal University, Kunming, China
- Li Liang
  - School of Information, Yunnan Normal University, Kunming, China
- Lin Liu
  - School of Information, Yunnan Normal University, Kunming, China
  - Key Laboratory of Educational Informatization for Nationalities Ministry of Education, Yunnan Normal University, Kunming, China
- Ming-Jing Tang
  - Key Laboratory of Educational Informatization for Nationalities Ministry of Education, Yunnan Normal University, Kunming, China
  - School of Life Sciences, Yunnan Normal University, Kunming, China
|
14
|
Huang Q, Wang J, Zhang X, Guo M, Yu G. IsoDA: Isoform-Disease Association Prediction by Multiomics Data Fusion. J Comput Biol 2021; 28:804-819. [PMID: 33826865 DOI: 10.1089/cmb.2020.0626] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/23/2022] Open
Abstract
A gene can be spliced into different isoforms by alternative splicing, which contributes to the functional diversity of protein species. Computational prediction of gene-disease associations (GDAs) has been studied for decades. However, the process of identifying the isoform-disease associations (IDAs) at a large scale is rarely explored, which can decipher the pathology at a more granular level. The main bottleneck is the lack of IDAs in current databases and the multilevel omics data fusion. To bridge this gap, we propose a computational approach called Isoform-Disease Association prediction by multiomics data fusion (IsoDA) to predict IDAs. Based on the relationship between a gene and its spliced isoforms, IsoDA first introduces a dispatch and aggregation term to dispatch gene-disease associations to individual isoforms, and reversely aggregate these dispatched associations to their hosting genes. At the same time, it fuses the genome, transcriptome, and proteome data by joint matrix factorization to improve the prediction of IDAs. Experimental results show that IsoDA significantly outperforms the related state-of-the-art methods at both the gene level and isoform level. A case study further shows that IsoDA credibly identifies three isoforms spliced from apolipoprotein E, which have individual associations with Alzheimer's disease, and two isoforms spliced from vascular endothelial growth factor A, which have different associations with coronary heart disease. The codes of IsoDA are available at http://mlda.swu.edu.cn/codes.php?name=IsoDA.
Affiliation(s)
- Qiuyue Huang
  - College of Computer and Information Science, Southwest University, Chongqing, China
  - School of Software, Shandong University, Jinan, China
- Jun Wang
  - School of Software, Shandong University, Jinan, China
- Xiangliang Zhang
  - Department of Computer Science, Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Maozu Guo
  - Department of Computer Science, College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
- Guoxian Yu
  - College of Computer and Information Science, Southwest University, Chongqing, China
  - School of Software, Shandong University, Jinan, China
  - Department of Computer Science, Computer, Electrical and Mathematical Science and Engineering Division, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
|
15
|
Luo P, Chen B, Liao B, Wu F. Predicting disease‐associated genes: Computational methods, databases, and evaluations. WIRES DATA MINING AND KNOWLEDGE DISCOVERY 2021; 11. [DOI: 10.1002/widm.1383] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/28/2019] [Accepted: 06/13/2020] [Indexed: 09/09/2024]
Abstract
Complex diseases are associated with a set of genes (called disease genes), the identification of which can help scientists uncover the mechanisms of diseases and develop new drugs and treatment strategies. Due to the huge cost and time of experimental identification techniques, many computational algorithms have been proposed to predict disease genes. Although several review publications in recent years have discussed many computational methods, some of them focus on cancer driver genes while others focus on biomolecular networks, which only cover a specific aspect of existing methods. In this review, we summarize existing methods and classify them into three categories based on their rationales. Then, the algorithms, biological data, and evaluation methods used in the computational prediction are discussed. Finally, we highlight the limitations of existing methods and point out some future directions for improving these algorithms. This review could help investigators understand the principles of existing methods, and thus develop new methods to advance the computational prediction of disease genes. This article is categorized under: Technologies > Machine Learning; Technologies > Prediction; Algorithmic Development > Biological Data Mining
Affiliation(s)
- Ping Luo
  - Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
- Bolin Chen
  - School of Computer Science and Technology, Northwestern Polytechnical University, China
- Bo Liao
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Fang‐Xiang Wu
  - Department of Mechanical Engineering and Department of Computer Science, University of Saskatchewan, Saskatoon, Canada
|
16
|
Yang K, Lu K, Wu Y, Yu J, Liu B, Zhao Y, Chen J, Zhou X. A network-based machine-learning framework to identify both functional modules and disease genes. Hum Genet 2021; 140:897-913. [PMID: 33409574 DOI: 10.1007/s00439-020-02253-0] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 09/11/2020] [Accepted: 12/22/2020] [Indexed: 01/20/2023]
Abstract
Disease gene identification is a critical step towards uncovering the molecular mechanisms of diseases and systematically investigating complex disease phenotypes. Despite considerable efforts to develop powerful computing methods, candidate gene identification remains a severe challenge owing to the connectivity of an incomplete interactome network, which hampers the discovery of true novel candidate genes. We developed a network-based machine-learning framework to identify both functional modules and disease candidate genes. In this framework, we designed a semi-supervised non-negative matrix factorization model to obtain the functional modules related to the diseases and genes. Of note, we proposed a disease gene-prioritizing method called MapGene that integrates the correlations from both functional modules and network closeness. Our framework identified a set of functional modules with highly functional homogeneity and close gene interactions. Experiments on a large-scale benchmark dataset showed that MapGene performs significantly better than the state-of-the-art algorithms. Further analysis demonstrates MapGene can effectively relieve the impact of the incompleteness of interactome networks and obtain highly reliable rankings of candidate genes. In addition, disease cases on Parkinson's disease and diabetes mellitus confirmed the generalization of MapGene for novel candidate gene identification. This work proposed, for the first time, an integrated computing framework to predict both functional modules and disease candidate genes. The methodology and results support that our framework has the potential to help discover underlying functional modules and reliable candidate genes in human disease.
Affiliation(s)
- Kuo Yang
  - School of Computer and Information Technology, Institute of Medical Intelligence, Beijing Jiaotong University, Beijing, 100044, China
  - Institute for TCM-X, MOE Key Laboratory of Bioinformatics / Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, Beijing, 10084, China
- Kezhi Lu
  - School of Computer and Information Technology, Institute of Medical Intelligence, Beijing Jiaotong University, Beijing, 100044, China
  - imec-DistriNet, KU Leuven, Leuven, 3001, Belgium
- Yang Wu
  - Key Laboratory of Intelligent Information Processing, Advanced Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
- Jian Yu
  - Beijing Key Laboratory of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing, 100044, China
- Baoyan Liu
  - Data Center of Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing, 100700, China
- Yi Zhao
  - Key Laboratory of Intelligent Information Processing, Advanced Computer Research Center, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, 100190, China
- Jianxin Chen
  - Beijing University of Chinese Medicine, Beijing, 100029, China
- Xuezhong Zhou
  - School of Computer and Information Technology, Institute of Medical Intelligence, Beijing Jiaotong University, Beijing, 100044, China
  - Data Center of Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing, 100700, China
|
17
|
Bean DM, Al-Chalabi A, Dobson RJB, Iacoangeli A. A Knowledge-Based Machine Learning Approach to Gene Prioritisation in Amyotrophic Lateral Sclerosis. Genes (Basel) 2020; 11:E668. [PMID: 32575372 PMCID: PMC7349022 DOI: 10.3390/genes11060668] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 05/13/2020] [Revised: 06/13/2020] [Accepted: 06/16/2020] [Indexed: 02/07/2023] Open
Abstract
Amyotrophic lateral sclerosis is a neurodegenerative disease of the upper and lower motor neurons resulting in death from neuromuscular respiratory failure, typically within two to five years of first symptoms. Several rare disruptive gene variants have been associated with ALS and are responsible for about 15% of all cases. Although our knowledge of the genetic landscape of this disease is improving, it remains limited. Machine learning models trained on the available protein-protein interaction and phenotype-genotype association data can use our current knowledge of the disease genetics for the prediction of novel candidate genes. Here, we describe a knowledge-based machine learning method for this purpose. We trained our model on protein-protein interaction data from IntAct, gene function annotation from Gene Ontology, and known disease-gene associations from DisGeNet. Using several sets of known ALS genes from public databases and a manual review as input, we generated a list of new candidate genes for each input set. We investigated the relevance of the predicted genes in ALS by using the available summary statistics from the largest ALS genome-wide association study and by performing functional and phenotype enrichment analysis. The predicted sets were enriched for genes associated with other neurodegenerative diseases known to overlap with ALS genetically and phenotypically, as well as for biological processes associated with the disease. Moreover, using ALS genes from ClinVar and our manual review as input, the predicted sets were enriched for ALS-associated genes (ClinVar p = 0.038 and manual review p = 0.060) when used for gene prioritisation in a genome-wide association study.
Affiliation(s)
- Daniel M. Bean
  - Department of Biostatistics & Health Informatics, King's College London, 16 De Crespigny Park, London SE5 8AF, UK
  - Health Data Research UK London, University College London, 16 De Crespigny Park, London SE5 8AF, UK
- Ammar Al-Chalabi
  - King's College Hospital, Bessemer Road, Denmark Hill, Brixton, London SE5 9RS, UK
  - Maurice Wohl Clinical Neuroscience Institute, Department of Basic and Clinical Neuroscience, King's College London, London, 5 Cutcombe Rd, Brixton, London SE5 9RT, UK
- Richard J. B. Dobson
  - Department of Biostatistics & Health Informatics, King's College London, 16 De Crespigny Park, London SE5 8AF, UK
  - Health Data Research UK London, University College London, 16 De Crespigny Park, London SE5 8AF, UK
  - Institute of Health Informatics, University College London, 222 Euston Rd, London NW1 2DA, UK
- Alfredo Iacoangeli
  - Department of Biostatistics & Health Informatics, King's College London, 16 De Crespigny Park, London SE5 8AF, UK
  - Maurice Wohl Clinical Neuroscience Institute, Department of Basic and Clinical Neuroscience, King's College London, London, 5 Cutcombe Rd, Brixton, London SE5 9RT, UK
|
18
|
Yang Q, Li B, Tang J, Cui X, Wang Y, Li X, Hu J, Chen Y, Xue W, Lou Y, Qiu Y, Zhu F. Consistent gene signature of schizophrenia identified by a novel feature selection strategy from comprehensive sets of transcriptomic data. Brief Bioinform 2020; 21:1058-1068. [PMID: 31157371 DOI: 10.1093/bib/bbz049] [Citation(s) in RCA: 190] [Impact Index Per Article: 38.0] [Received: 02/20/2019] [Revised: 03/11/2019] [Accepted: 03/30/2019] [Indexed: 05/16/2025] Open
Abstract
The etiology of schizophrenia (SCZ) is regarded as one of the most fundamental puzzles in current medical research, and its diagnosis is limited by the lack of objective molecular criteria. Although many studies have been conducted, the SCZ gene signatures identified by these independent studies are highly inconsistent. One of the most important factors contributing to this inconsistency is that the feature selection methods currently in use do not fully consider the reproducibility of signatures discovered from different datasets. Therefore, it is crucial to develop new bioinformatics tools based on novel strategies that ensure the stable discovery of a gene signature for SCZ. In this study, a novel feature selection strategy (1) integrating repeated random sampling with consensus scoring and (2) evaluating the consistency of gene rank among different datasets was constructed. By systematically assessing the identified SCZ signature comprising 135 differentially expressed genes, this newly constructed strategy demonstrated significantly enhanced stability and better differentiating ability compared with the feature selection methods popular in current SCZ research. Based on a first-ever assessment of the methods' reproducibility, cross-validated by independent datasets from three representative studies, the new strategy stood out among the popular methods by showing superior stability and differentiating ability. Finally, 2 novel and 17 previously reported transcription factors were identified and showed great potential in revealing the etiology of SCZ. In sum, the SCZ signature identified in this study would provide valuable clues for discovering diagnostic molecules and potential targets for SCZ.
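The idea of combining repeated random sampling with consensus scoring can be sketched as follows. This is only an illustration under stated assumptions, not the strategy used in the study: the ranking statistic (absolute difference of group means), the subsample fraction, the `top_k` cutoff, the function name `consensus_scores`, and the toy data are all hypothetical choices.

```python
import random


def consensus_scores(cases, controls, n_rounds=100, frac=0.7, top_k=1, seed=0):
    """Fraction of subsampling rounds in which each feature ranks in the top_k.

    Assumed ranking statistic: absolute difference between group means.
    """
    rng = random.Random(seed)
    n_features = len(cases[0])
    hits = [0] * n_features
    m_case = max(1, int(frac * len(cases)))
    m_ctrl = max(1, int(frac * len(controls)))

    for _ in range(n_rounds):
        # Stratified subsample so both groups are always represented.
        sub_case = rng.sample(cases, m_case)
        sub_ctrl = rng.sample(controls, m_ctrl)

        diffs = []
        for f in range(n_features):
            mean_case = sum(s[f] for s in sub_case) / m_case
            mean_ctrl = sum(s[f] for s in sub_ctrl) / m_ctrl
            diffs.append(abs(mean_case - mean_ctrl))

        ranked = sorted(range(n_features), key=lambda f: diffs[f], reverse=True)
        for f in ranked[:top_k]:
            hits[f] += 1

    return [h / n_rounds for h in hits]


# Toy data (hypothetical): feature 0 separates the groups, features 1 and 2 do not.
cases = [[5.0 + 0.1 * i, float(i % 2), float((i // 2) % 2)] for i in range(10)]
controls = [[0.1 * i, float(i % 2), float((i // 2) % 2)] for i in range(10)]

scores = consensus_scores(cases, controls)
print(scores)
```

A feature that is selected in nearly every round is stable with respect to the sampling, which is the property the consensus score is meant to capture.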
Affiliation(s)
- Qingxia Yang
  - Innovative Drug Research and Bioinformatics Group, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
  - Innovative Drug Research and Bioinformatics Group, School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
- Bo Li
  - Innovative Drug Research and Bioinformatics Group, School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
- Jing Tang
  - Innovative Drug Research and Bioinformatics Group, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
  - Innovative Drug Research and Bioinformatics Group, School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
- Xuejiao Cui
  - Innovative Drug Research and Bioinformatics Group, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
  - Innovative Drug Research and Bioinformatics Group, School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
- Yunxia Wang
  - Innovative Drug Research and Bioinformatics Group, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
  - Innovative Drug Research and Bioinformatics Group, School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
- Xiaofeng Li
  - Innovative Drug Research and Bioinformatics Group, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
  - Innovative Drug Research and Bioinformatics Group, School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
- Jie Hu
  - Innovative Drug Research and Bioinformatics Group, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Yuzong Chen
  - Bioinformatics and Drug Design Group, Department of Pharmacy, and Center for Computational Science and Engineering, National University of Singapore, Singapore, Singapore
- Weiwei Xue
  - Innovative Drug Research and Bioinformatics Group, School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
- Yan Lou
  - Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, The First Affiliated Hospital, Zhejiang University, Hangzhou, Zhejiang, China
- Yunqing Qiu
  - Zhejiang Provincial Key Laboratory for Drug Clinical Research and Evaluation, The First Affiliated Hospital, Zhejiang University, Hangzhou, Zhejiang, China
- Feng Zhu
  - Innovative Drug Research and Bioinformatics Group, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
  - Innovative Drug Research and Bioinformatics Group, School of Pharmaceutical Sciences and Collaborative Innovation Center for Brain Science, Chongqing University, Chongqing, China
|
19
|
Zhou H, Cao H, Matyunina L, Shelby M, Cassels L, McDonald JF, Skolnick J. MEDICASCY: A Machine Learning Approach for Predicting Small-Molecule Drug Side Effects, Indications, Efficacy, and Modes of Action. Mol Pharm 2020; 17:1558-1574. [PMID: 32237745 PMCID: PMC7319183 DOI: 10.1021/acs.molpharmaceut.9b01248] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/27/2022]
Abstract
To improve the drug discovery yield, a method is needed that can be applied at the beginning of drug discovery and that accurately predicts drug side effects, indications, efficacy, and mode of action based solely on the drug's chemical structure. In contrast, extant predictive methods do not comprehensively address these aspects of drug discovery and rely on features derived from extensive experimental information that is often unavailable for novel molecules. To address these issues, we developed MEDICASCY, a multilabel-based boosted random forest machine learning method that only requires the small molecule's chemical structure for the drug side effect, indication, efficacy, and probable mode of action target predictions; however, it has comparable or even significantly better performance than existing approaches requiring far more information. In retrospective benchmarking on high confidence predictions, MEDICASCY shows about 78% precision and recall for predicting at least one severe side effect and 72% precision for drug efficacy. Experimental validation of MEDICASCY's efficacy predictions on novel molecules shows close to 80% precision for the inhibition of growth in ovarian, breast, and prostate cancer cell lines. Thus, MEDICASCY should improve the success rate for new drug approval. A web service for academic users is available at http://pwp.gatech.edu/cssb/MEDICASCY.
Affiliation(s)
- Hongyi Zhou
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, 950 Atlantic Drive, N.W., Atlanta, GA 30332
- Hongnan Cao
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, 950 Atlantic Drive, N.W., Atlanta, GA 30332
- Lilya Matyunina
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, 30332-0230, USA
- Madelyn Shelby
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, 30332-0230, USA
- Lauren Cassels
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, 30332-0230, USA
- John F. McDonald
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, 30332-0230, USA
- Jeffrey Skolnick
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, 950 Atlantic Drive, N.W., Atlanta, GA 30332
|
20
|
Ni P, Wang J, Zhong P, Li Y, Wu FX, Pan Y. Constructing Disease Similarity Networks Based on Disease Module Theory. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:906-915. [PMID: 29993782 DOI: 10.1109/tcbb.2018.2817624] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Indexed: 06/08/2023]
Abstract
Quantifying the associations between diseases now plays an important role in modern biology and medicine. Discovering associations between diseases could help us gain deeper insights into the pathogenic mechanisms of complex diseases, and thus could lead to improvements in disease diagnosis, drug repositioning, and drug development. Due to the growing body of high-throughput biological data, a number of methods have been developed for computing similarity between diseases during the past decade. However, these methods rarely consider the interconnections of genes related to each disease in the protein-protein interaction network (PPIN). Recently, the disease module theory has been proposed, which states that disease-related genes or proteins tend to interact with each other in the same neighborhood of a PPIN. In this study, we propose a new method called ModuleSim to measure associations between diseases by using disease-gene association data and PPIN data based on disease module theory. The experimental results show that by considering the interactions between disease modules and their modularity, the disease similarity calculated by ModuleSim has a significant correlation with the disease classification of Disease Ontology (DO). Furthermore, ModuleSim outperforms four other popular methods, all of which use disease-gene association data and PPIN data to measure disease-disease associations. In addition, the disease similarity network constructed by ModuleSim suggests that ModuleSim is capable of finding potential associations between diseases.
|
21
|
Cao H, Jin M, Gao M, Zhou H, Tao YJ, Skolnick J. Differential kinase activity of ACVR1 G328V and R206H mutations with implications to possible TβRI cross-talk in diffuse intrinsic pontine glioma. Sci Rep 2020; 10:6140. [PMID: 32273545 PMCID: PMC7145857 DOI: 10.1038/s41598-020-63061-0] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 10/29/2019] [Accepted: 03/19/2020] [Indexed: 01/17/2023] Open
Abstract
Diffuse intrinsic pontine glioma (DIPG) is a lethal pediatric brain cancer whose median survival time is under one year. The possible roles of the two most common DIPG associated cytoplasmic ACVR1 receptor kinase domain mutants, G328V and R206H, are reexamined in the context of new biochemical results regarding their intrinsic relative ATPase activities. At 37 °C, the G328V mutant displays a 1.8-fold increase in intrinsic kinase activity over wild-type, whereas the R206H mutant shows similar activity. The higher G328V mutant intrinsic kinase activity is consistent with the statistically significant longer overall survival times of DIPG patients harboring ACVR1 G328V tumors. Based on the potential cross-talk between ACVR1 and TβRI pathways and known and predicted off-targets of ACVR1 inhibitors, we further validated the inhibition effects of several TβRI inhibitors on ACVR1 wild-type and G328V mutant patient tumor derived DIPG cell lines at 20–50 µM doses. SU-DIPG-IV cells harboring the histone H3.1K27M and activating ACVR1 G328V mutations appeared to be less susceptible to TβRI inhibition than SF8628 cells harboring the H3.3K27M mutation and wild-type ACVR1. Thus, inhibition of hidden oncogenic signaling pathways in DIPG such as TβRI that are not limited to ACVR1 itself may provide alternative entry points for DIPG therapeutics.
Affiliation(s)
- Hongnan Cao
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, 950 Atlantic Drive, NW, Atlanta, Georgia, 30332, United States
- Miao Jin
  - Department of BioSciences, Rice University, Houston, Texas, 77005, United States
- Mu Gao
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, 950 Atlantic Drive, NW, Atlanta, Georgia, 30332, United States
- Hongyi Zhou
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, 950 Atlantic Drive, NW, Atlanta, Georgia, 30332, United States
- Yizhi Jane Tao
  - Department of BioSciences, Rice University, Houston, Texas, 77005, United States
- Jeffrey Skolnick
  - Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, 950 Atlantic Drive, NW, Atlanta, Georgia, 30332, United States
|
22
|
Wang S, Wang W, Wang W, Xia P, Yu L, Lu Y, Chen X, Xu C, Liu H. Context-Specific Coordinately Regulatory Network Prioritize Breast Cancer Genetic Risk Factors. Front Genet 2020; 11:255. [PMID: 32273883 PMCID: PMC7113376 DOI: 10.3389/fgene.2020.00255] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 11/28/2019] [Accepted: 03/03/2020] [Indexed: 12/16/2022] Open
Abstract
Breast cancer (BC) is one of the most common tumors and a leading cause of cancer death in women. However, the pathogenesis of BC remains unclear, and the atlas of BC-associated risk factors is far from complete. In this study, we constructed a BC-specific coordinately regulatory network (CRN) to prioritize potential BC-associated protein-coding genes (PCGs) and non-coding RNAs (ncRNAs). We integrated transcriptome data from 813 BC samples in The Cancer Genome Atlas (TCGA) with eight types of regulatory relationships to construct a BC-specific CRN, including 387 transcription factors (TFs), 174 microRNAs (miRNAs), 407 long non-coding RNAs (lncRNAs), and 905 PCGs. After that, the random walk with restart (RWR) method was performed on the CRN by using the known BC-associated factors as seeds, and potential BC-associated risk factors were prioritized. Leave-one-out cross-validation (LOOCV) was performed on the BC-specific CRN and achieved an area under the curve (AUC) of 0.92. The performances of the common CRN, the common protein-protein interaction (PPI) network, and the BC-specific PPI network were also evaluated, demonstrating that the context-specific CRN prioritizes BC risk factors. Functional analysis of the top 100-ranked risk factors in the candidate list revealed that these factors were significantly enriched in cancer-related functions and had significant semantic similarity with BC-related gene ontology (GO) terms. Differential expression analysis and survival analysis showed that the prioritized risk factors were significantly associated with BC progression and prognosis. In total, we provided a computational method to predict reliable BC-associated risk factors, which would help improve the understanding of the pathology of BC and benefit disease diagnosis and prognosis.
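The random walk with restart step mentioned in the abstract above can be sketched in a few lines. This is a generic, minimal version of the algorithm on a toy graph, not the authors' pipeline; the function name, the restart probability of 0.3, the convergence tolerance, and the four-node network are assumptions made for illustration.

```python
def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Iterate p <- (1 - r) * W p + r * p0 until the L1 change falls below tol.

    adj   : symmetric adjacency matrix as a list of lists
    seeds : indices of nodes known to be associated with the disease
    """
    n = len(adj)
    # Column-normalise the adjacency matrix so each column sums to 1.
    # Nodes with no edges keep an all-zero column (their probability mass is lost).
    col_sums = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    W = [
        [adj[i][j] / col_sums[j] if col_sums[j] else 0.0 for j in range(n)]
        for i in range(n)
    ]

    # Restart distribution: uniform over the seed nodes.
    p0 = [0.0] * n
    for s in seeds:
        p0[s] = 1.0 / len(seeds)

    p = p0[:]
    for _ in range(max_iter):
        nxt = [
            (1 - restart) * sum(W[i][j] * p[j] for j in range(n)) + restart * p0[i]
            for i in range(n)
        ]
        converged = sum(abs(a - b) for a, b in zip(nxt, p)) < tol
        p = nxt
        if converged:
            break
    return p


# Toy network (hypothetical): a path 0 - 1 - 2 - 3 with node 0 as the only seed.
adjacency = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
rwr_scores = random_walk_with_restart(adjacency, seeds=[0])
ranking = sorted(range(len(rwr_scores)), key=lambda i: rwr_scores[i], reverse=True)
print(ranking)
```

Candidates are then prioritized by their steady-state score, so nodes closer to the seeds in the network rank higher; in this toy path graph the ranking follows the distance from node 0.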
Collapse
Affiliation(s)
- Shuyuan Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Wencan Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Weida Wang
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Peng Xia
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Lei Yu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Ye Lu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Xiaowen Chen
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Chaohan Xu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| | - Hui Liu
- College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
| |
|
23
|
Guan X, Runger G, Liu L. Dynamic incorporation of prior knowledge from multiple domains in biomarker discovery. BMC Bioinformatics 2020; 21:77. [PMID: 32164534 PMCID: PMC7068914 DOI: 10.1186/s12859-020-3344-x] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Indexed: 12/22/2022] Open
Abstract
Background In biomarker discovery, applying domain knowledge is an effective approach to eliminating false positive features, prioritizing functionally impactful markers and facilitating the interpretation of predictive signatures. Several computational methods have been developed that formulate knowledge-based biomarker discovery as a feature selection problem guided by prior information. These methods often require that prior information be encoded as a single score, and the algorithms are optimized for biological knowledge of a specific type. In practice, however, domain knowledge from diverse resources can provide complementary information, yet no current method can integrate heterogeneous prior information for biomarker discovery. To address this problem, we developed the Know-GRRF (know-guided regularized random forest) method, which enables dynamic incorporation of domain knowledge from multiple disciplines to guide feature selection. Results Know-GRRF embeds domain knowledge in a regularized random forest framework. It combines prior information from multiple domains in a linear model to derive a composite score, which, together with other tuning parameters, controls the regularization of the random forest model. Know-GRRF concurrently optimizes the weight given to each type of domain knowledge and the other tuning parameters to minimize the AIC of out-of-bag predictions. The objective is to select a compact feature subset that has high discriminative power and strong functional relevance to the biological phenotype. Via rigorous simulations, we show that Know-GRRF guided by multiple-domain prior information outperforms feature selection methods guided by single-domain prior information or no prior information. We then applied Know-GRRF to a real-world study to identify prognostic biomarkers of prostate cancer. We evaluated the combination of cancer-related gene annotations, evolutionary conservation and pre-computed statistical scores as the prior knowledge to assemble a panel of biomarkers. We discovered a compact set of biomarkers with significant improvements in prediction accuracy. Conclusions Know-GRRF is a powerful novel method to incorporate knowledge from multiple domains for feature selection. It has a broad range of applications in biomarker discovery. We implemented this method and released a KnowGRRF package in the R/CRAN archive.
Affiliation(s)
- Xin Guan
- College of Health Solutions, Arizona State University, Phoenix, AZ, 85004, USA; Intel Corporation, Chandler, AZ, 85226, USA
| | - George Runger
- College of Health Solutions, Arizona State University, Phoenix, AZ, 85004, USA
| | - Li Liu
- College of Health Solutions, Arizona State University, Phoenix, AZ, 85004, USA; Biodesign Institute, Arizona State University, Tempe, AZ, 85287, USA; Department of Neurology, Mayo Clinic, Scottsdale, AZ, 85259, USA
| |
|
24
|
Association extraction from biomedical literature based on representation and transfer learning. J Theor Biol 2020; 488:110112. [DOI: 10.1016/j.jtbi.2019.110112] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 07/11/2019] [Accepted: 12/08/2019] [Indexed: 12/17/2022]
|
25
|
Yang K, Wang R, Liu G, Shu Z, Wang N, Zhang R, Yu J, Chen J, Li X, Zhou X. HerGePred: Heterogeneous Network Embedding Representation for Disease Gene Prediction. IEEE J Biomed Health Inform 2020; 23:1805-1815. [PMID: 31283472 DOI: 10.1109/jbhi.2018.2870728] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Indexed: 11/06/2022]
Abstract
The discovery of disease-causing genes is a critical step towards understanding the nature of a disease and determining a possible cure for it. In recent years, many computational methods to identify disease genes have been proposed. However, making full use of disease-related (e.g., symptoms) and gene-related (e.g., gene ontology and protein-protein interactions) information to improve the performance of disease gene prediction is still an issue. Here, we develop a heterogeneous disease-gene-related network (HDGN) embedding representation framework for disease gene prediction (called HerGePred). Based on this framework, a low-dimensional vector representation (LVR) of the nodes in the HDGN can be obtained. Then, we propose two specific algorithms, namely, an LVR-based similarity prediction and a random walk with restart on a reconstructed heterogeneous disease-gene network (RW-RDGN), to predict disease genes with high performance. First, to validate the rationality of the framework, we analyze the similarity-based overlap distribution of disease pairs and design an experiment for disease-gene association recovery, the results of which revealed that the LVR of nodes performs well at preserving the local and global network structure of the HDGN. Then, we apply tenfold cross validation and external validation to compare our methods with other well-known disease gene prediction algorithms. The experimental results show that the RW-RDGN performs better than the state-of-the-art algorithm. The prediction results of disease candidate genes are essential for molecular mechanism investigation and experimental validation. The source codes of HerGePred and experimental data are available at https://github.com/yangkuoone/HerGePred.
|
26
|
A novel one-class classification approach to accurately predict disease-gene association in acute myeloid leukemia cancer. PLoS One 2019; 14:e0226115. [PMID: 31825992 PMCID: PMC6905554 DOI: 10.1371/journal.pone.0226115] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 08/05/2019] [Accepted: 11/19/2019] [Indexed: 01/02/2023] Open
Abstract
Disease-causing gene identification is considered an important step towards drug design and drug discovery. In disease gene identification and classification, the main aim is to identify disease genes, while identifying non-disease genes is of little or no significance. Hence, this task can be defined as a one-class classification problem. Existing machine learning methods typically take known disease genes as the positive training set and unknown genes as negative samples to build a binary classification model. Here we propose a new One-class Classification Support Vector Machines (OCSVM) method to precisely classify candidate disease genes. Our aim is to build a model that concentrates on detecting known disease-causing genes to increase sensitivity and precision. We investigate the impact of our proposed model using a benchmark consisting of a gene expression dataset for acute myeloid leukemia (AML). Compared with traditional methods, our experimental results show the superiority of our proposed method in terms of precision, recall, and F-measure for detecting disease-causing genes for AML. OCSVM code and our extracted AML benchmark are publicly available at: https://github.com/imandehzangi/OCSVM.
|
27
|
Alshahrani M, Hoehndorf R. Semantic Disease Gene Embeddings (SmuDGE): phenotype-based disease gene prioritization without phenotypes. Bioinformatics 2019; 34:i901-i907. [PMID: 30423077 PMCID: PMC6129260 DOI: 10.1093/bioinformatics/bty559] [Citation(s) in RCA: 34] [Impact Index Per Article: 5.7] [Indexed: 01/09/2023] Open
Abstract
Motivation In the past years, several methods have been developed to incorporate information about phenotypes into computational disease gene prioritization methods. These methods commonly compute the similarity between a disease’s (or patient’s) phenotypes and a database of gene-to-phenotype associations to find the phenotypically most similar match. A key limitation of these methods is their reliance on knowledge about phenotypes associated with particular genes which is highly incomplete in humans as well as in many model organisms such as the mouse. Results We developed SmuDGE, a method that uses feature learning to generate vector-based representations of phenotypes associated with an entity. SmuDGE can be used as a trainable semantic similarity measure to compare two sets of phenotypes (such as between a disease and gene, or a disease and patient). More importantly, SmuDGE can generate phenotype representations for entities that are only indirectly associated with phenotypes through an interaction network; for this purpose, SmuDGE exploits background knowledge in interaction networks comprised of multiple types of interactions. We demonstrate that SmuDGE can match or outperform semantic similarity in phenotype-based disease gene prioritization, and furthermore significantly extends the coverage of phenotype-based methods to all genes in a connected interaction network. Availability and implementation https://github.com/bio-ontology-research-group/SmuDGE
Affiliation(s)
- Mona Alshahrani
- Computer, Electrical and Mathematical Sciences and Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Robert Hoehndorf
- Computer, Electrical and Mathematical Sciences and Engineering Division, Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| |
|
28
|
Abstract
Multifunctional genes are important genes because of their essential roles in human cells. Studying and analyzing multifunctional genes can help understand disease mechanisms and drug discovery. We propose a computational method for scoring gene multifunctionality based on functional annotations of the target gene from the Gene Ontology. The method is based on identifying pairs of GO annotations that represent semantically different biological functions and any gene annotated with two annotations from one pair is considered multifunctional. The proposed method can be employed to identify multifunctional genes in the entire human genome using solely the GO annotations. We evaluated the proposed method in scoring multifunctionality of all human genes using four criteria: gene-disease associations; protein-protein interactions; gene studies with PubMed publications; and published known multifunctional gene sets. The evaluation results confirm the validity and reliability of the proposed method for identifying multifunctional human genes. The results across all four evaluation criteria were statistically significant in determining multifunctionality. For example, the method confirmed that multifunctional genes tend to be associated with diseases more than other genes, with significance [Formula: see text]. Moreover, consistent with all previous studies, proteins encoded by multifunctional genes, based on our method, are involved in protein-protein interactions significantly more ([Formula: see text]) than other proteins.
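The pair-based rule in this abstract can be sketched in a few lines. This is a minimal illustration, not the published scoring function: using the number of fully covered term pairs as the score is an assumption, and the gene names and GO term labels below are placeholders rather than real annotations.

```python
def multifunctionality_score(gene_terms, distinct_pairs):
    """Count the semantically different GO term pairs fully covered by a gene.

    gene_terms     : iterable of GO term labels annotated to the gene.
    distinct_pairs : iterable of (term_a, term_b) tuples judged to represent
                     semantically different biological functions.
    """
    terms = set(gene_terms)
    return sum(1 for a, b in distinct_pairs if a in terms and b in terms)


def is_multifunctional(gene_terms, distinct_pairs):
    # A gene carrying both annotations of at least one pair qualifies.
    return multifunctionality_score(gene_terms, distinct_pairs) > 0


# Placeholder labels, not real GO identifiers.
pairs = [("GO:term_A", "GO:term_B"), ("GO:term_C", "GO:term_D")]
annotations = {
    "gene_1": ["GO:term_A", "GO:term_B", "GO:term_C"],
    "gene_2": ["GO:term_A", "GO:term_C"],
}
flags = {g: is_multifunctional(t, pairs) for g, t in annotations.items()}
print(flags)
```

Here `gene_1` covers the first pair and is flagged, while `gene_2` holds one term from each pair and is not.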
Affiliation(s)
- Hisham Al-Mubaid
- 1 Computer Science Department, University of Houston-Clear Lake, Houston, TX 77062, USA
| |
|
29
|
Failli M, Paananen J, Fortino V. Prioritizing target-disease associations with novel safety and efficacy scoring methods. Sci Rep 2019; 9:9852. [PMID: 31285471 PMCID: PMC6614395 DOI: 10.1038/s41598-019-46293-7] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 12/21/2018] [Accepted: 06/25/2019] [Indexed: 01/24/2023] Open
Abstract
Biological target (commonly genes or proteins) identification is still largely a manual process, where experts manually try to collect and combine information from hundreds of data sources, ranging from scientific publications to omics databases. Targeting the wrong gene or protein will lead to failure of the drug development process, as well as incur delays and costs. To improve this process, different software platforms are being developed. These platforms rely strongly on efficacy estimates based on target-disease association scores created by computational methods for drug target prioritization. Here novel computational methods are presented to more accurately evaluate the efficacy and safety of potential drug targets. The proposed efficacy scores utilize existing gene expression data and tissue/disease specific networks to improve the inference of target-disease associations. Conversely, safety scores enable the identification of genes that are essential, potentially susceptible to adverse effects or carcinogenic. Benchmark results demonstrate that our transcriptome-based methods for drug target prioritization can increase the true positive rate of target-disease associations. Additionally, the proposed safety evaluation system enables accurate predictions of targets of withdrawn drugs and targets of drug trials prematurely discontinued.
Affiliation(s)
- Mario Failli
- Institute of Biomedicine, University of Eastern Finland, Kuopio, Finland
| | - Jussi Paananen
- Institute of Biomedicine, University of Eastern Finland, Kuopio, Finland
| | - Vittorio Fortino
- Institute of Biomedicine, University of Eastern Finland, Kuopio, Finland.
| |
|
30
|
|
31
|
Luo P, Li Y, Tian LP, Wu FX. Enhancing the prediction of disease–gene associations with multimodal deep learning. Bioinformatics 2019; 35:3735-3742. [DOI: 10.1093/bioinformatics/btz155] [Citation(s) in RCA: 33] [Impact Index Per Article: 5.5] [Received: 10/29/2018] [Revised: 02/11/2019] [Accepted: 02/27/2019] [Indexed: 12/20/2022] Open
Abstract
Motivation
Computationally predicting disease genes helps scientists optimize the in-depth experimental validation and accelerates the identification of real disease-associated genes. Modern high-throughput technologies have generated a vast amount of omics data, and integrating them is expected to improve the accuracy of computational prediction. As an integrative model, multimodal deep belief net (DBN) can capture cross-modality features from heterogeneous datasets to model a complex system. Studies have shown its power in image classification and tumor subtype prediction. However, multimodal DBN has not been used in predicting disease–gene associations.
Results
In this study, we propose a method to predict disease–gene associations by multimodal DBN (dgMDL). Specifically, latent representations of protein-protein interaction networks and gene ontology terms are first learned by two DBNs independently. Then, a joint DBN is used to learn cross-modality representations from the two sub-models by taking the concatenation of their obtained latent representations as the multimodal input. Finally, disease–gene associations are predicted with the learned cross-modality representations. The proposed method is compared with two state-of-the-art algorithms in terms of 5-fold cross-validation on a set of curated disease–gene associations. dgMDL achieves an AUC of 0.969 which is superior to the competing algorithms. Further analysis of the top-10 unknown disease–gene pairs also demonstrates the ability of dgMDL in predicting new disease–gene associations.
Availability and implementation
Prediction results and a reference implementation of dgMDL in Python are available at https://github.com/luoping1004/dgMDL.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ping Luo
- Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, Canada
| | - Yuanyuan Li
- Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, Canada
- School of Mathematics and Physics, Wuhan Institute of Technology, Wuhan, China
| | - Li-Ping Tian
- School of Information, Beijing Wuzi University, Beijing, China
| | - Fang-Xiang Wu
- Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, Canada
- Department of Mechanical Engineering, University of Saskatchewan, Saskatoon, Canada
- Department of Computer Science, University of Saskatchewan, Saskatoon, Canada
| |
|
32
|
Yang K, Wang N, Liu G, Wang R, Yu J, Zhang R, Chen J, Zhou X. Heterogeneous network embedding for identifying symptom candidate genes. J Am Med Inform Assoc 2018; 25:1452-1459. [PMID: 30357378 PMCID: PMC7646926 DOI: 10.1093/jamia/ocy117] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 03/08/2018] [Revised: 07/24/2018] [Accepted: 08/11/2018] [Indexed: 11/12/2022] Open
Abstract
Objective Investigating the molecular mechanisms of symptoms is a vital task in precision medicine to refine disease taxonomy and improve the personalized management of chronic diseases. Although there are abundant experimental studies and computational efforts to obtain the candidate genes of diseases, the identification of symptom genes is rarely addressed. We curated a high-quality benchmark dataset of symptom-gene associations and proposed a heterogeneous network embedding for identifying symptom genes. Methods We proposed a heterogeneous network embedding representation algorithm, which constructed a heterogeneous symptom-related network that integrated symptom-related associations and applied an embedding representation algorithm to obtain the low-dimensional vector representation of nodes. By measuring the relevance between symptoms and genes via calculating the similarities of their vectors, the candidate genes of given symptoms can be obtained. Results A benchmark dataset of 18 270 symptom-gene associations between 505 symptoms and 4549 genes was curated. We compared our method to baseline algorithms (FSGER and PRINCE). The experimental results indicated our algorithm achieved a significant improvement over the state-of-the-art method, with precision and recall improved by 66.80% (0.844 vs 0.506) and 53.96% (0.311 vs 0.202), respectively, for TOP@3 and association precision improved by 37.71% (0.723 vs 0.525) over the PRINCE. Conclusions The experimental validation of the algorithms and the literature validation of typical symptoms indicated our method achieved excellent performance. Hence, we curated a prediction dataset of 17 479 symptom-candidate genes. The benchmark and prediction datasets have the potential to promote investigations of the molecular mechanisms of symptoms and provide candidate genes for validation in experimental settings.
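The similarity-based ranking step in the Methods can be sketched as follows. The abstract does not say which similarity measure is used, so cosine similarity is an assumption here, and the two-dimensional vectors and the names `g1`, `g2`, `g3` are toy placeholders rather than learned embeddings.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank_genes(symptom_vec, gene_vecs, top_k=3):
    """Return the top_k gene names ordered by similarity to the symptom vector."""
    ordered = sorted(
        gene_vecs,
        key=lambda gene: cosine(symptom_vec, gene_vecs[gene]),
        reverse=True,
    )
    return ordered[:top_k]


# Toy low-dimensional vectors standing in for learned node embeddings.
symptom = [1.0, 0.0]
genes = {"g1": [0.9, 0.1], "g2": [0.0, 1.0], "g3": [0.5, 0.5]}
top3 = rank_genes(symptom, genes, top_k=3)
print(top3)
```

A TOP@3 evaluation of the kind reported in the Results would compare such a top-three list per symptom against the curated benchmark associations.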
Affiliation(s)
- Kuo Yang
- School of Computer and Information Technology and Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China
| | - Ning Wang
- School of Computer and Information Technology and Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China
| | - Guangming Liu
- School of Computer and Information Technology and Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China
| | - Ruyu Wang
- School of Computer and Information Technology and Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China
| | - Jian Yu
- School of Computer and Information Technology and Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China
| | - Runshun Zhang
- Guanganmen Hospital, China Academy of Chinese Medical Sciences, Beijing, China
| | - Jianxin Chen
- Beijing University of Chinese Medicine, Beijing, China
| | - Xuezhong Zhou
- School of Computer and Information Technology and Beijing Key Laboratory of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China
- Data Center of Traditional Chinese Medicine, China Academy of Chinese Medical Sciences, Beijing, China
| |
|
33
|
Bhasuran B, Natarajan J. Automatic extraction of gene-disease associations from literature using joint ensemble learning. PLoS One 2018; 13:e0200699. [PMID: 30048465 PMCID: PMC6061985 DOI: 10.1371/journal.pone.0200699] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Received: 01/30/2018] [Accepted: 07/02/2018] [Indexed: 12/26/2022] Open
Abstract
A wealth of knowledge concerning relations between genes and their associated diseases is present in the biomedical literature. Mining these biological associations from the literature can provide immense support to research ranging from drug-targetable pathways to biomarker discovery. However, the time and cost of manual curation slow this work considerably. Biomedical text mining is therefore a crucial technology, and relation extraction shows promising results for exploring genes associated with diseases. We addressed this problem from a text mining perspective by developing automatic extraction of gene-disease associations from the literature using joint ensemble learning. In the proposed work, we employ a supervised machine learning approach in which a rich feature set covering conceptual, syntactic and semantic properties, jointly learned with word embeddings, is used to train an ensemble support vector machine for extracting gene-disease relations from four gold standard corpora. On evaluation, the approach shows promising F-measures of 85.34%, 83.93%, 87.39% and 85.57% on the EUADR, GAD, CoMAGC and PolySearch corpora, respectively. We strongly believe that the presented approach, which combines a rich syntactic and semantic feature set with domain-specific word embeddings through ensemble support vector machines and is evaluated on four gold standard corpora, can act as a new baseline for future work in gene-disease relation extraction from the literature.
Affiliation(s)
- Balu Bhasuran
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India
| | - Jeyakumar Natarajan
- DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore, Tamilnadu, India
- Data mining and Text mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore, Tamilnadu, India
| |
|
34
|
Zhou H, Gao M, Skolnick J. ENTPRISE-X: Predicting disease-associated frameshift and nonsense mutations. PLoS One 2018; 13:e0196849. [PMID: 29723276 PMCID: PMC5933770 DOI: 10.1371/journal.pone.0196849] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 11/14/2017] [Accepted: 04/20/2018] [Indexed: 01/11/2023] Open
Abstract
To exploit the plethora of information provided by Next Generation Sequencing, identifying the genetic mutations responsible for disease in general, or cancer in particular, among the thousands of neutral germline or somatic variations is a crucial task. Genome-wide association studies for the detection of disease-associated genes or cancer drivers can only identify common variations or driver genes in a cohort of patients. Thus, they cannot discover unique disease-associated mutations or cancer driver genes on a personal basis. Moreover, even when there are such common variations, their significance is unknown. Here, we extend ENTPRISE, a machine learning based approach developed for predicting the disease association of missense mutations, to frameshift and nonsense mutations. The new approach, ENTPRISE-X, is shown to outperform the state-of-the-art methods VEST-indel and DDIG-in for predicting the disease association of germline frameshift mutations in terms of the balanced Matthews correlation coefficient (MCC), with an MCC of 0.586 for ENTPRISE-X versus 0.412 for VEST-indel and 0.321 for DDIG-in. Large-scale testing on the ExAC dataset shows that ENTPRISE-X classifies a much lower fraction of variations (16%) as disease-causing, compared to 26% for VEST-indel and 65% for DDIG-in. A web server for ENTPRISE-X is freely available for academic users at http://cssb2.biology.gatech.edu/entprise-x.
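For readers unfamiliar with the metric used in this comparison, the Matthews correlation coefficient can be computed directly from a binary confusion matrix. The sketch below uses the standard definition; the confusion-matrix counts are invented for illustration and are not taken from the paper.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts.

    Returns 0.0 when any marginal total is zero, a common convention for the
    otherwise undefined case.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Invented counts: 40 true positives, 45 true negatives,
# 5 false positives, 10 false negatives.
example = mcc(tp=40, tn=45, fp=5, fn=10)
print(round(example, 3))
```

The value ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction), and it remains informative when the two classes are of unequal size, which is why it is described as a balanced measure.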
Affiliation(s)
- Hongyi Zhou
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, United States of America
| | - Mu Gao
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, United States of America
| | - Jeffrey Skolnick
- Center for the Study of Systems Biology, School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, United States of America
| |
|
35
|
Hayashi Y, Oishi T, Shirotori K, Marumo Y, Kosugi A, Kumada S, Hirai D, Takayama K, Onuki Y. Modeling of quantitative relationships between physicochemical properties of active pharmaceutical ingredients and tensile strength of tablets using a boosted tree. Drug Dev Ind Pharm 2018; 44:1090-1098. [DOI: 10.1080/03639045.2018.1434195] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Indexed: 10/18/2022]
Affiliation(s)
- Yoshihiro Hayashi
- Department of Pharmaceutical Technology, Graduate School of Medicine and Pharmaceutical Science for Research, University of Toyama, Toyama-shi, Japan
| | - Takuya Oishi
- Department of Pharmaceutical Technology, Graduate School of Medicine and Pharmaceutical Science for Research, University of Toyama, Toyama-shi, Japan
| | - Kaede Shirotori
- Department of Pharmaceutical Technology, Graduate School of Medicine and Pharmaceutical Science for Research, University of Toyama, Toyama-shi, Japan
| | - Yuki Marumo
- Department of Pharmaceutical Technology, Graduate School of Medicine and Pharmaceutical Science for Research, University of Toyama, Toyama-shi, Japan
| | - Atsushi Kosugi
- Formulation Development Department, Development and Planning Division, Nichi-Iko Pharmaceutical Co., Ltd., Namerikawa-shi, Japan
| | - Shungo Kumada
- Formulation Development Department, Development and Planning Division, Nichi-Iko Pharmaceutical Co., Ltd., Namerikawa-shi, Japan
| | - Daijiro Hirai
- Formulation Development Department, Development and Planning Division, Nichi-Iko Pharmaceutical Co., Ltd., Namerikawa-shi, Japan
| | - Kozo Takayama
- Faculty of Pharmacy and Pharmaceutical Sciences, Josai University, Sakado, Japan
| | - Yoshinori Onuki
- Department of Pharmaceutical Technology, Graduate School of Medicine and Pharmaceutical Science for Research, University of Toyama, Toyama-shi, Japan
| |
|
36
|
Agrawal M, Zitnik M, Leskovec J. Large-scale analysis of disease pathways in the human interactome. Pacific Symposium on Biocomputing 2018; 23:111-122. [PMID: 29218874 PMCID: PMC5731453] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Discovering disease pathways, which can be defined as sets of proteins associated with a given disease, is an important problem that has the potential to provide clinically actionable insights for disease diagnosis, prognosis, and treatment. Computational methods aid the discovery by relying on protein-protein interaction (PPI) networks. They start with a few known disease-associated proteins and aim to find the rest of the pathway by exploring the PPI network around the known disease proteins. However, the success of such methods has been limited, and failure cases have not been well understood. Here we study the PPI network structure of 519 disease pathways. We find that 90% of pathways do not correspond to single well-connected components in the PPI network. Instead, proteins associated with a single disease tend to form many separate connected components/regions in the network. We then evaluate state-of-the-art disease pathway discovery methods and show that their performance is especially poor on diseases with disconnected pathways. Thus, we conclude that network connectivity structure alone may not be sufficient for disease pathway discovery. However, we show that higher-order network structures, such as small subgraphs of the pathway, provide a promising direction for the development of new methods.
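The fragmentation finding in this abstract rests on counting connected components among a pathway's proteins. A minimal sketch of that count is given below; it assumes components are taken in the subgraph induced by the pathway proteins, and the toy PPI network and protein names (`P1` to `P4`, `X`) are hypothetical.

```python
from collections import deque


def pathway_components(ppi, pathway):
    """Connected components of the PPI subgraph induced by pathway proteins.

    ppi     : dict mapping each protein to the set of its interaction partners.
    pathway : iterable of proteins associated with one disease.
    """
    members = set(pathway)
    seen, components = set(), []
    for start in sorted(members):
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:  # breadth-first search restricted to pathway members
            protein = queue.popleft()
            component.add(protein)
            for partner in ppi.get(protein, ()):
                if partner in members and partner not in seen:
                    seen.add(partner)
                    queue.append(partner)
        components.append(component)
    return components


# Toy network: P3 reaches P2 only through X, which is not in the pathway.
toy_ppi = {
    "P1": {"P2"},
    "P2": {"P1", "X"},
    "X": {"P2", "P3"},
    "P3": {"X"},
    "P4": set(),
}
parts = pathway_components(toy_ppi, ["P1", "P2", "P3", "P4"])
print(len(parts), max(parts, key=len))
```

In this toy case the pathway splits into three components even though two of them are linked through a non-pathway protein, which is the kind of situation in which methods that expand only around directly connected seeds tend to miss pathway members.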
Affiliation(s)
- Monica Agrawal
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | | | | |
|
37
|
Frasca M. Gene2DisCo: Gene to disease using disease commonalities. Artif Intell Med 2017; 82:34-46. [DOI: 10.1016/j.artmed.2017.08.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Received: 05/13/2017] [Revised: 07/24/2017] [Accepted: 08/13/2017] [Indexed: 01/10/2023]
|
38
|
Tian Z, Guo M, Wang C, Xing L, Wang L, Zhang Y. Constructing an integrated gene similarity network for the identification of disease genes. J Biomed Semantics 2017; 8:32. [PMID: 29297379 PMCID: PMC5763299 DOI: 10.1186/s13326-017-0141-1] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND Discovering novel genes that are involved in human diseases is a challenging task in biomedical research. In recent years, several computational approaches have been proposed to prioritize candidate disease genes. Most of these methods are based on protein-protein interaction (PPI) networks. However, since these PPI networks contain false positives and cover less than half of known human genes, their reliability and coverage are very low. Therefore, it is highly necessary to fuse multiple genomic data to construct a credible gene similarity network and then infer disease genes on the whole genomic scale. RESULTS We proposed a novel method, named RWRB, to infer causal genes of diseases of interest. First, we construct five individual gene (protein) similarity networks based on multiple genomic data of human genes. Then, an integrated gene similarity network (IGSN) is reconstructed using the similarity network fusion (SNF) method. Finally, we employ the random walk with restart algorithm on the phenotype-gene bilayer network, which combines the phenotype similarity network, the IGSN and the phenotype-gene association network, to prioritize candidate disease genes. We investigate the effectiveness of RWRB through leave-one-out cross-validation in inferring phenotype-gene relationships. Results show that RWRB is more accurate than state-of-the-art methods on most evaluation metrics. Further analysis shows that the success of RWRB benefits from the IGSN, which has wider coverage and higher reliability than current PPI networks. Moreover, we conduct a comprehensive case study for Alzheimer's disease and predict some novel disease genes that are supported by the literature. CONCLUSIONS RWRB is an effective and reliable algorithm for prioritizing candidate disease genes on the genomic scale. Software and supplementary information are available at http://nclab.hit.edu.cn/~tianzhen/RWRB/.
Affiliation(s)
- Zhen Tian
- School of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001 People’s Republic of China
| | - Maozu Guo
- School of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001 People’s Republic of China
| | - Chunyu Wang
- School of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001 People’s Republic of China
| | - LinLin Xing
- School of Computer Science and Engineering, Harbin Institute of Technology, Harbin, 150001 People’s Republic of China
| | - Lei Wang
- Institute of Health Service and Medical Information Academy of Military Medical Sciences Beijing, Beijing, 100850 China
| | - Yin Zhang
- Institute of Health Service and Medical Information Academy of Military Medical Sciences Beijing, Beijing, 100850 China
| |
|