1
Adamek L, Padiasek G, Zhang C, O'Dwyer I, Capit N, Dormont F, Hernandez R, Bar-Joseph Z, Rufino B. Identifying indications for novel drugs using electronic health records. Comput Biol Med 2024; 183:109158. [PMID: 39437603 DOI: 10.1016/j.compbiomed.2024.109158] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Revised: 09/12/2024] [Accepted: 09/13/2024] [Indexed: 10/25/2024]
Abstract
OBJECTIVE Computational drug re-purposing has received a lot of attention in the past decade. However, methods developed to date have focused on established compounds for which information on both successfully treated patients and chemical and genomic impact is known. Such information does not always exist for first-in-class drugs under development. METHODS To identify indications (diseases) for drugs under development, we extended and tested several unsupervised computational methods that utilize Electronic Health Record (EHR) data. RESULTS We tested the methods on known drugs with multiple indications and show that a variant of matrix factorization leads to the best performance for first-in-line drugs, improving upon prior methods that were developed for established drugs. The method also identifies novel predictions for key immunology and oncology drugs. Our results show that the performance of re-purposing methods differs greatly between oncology and inflammation/immunology. We hypothesize that the lower performance in oncology can be explained by the fact that many chemotherapies are not targeted therapies. CONCLUSION Finding new indications for drugs is extremely valuable. Our results explore how to best use EHR data for finding new indications for first-in-class drugs using a phenotypical-similarity driven approach. Our methods can be integrated with other methods using multiple data modalities such as chemical, molecular, and genetic data.
Affiliation(s)
- Lukas Adamek
  - Data & Computational Science, R&D, Sanofi, 240 Richmond Street West, 3rd Floor, Toronto, M5V 1V6, Ontario, Canada.
- Greg Padiasek
  - Data & Computational Science, R&D, Sanofi, 240 Richmond Street West, 3rd Floor, Toronto, M5V 1V6, Ontario, Canada.
- Chaorui Zhang
  - Data & Computational Science, R&D, Sanofi, 240 Richmond Street West, 3rd Floor, Toronto, M5V 1V6, Ontario, Canada.
- Ingrid O'Dwyer
  - Data & Computational Science, R&D, Sanofi, 240 Richmond Street West, 3rd Floor, Toronto, M5V 1V6, Ontario, Canada.
- Nicolas Capit
  - Clinical Real World Evidence, R&D, Sanofi, 46 Av. de la Grande Armée, Paris, 75017, Île-de-France, France.
- Flavio Dormont
  - Data & Computational Science, R&D, Sanofi, 450 Water St, Cambridge, 02141, MA, United States.
- Ramon Hernandez
  - Clinical Real World Evidence, R&D, Sanofi, 46 Av. de la Grande Armée, Paris, 75017, Île-de-France, France.
- Ziv Bar-Joseph
  - Data & Computational Science, R&D, Sanofi, 450 Water St, Cambridge, 02141, MA, United States.
- Brandon Rufino
  - Data & Computational Science, R&D, Sanofi, 240 Richmond Street West, 3rd Floor, Toronto, M5V 1V6, Ontario, Canada.
2
Su C, Hou Y, Zhou M, Rajendran S, Maasch JRA, Abedi Z, Zhang H, Bai Z, Cuturrufo A, Guo W, Chaudhry FF, Ghahramani G, Tang J, Cheng F, Li Y, Zhang R, DeKosky ST, Bian J, Wang F. Biomedical discovery through the integrative biomedical knowledge hub (iBKH). iScience 2023; 26:106460. [PMID: 37020958 PMCID: PMC10068563 DOI: 10.1016/j.isci.2023.106460] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 06/16/2022] [Revised: 09/20/2022] [Accepted: 03/16/2023] [Indexed: 04/01/2023] Open
Abstract
The abundance of biomedical knowledge gained from biological experiments and clinical practices is an invaluable resource for biomedicine. The emerging biomedical knowledge graphs (BKGs) provide an efficient and effective way to manage the abundant knowledge in biomedical and life science. In this study, we created a comprehensive BKG called the integrative Biomedical Knowledge Hub (iBKH) by harmonizing and integrating information from diverse biomedical resources. To make iBKH easily accessible for biomedical research, we developed a web-based, user-friendly graphical portal that allows fast and interactive knowledge retrieval. We also implemented an efficient and scalable graph learning pipeline for discovering novel biomedical knowledge in iBKH. As a proof of concept, we applied our iBKH-based method to computational in-silico drug repurposing for Alzheimer's disease. The iBKH is publicly available.
Affiliation(s)
- Chang Su
  - Department of Health Service Administration and Policy, College of Public Health, Temple University, Philadelphia, PA 19122, USA
- Yu Hou
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
  - Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Manqi Zhou
  - Department of Computational Biology, Cornell University, Ithaca, NY 14850, USA
- Suraj Rajendran
  - Tri-Institutional Computational Biology & Medicine Program, Cornell University, New York, NY 10065, USA
- Zehra Abedi
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
- Haotan Zhang
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Zilong Bai
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
- Winston Guo
  - Department of Medicine, Weill Cornell Medicine, New York, NY 10021, USA
- Fayzan F. Chaudhry
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Gregory Ghahramani
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY 10065, USA
- Jian Tang
  - Mila-Quebec AI Institute and HEC Montreal, Montreal, QC H2S 3H1, Canada
- Feixiong Cheng
  - Genomic Medicine Institute, Lerner Research Institute, Cleveland Clinic, Cleveland, OH 44195, USA
  - Department of Molecular Medicine, Cleveland Clinic Lerner College of Medicine, Case Western Reserve University, Cleveland, OH 44195, USA
  - Case Comprehensive Cancer Center, Case Western Reserve University School of Medicine, Cleveland, OH 44106, USA
- Yue Li
  - School of Computer Science, McGill University, Montreal, QC H3A 0C6, Canada
- Rui Zhang
  - Department of Surgery, University of Minnesota, Minneapolis, MN 55455, USA
- Steven T. DeKosky
  - Department of Neurology, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Jiang Bian
  - Department of Health Outcomes & Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL 32610, USA
- Fei Wang
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
3
Liu C, Ta CN, Havrilla JM, Nestor JG, Spotnitz ME, Geneslaw AS, Hu Y, Chung WK, Wang K, Weng C. OARD: Open annotations for rare diseases and their phenotypes based on real-world data. Am J Hum Genet 2022; 109:1591-1604. [PMID: 35998640 PMCID: PMC9502051 DOI: 10.1016/j.ajhg.2022.08.002] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 05/03/2022] [Accepted: 08/01/2022] [Indexed: 11/23/2022] Open
Abstract
Diagnosis of rare genetic diseases often relies on phenotype-driven methods, which hinge on the accuracy and completeness of the rare disease phenotypes in the underlying annotation knowledgebase. Existing knowledgebases are often manually curated with additional annotations found in published case reports. Despite their potential, real-world data such as electronic health records (EHRs) have not been fully exploited to derive rare disease annotations. Here, we present open annotation for rare diseases (OARD), a real-world-data-derived resource with annotation for rare-disease-related phenotypes. This resource is derived from the EHRs of two academic health institutions containing more than 10 million individuals spanning wide age ranges and different disease subgroups. By leveraging ontology mapping and advanced natural-language-processing (NLP) methods, OARD automatically and efficiently extracts concepts for both rare diseases and their phenotypic traits from billing codes and lab tests as well as over 100 million clinical narratives. The rare disease prevalence derived by OARD is highly correlated with that annotated in the original rare disease knowledgebase. By performing association analysis, we identified more than 1 million novel disease-phenotype association pairs that were previously missed by human annotation, and >60% were confirmed as true associations via manual review of a list of sampled pairs. Compared to the manually curated annotation, OARD is 100% data driven and its pipeline can be shared across different institutions. By supporting privacy-preserving sharing of aggregated summary statistics, such as term frequencies and disease-phenotype associations, it fills an important gap to facilitate data-driven research in the rare disease community.
Affiliation(s)
- Cong Liu
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Casey N Ta
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Jim M Havrilla
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Jordan G Nestor
  - Division of Nephrology, Department of Medicine, Columbia University, New York, NY 10032, USA
- Matthew E Spotnitz
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
- Andrew S Geneslaw
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Yu Hu
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Wendy K Chung
  - Department of Pediatrics, Columbia University Irving Medical Center, New York, NY 10032, USA
- Kai Wang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Chunhua Weng
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA.
4
Ding P, Pan Y, Wang Q, Xu R. Prediction and evaluation of combination pharmacotherapy using natural language processing, machine learning and patient electronic health records. J Biomed Inform 2022; 133:104164. [PMID: 35985621 DOI: 10.1016/j.jbi.2022.104164] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/24/2022] [Revised: 08/08/2022] [Accepted: 08/11/2022] [Indexed: 11/18/2022]
Abstract
Combination pharmacotherapy targets key disease pathways in a synergistic or additive manner and has high potential in treating complex diseases. Computational methods have been developed to identify combination pharmacotherapy by analyzing large amounts of biomedical data. Existing computational approaches are often underpowered due to their reliance on our limited understanding of disease mechanisms. On the other hand, observable phenotypic inter-relationships among thousands of diseases often reflect their shared genetic and molecular underpinnings, and can therefore offer unique opportunities to design computational models that discover novel combinational therapies by automatically transferring knowledge among phenotypically related diseases. We developed a novel phenome-driven drug discovery system, named TuSDC, which leverages knowledge of existing drug combinations, disease comorbidities, and disease treatments of thousands of disease and drug entities extracted from over 31.5 million biomedical research articles using natural language processing techniques. TuSDC predicts combination pharmacotherapy by extracting representations of diseases and drugs using tensor factorization approaches. In external validation, TuSDC achieved an average precision of 0.77 for top ranked candidates, outperforming a state-of-the-art mechanism-based method for discovering drug combinations in treating hypertension. We evaluated top ranked anti-hypertension drug combinations using electronic health records of 84.7 million unique patients and showed that a novel drug combination, hydrochlorothiazide-digoxin, was associated with significantly lower hazards of subsequent hypertension as compared to the monotherapy hydrochlorothiazide alone (HR: 0.769, 95% CI [0.732, 0.807]) and digoxin alone (HR: 0.857, 95% CI [0.785, 0.936]). Data-driven informatics analyses reveal that the renin-angiotensin system is involved in the synergistic interactions of hydrochlorothiazide and digoxin in regulating hypertension. The prediction model's code with PyTorch version 1.5 is available at http://nlp.case.edu/public/data/TuSDC/.
Affiliation(s)
- Pingjian Ding
  - Center for Artificial Intelligence in Drug Discovery, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, USA
- Yiheng Pan
  - Center for Artificial Intelligence in Drug Discovery, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, USA
- Quanqiu Wang
  - Center for Artificial Intelligence in Drug Discovery, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, USA
- Rong Xu
  - Center for Artificial Intelligence in Drug Discovery, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, USA.
5
Truong TTT, Panizzutti B, Kim JH, Walder K. Repurposing Drugs via Network Analysis: Opportunities for Psychiatric Disorders. Pharmaceutics 2022; 14:1464. [PMID: 35890359 PMCID: PMC9319329 DOI: 10.3390/pharmaceutics14071464] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 06/01/2022] [Revised: 06/30/2022] [Accepted: 07/12/2022] [Indexed: 02/04/2023] Open
Abstract
Despite advances in pharmacology and neuroscience, the path to new medications for psychiatric disorders remains largely stagnant. Drug repurposing offers a more efficient pathway than de novo drug discovery, with lower cost and less risk. Various computational approaches have been applied to mine the vast amount of biomedical data generated over recent decades. Among these methods, network-based drug repurposing stands out as a potent tool for the comprehension of multiple domains of knowledge considering the interactions or associations of various factors. Aligned well with the poly-pharmacology paradigm shift in drug discovery, network-based approaches offer great opportunities to discover repurposing candidates for complex psychiatric disorders. In this review, we present the potential of network-based drug repurposing in psychiatry, focusing on the incentives for using network-centric repurposing, major network-based repurposing strategies and data resources, applications in psychiatry, and challenges of network-based drug repurposing. This review aims to provide readers with an update on network-based drug repurposing in psychiatry. We expect the repurposing approach to become a pivotal tool in the coming years to battle debilitating psychiatric disorders.
Affiliation(s)
- Trang T. T. Truong
  - IMPACT, The Institute for Mental and Physical Health and Clinical Translation, School of Medicine, Deakin University, Geelong 3220, Australia; (T.T.T.T.); (B.P.); (J.H.K.)
- Bruna Panizzutti
  - IMPACT, The Institute for Mental and Physical Health and Clinical Translation, School of Medicine, Deakin University, Geelong 3220, Australia; (T.T.T.T.); (B.P.); (J.H.K.)
- Jee Hyun Kim
  - IMPACT, The Institute for Mental and Physical Health and Clinical Translation, School of Medicine, Deakin University, Geelong 3220, Australia; (T.T.T.T.); (B.P.); (J.H.K.)
  - Mental Health Theme, The Florey Institute of Neuroscience and Mental Health, Parkville 3010, Australia
- Ken Walder
  - IMPACT, The Institute for Mental and Physical Health and Clinical Translation, School of Medicine, Deakin University, Geelong 3220, Australia; (T.T.T.T.); (B.P.); (J.H.K.)
6
Xiang J, Zhang J, Zhao Y, Wu FX, Li M. Biomedical data, computational methods and tools for evaluating disease-disease associations. Brief Bioinform 2022; 23:6522999. [PMID: 35136949 DOI: 10.1093/bib/bbac006] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/19/2021] [Revised: 01/04/2022] [Accepted: 01/05/2022] [Indexed: 12/12/2022] Open
Abstract
In recent decades, exploring potential relationships between diseases has been an active research field. With the rapid accumulation of disease-related biomedical data, many computational methods and tools/platforms have been developed to reveal intrinsic relationships between diseases, which can provide useful insights into the study of complex diseases, e.g. understanding molecular mechanisms of diseases and discovering new treatments. Human complex diseases involve both external phenotypic abnormalities and complex internal molecular mechanisms in organisms. Computational methods with different types of biomedical data from phenotype to genotype can evaluate disease-disease associations at different levels, providing a comprehensive perspective for understanding diseases. In this review, available biomedical data and databases for evaluating disease-disease associations are first summarized. Then, existing computational methods for disease-disease associations are reviewed and classified into five groups according to the biomedical data they use: disease semantic-based, phenotype-based, function-based, representation learning-based and text mining-based methods. Further, we summarize software tools/platforms for computation and analysis of disease-disease associations. Finally, we discuss and summarize research on disease-disease associations. This review provides a systematic overview of current disease association research, which could promote the development and application of computational methods and tools/platforms for disease-disease associations.
Affiliation(s)
- Ju Xiang
  - School of Computer Science and Engineering, Central South University, China
- Jiashuai Zhang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Yichao Zhao
  - School of Computer Science and Engineering, Central South University, China
- Fang-Xiang Wu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha, Hunan 410083, China
- Min Li
  - Division of Biomedical Engineering and Department of Mechanical Engineering at University of Saskatchewan, Saskatoon, Canada
7
Zhao S, Su C, Lu Z, Wang F. Recent advances in biomedical literature mining. Brief Bioinform 2021; 22:bbaa057. [PMID: 32422651 PMCID: PMC8138828 DOI: 10.1093/bib/bbaa057] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 12/11/2019] [Revised: 03/22/2020] [Accepted: 03/25/2020] [Indexed: 01/26/2023] Open
Abstract
Recent years have witnessed a rapid increase in the number of scientific articles in the biomedical domain. This literature is mostly available and readily accessible in electronic format. The domain knowledge hidden in it is critical for biomedical research and applications, which puts biomedical literature mining (BLM) techniques in high demand. Numerous efforts have been made on this topic by both the biomedical informatics (BMI) and computer science (CS) communities. The BMI community focuses more on concrete application problems and thus prefers more interpretable and descriptive methods, while the CS community pursues superior performance and generalization ability and thus develops more sophisticated and universal models. The goal of this paper is to provide a review of the recent advances in BLM from both communities and inspire new research directions.
Affiliation(s)
- Sendong Zhao
  - Department of Healthcare Policy and Research, Weill Medical College of Cornell University, New York, NY 10065, USA
- Chang Su
  - Division of Health Informatics, Department of Healthcare Policy and Research at Weill Cornell Medicine at Cornell University, New York, NY, USA
- Zhiyong Lu
  - National Center for Biotechnology Information (NCBI) at National Library of Medicine, National Institute of Health, Bethesda, MD, USA
- Fei Wang
  - Department of Healthcare Policy and Research, Weill Medical College of Cornell University, New York, NY 10065, USA
8
Gamba A, Salmona M, Bazzoni G. The similarity of inherited diseases (I): clinical similarity within the phenotypic series. BMC Med Genomics 2021; 14:52. [PMID: 33622316 PMCID: PMC7903653 DOI: 10.1186/s12920-021-00900-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/29/2019] [Accepted: 02/10/2021] [Indexed: 11/28/2022] Open
Abstract
BACKGROUND Mutations of different genes often result in clinically similar diseases. Among the datasets of similar diseases, we analyzed the 'phenotypic series' from Online Mendelian Inheritance in Man and examined the similarity of the diseases that belong to the same phenotypic series, because we hypothesize that clinical similarity may unveil shared pathogenic mechanisms. METHODS Specifically, for each pair of diseases, we quantified their similarity, based on both number and information content of the shared clinical phenotypes. Then, we assembled the disease similarity network, in which nodes represent diseases and edges represent clinical similarities. RESULTS On average, diseases have high similarity with other diseases of their own phenotypic series, even though about one third of diseases have their maximal similarity with a disease of another series. Consequently, the network is assortative (i.e., diseases belonging to the same series link preferentially to each other), but the series differ in the way they distribute within the network. Specifically, heterophobic series, which minimize links to other series, form islands at the periphery of the network, whereas heterophilic series, which are highly inter-connected with other series, occupy the center of the network. CONCLUSIONS The finding that the phenotypic series display not only internal similarity (assortativity) but also varying degrees of external similarity (ranging from heterophobicity to heterophilicity) calls for investigation of biological mechanisms that might be shared among different series. The correlation between the clinical and biological similarities of the phenotypic series is analyzed in Part II of this study [1].
Affiliation(s)
- Alessio Gamba
  - Department of Biochemistry and Molecular Pharmacology, Istituto Di Ricerche Farmacologiche Mario Negri IRCCS, Via Mario Negri 2, 20156, Milano, Italy
- Mario Salmona
  - Department of Biochemistry and Molecular Pharmacology, Istituto Di Ricerche Farmacologiche Mario Negri IRCCS, Via Mario Negri 2, 20156, Milano, Italy
- Gianfranco Bazzoni
  - Department of Biochemistry and Molecular Pharmacology, Istituto Di Ricerche Farmacologiche Mario Negri IRCCS, Via Mario Negri 2, 20156, Milano, Italy.
9
Wang Q, Xu R. CoMNRank: An integrated approach to extract and prioritize human microbial metabolites from MEDLINE records. J Biomed Inform 2020; 109:103524. [PMID: 32791237 DOI: 10.1016/j.jbi.2020.103524] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 03/12/2020] [Revised: 07/17/2020] [Accepted: 07/29/2020] [Indexed: 02/06/2023]
Abstract
MOTIVATION Trillions of bacteria in human body (human microbiota) affect human health and diseases by controlling host functions through small molecule metabolites. An accurate and comprehensive catalog of the metabolic output from human microbiota is critical for our deep understanding of how microbial metabolism contributes to human health. The large number of published biomedical research articles is a rich resource of microbiome studies. However, automatically extracting microbial metabolites from free-text documents and differentiating them from other human metabolites is a challenging task. Here we developed an integrated approach called Co-occurrence Metabolite Network Ranking (CoMNRank) by combining named entity extraction, network construction and topic sensitive network-based prioritization to extract and prioritize microbial metabolites from biomedical articles. METHODS The text data included 28,851,232 MEDLINE records. CoMNRank consists of three steps: (1) extraction of human metabolites from MEDLINE records; (2) construction of a weighted co-occurrence metabolite network (CoMN); (3) prioritization and differentiation of microbial metabolites from other human metabolites. RESULTS For the first step of CoMNRank, we extracted 11,846 human metabolites from MEDLINE articles, with a baseline performance of precision of 0.014, recall of 0.959 and F1 of 0.028. We then constructed a weighted CoMN of 6,996 nodes and 986,186 edges. CoMNRank effectively prioritized microbial metabolites: the precision of top ranked metabolites is 0.45, a 31-fold enrichment as compared to the overall precision of 0.014. Manual curation of top 100 metabolites showed a true precision of 0.67, among which 48% true positives are not captured by existing databases. CONCLUSION Our study sets the foundation for future tasks of microbial entity and relationship extractions as well as data-driven studies of how microbial metabolism contributes to human health and diseases.
Affiliation(s)
- QuanQiu Wang
  - Center for Artificial Intelligence in Drug Discovery, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States
- Rong Xu
  - Center for Artificial Intelligence in Drug Discovery, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States.
10
Automatic extraction, prioritization and analysis of gut microbial metabolites from biomedical literature. Sci Rep 2020; 10:9996. [PMID: 32561832 PMCID: PMC7305201 DOI: 10.1038/s41598-020-67075-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 12/20/2019] [Accepted: 06/02/2020] [Indexed: 02/07/2023] Open
Abstract
Many diseases are driven by gene-environment interactions. One important environmental factor is the metabolic output of human gut microbiota. A comprehensive catalog of human metabolites that originate in microbes is critical for data-driven approaches to understand how microbial metabolism contributes to human health and diseases. Here we present a novel integrated approach to automatically extract and analyze microbial metabolites from 28 million published biomedical records. First, we classified 28,851,232 MEDLINE records as microbial metabolism-related or not. Second, candidate microbial metabolites were extracted from the classified texts. Third, we developed signal prioritization algorithms to further differentiate microbial metabolites from metabolites that originate from other sources. Finally, we systematically analyzed the interactions between extracted microbial metabolites and human genes. A total of 11,846 metabolites were extracted from 28 million MEDLINE articles. The combined text classification and signal prioritization significantly enriched true positives among the top-ranked metabolites: manual curation of the top 100 metabolites showed a true precision of 0.55, representing a significant 38.3-fold enrichment as compared to the precision of 0.014 for baseline extraction. More importantly, 29% of the extracted microbial metabolites have not been captured by existing databases. We performed data-driven analysis of the interactions between the extracted microbial metabolites and human genetics. This study represents the first effort towards automatically extracting and prioritizing microbial metabolites from published biomedical literature, which can set a foundation for future tasks of microbial metabolite relationship extraction from literature and facilitate data-driven studies of how microbial metabolism contributes to human diseases.
11
Li Z, Huang Q, Chen X, Wang Y, Li J, Xie Y, Dai Z, Zou X. Identification of Drug-Disease Associations Using Information of Molecular Structures and Clinical Symptoms via Deep Convolutional Neural Network. Front Chem 2020; 7:924. [PMID: 31998700 PMCID: PMC6966717 DOI: 10.3389/fchem.2019.00924] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 07/29/2019] [Accepted: 12/18/2019] [Indexed: 02/02/2023] Open
Abstract
Identifying drug-disease associations is helpful not only for predicting new drug indications and recognizing lead compounds, but also for preventing, diagnosing, and treating diseases. Traditional experimental methods are time consuming, laborious and expensive. Therefore, it is urgent to develop computational methods for predicting potential drug-disease associations on a large scale. Herein, a novel method was proposed to identify drug-disease associations based on the deep learning technique. Molecular structure and clinical symptom information were used to characterize drugs and diseases. Then, a novel two-dimensional matrix was constructed and mapped to a gray-scale image for representing drug-disease association. Finally, a deep convolutional neural network was introduced to build a model for identifying potential drug-disease associations. The performance of the current method was evaluated on the training set and test set, and accuracies of 89.90% and 86.51%, respectively, were obtained. Prediction ability for recognizing new drug indications, lead compounds and true drug-disease associations was also investigated and verified by performing various experiments. Additionally, 3,620,516 potential drug-disease associations were identified and some of them were further validated through docking modeling. It is anticipated that the proposed method may be a powerful large scale virtual screening tool for drug research and development. The MATLAB source code is freely available on request from the authors.
Collapse
Affiliation(s)
- Zhanchao Li
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, China.,School of Chemistry, Sun Yat-Sen University, Guangzhou, China
| | - Qixing Huang
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, China
- Xingyu Chen
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, China
- Yang Wang
- Key Laboratory of Digital Quality Evaluation of Chinese Materia Medica of State Administration of Traditional Chinese Medicine, Guangzhou, China
- Jinlong Li
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, China
- Yun Xie
- School of Chemistry and Chemical Engineering, Guangdong Pharmaceutical University, Guangzhou, China
- Zong Dai
- Key Laboratory of Digital Quality Evaluation of Chinese Materia Medica of State Administration of Traditional Chinese Medicine, Guangzhou, China
- Xiaoyong Zou
- Key Laboratory of Digital Quality Evaluation of Chinese Materia Medica of State Administration of Traditional Chinese Medicine, Guangzhou, China
|
12
|
Jia J, An Z, Ming Y, Guo Y, Li W, Liang Y, Guo D, Li X, Tai J, Chen G, Jin Y, Liu Z, Ni X, Shi T. eRAM: encyclopedia of rare disease annotations for precision medicine. Nucleic Acids Res 2018; 46:D937-D943. [PMID: 29106618 PMCID: PMC5753383 DOI: 10.1093/nar/gkx1062] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 08/15/2017] [Accepted: 10/24/2017] [Indexed: 01/12/2023] Open
Abstract
Rare diseases affect over a hundred million people worldwide, and most of these patients are not accurately diagnosed or effectively treated. The limited knowledge of rare diseases is the biggest obstacle to improving their treatment. Detailed clinical phenotyping is considered a keystone of deciphering genes and realizing precision medicine for rare diseases. Here, we present a standardized system for various types of rare diseases, called encyclopedia of Rare disease Annotations for Precision Medicine (eRAM). eRAM was built by text-mining nearly 10 million scientific publications and electronic medical records, and by integrating various data from existing recognized databases (such as Unified Medical Language System (UMLS), Human Phenotype Ontology, Orphanet, OMIM, GWAS). eRAM systematically incorporates currently available data on clinical manifestations and molecular mechanisms of rare diseases and uncovers many novel associations among diseases. eRAM provides enriched annotations for 15 942 rare diseases, yielding 6147 human disease related phenotype terms, 31 661 mammalian phenotype terms, 10 202 symptoms from UMLS, 18 815 genes and 92 580 genotypes. eRAM can not only provide information about rare disease mechanisms but also help clinicians make accurate diagnostic and therapeutic decisions for rare diseases. eRAM can be freely accessed at http://www.unimd.org/eram/.
Affiliation(s)
- Jinmeng Jia
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Zhongxin An
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Yue Ming
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Yongli Guo
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, the Ministry of Education Key Laboratory of Major Diseases in Children, Beijing Pediatric Research Institute, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing 100045, China
- Wei Li
- Beijing Key Laboratory for Genetics of Birth Defects, The Ministry of Education Key Laboratory of Major Diseases in Children, Center for Medical Genetics, Beijing Pediatric Research Institute, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing 100045, China
- Yunxiang Liang
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Dongming Guo
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Xin Li
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Jun Tai
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, the Ministry of Education Key Laboratory of Major Diseases in Children, Beijing Pediatric Research Institute, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing 100045, China
- Geng Chen
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Yaqiong Jin
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, the Ministry of Education Key Laboratory of Major Diseases in Children, Beijing Pediatric Research Institute, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing 100045, China
- Zhimei Liu
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, the Ministry of Education Key Laboratory of Major Diseases in Children, Beijing Pediatric Research Institute, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing 100045, China
- Xin Ni
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, the Ministry of Education Key Laboratory of Major Diseases in Children, Beijing Pediatric Research Institute, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing 100045, China
- Tieliu Shi
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
|
13
|
Luo L, Zheng C, Wang J, Tan M, Li Y, Xu R. Analysis of disease organ as a novel phenotype towards disease genetics understanding. J Biomed Inform 2019; 95:103235. [PMID: 31207382 PMCID: PMC6644057 DOI: 10.1016/j.jbi.2019.103235] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/03/2018] [Revised: 06/06/2019] [Accepted: 06/13/2019] [Indexed: 11/24/2022]
Abstract
Discerning the modular nature of human diseases through computational approaches calls for diverse data. The finding sites of diseases, like other disease phenotypes, carry rich information for understanding disease genetics. Yet the knowledge contained in disease finding sites has not been comprehensively investigated. In this study, we built a large-scale disease organ network (DON) based on 76,561 disease-organ associations (for 37,615 diseases and 3492 organs) extracted from the Unified Medical Language System (UMLS) Metathesaurus. We investigated how phenotypic organ similarity among diseases in DON reflects disease gene sharing. We constructed a disease genetic network (DGN) using curated disease-gene associations and demonstrated that disease pairs with higher organ similarities not only are more likely to share genes, but also tend to share more genes. Using a community detection algorithm, we showed that phenotypic disease clusters on DON significantly correlated with genetic disease clusters on DGN. We compared DON with a state-of-the-art disease phenotype network, the disease manifestation network (DMN), that we have recently constructed, and demonstrated that DON contains complementary knowledge for disease genetics understanding.
Affiliation(s)
- Lingyun Luo
- School of Computer Science, University of South China, Hengyang, Hunan 421001, China; Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, Cleveland, Ohio 44106, USA.
- Chunlei Zheng
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, Cleveland, Ohio 44106, USA
- Jiaolong Wang
- School of Computer Science, University of South China, Hengyang, Hunan 421001, China
- Minsheng Tan
- School of Computer Science, University of South China, Hengyang, Hunan 421001, China
- Yanshu Li
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, Cleveland, Ohio 44106, USA
- Rong Xu
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, Cleveland, Ohio 44106, USA
|
14
|
Shen F, Zhao Y, Wang L, Mojarad MR, Wang Y, Liu S, Liu H. Rare disease knowledge enrichment through a data-driven approach. BMC Med Inform Decis Mak 2019; 19:32. [PMID: 30764825 PMCID: PMC6376651 DOI: 10.1186/s12911-019-0752-9] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 08/13/2018] [Accepted: 02/01/2019] [Indexed: 01/03/2023] Open
Abstract
BACKGROUND Existing resources to assist the diagnosis of rare diseases are usually curated from the literature that can be limited for clinical use. It often takes substantial effort before the suspicion of a rare disease is even raised to utilize those resources. The primary goal of this study was to apply a data-driven approach to enrich existing rare disease resources by mining phenotype-disease associations from electronic medical record (EMR). METHODS We first applied association rule mining algorithms on EMR to extract significant phenotype-disease associations and enriched existing rare disease resources (Human Phenotype Ontology and Orphanet (HPO-Orphanet)). We generated phenotype-disease bipartite graphs for HPO-Orphanet, EMR, and enriched knowledge base HPO-Orphanet + and conducted a case study on Hodgkin lymphoma to compare performance on differential diagnosis among these three graphs. RESULTS We used disease-disease similarity generated by the eRAM, an existing rare disease encyclopedia, as a gold standard to compare the three graphs with sensitivity and specificity as (0.17, 0.36, 0.46) and (0.52, 0.47, 0.51) for three graphs respectively. We also compared the top 15 diseases generated by the HPO-Orphanet + graph with eRAM and another clinical diagnostic tool, the Phenomizer. CONCLUSIONS Per our evaluation results, our approach was able to enrich existing rare disease knowledge resources with phenotype-disease associations from EMR and thus support rare disease differential diagnosis.
Affiliation(s)
- Feichen Shen
- Department of Health Sciences Research, Mayo Clinic, 205 3rd Ave SW, Rochester, MN, 55905, USA.
- Yiqing Zhao
- Department of Health Sciences Research, Mayo Clinic, 205 3rd Ave SW, Rochester, MN, 55905, USA
- Liwei Wang
- Department of Health Sciences Research, Mayo Clinic, 205 3rd Ave SW, Rochester, MN, 55905, USA
- Majid Rastegar Mojarad
- Department of Health Sciences Research, Mayo Clinic, 205 3rd Ave SW, Rochester, MN, 55905, USA
- Yanshan Wang
- Department of Health Sciences Research, Mayo Clinic, 205 3rd Ave SW, Rochester, MN, 55905, USA
- Sijia Liu
- Department of Health Sciences Research, Mayo Clinic, 205 3rd Ave SW, Rochester, MN, 55905, USA
- Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, 205 3rd Ave SW, Rochester, MN, 55905, USA.
|
15
|
Zheng C, Xu R. Large-scale mining disease comorbidity relationships from post-market drug adverse events surveillance data. BMC Bioinformatics 2018; 19:500. [PMID: 30591027 PMCID: PMC6309066 DOI: 10.1186/s12859-018-2468-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Indexed: 12/20/2022] Open
Abstract
Background Systems approaches to studying disease relationships have wide applications in biomedical discovery, such as disease mechanism understanding and drug discovery. The FDA Adverse Event Reporting System (FAERS) contains rich information about patient diseases, medications, drug adverse events and demographics from 17 million case reports. Here, we explored this data resource to mine disease comorbidity relationships using an association rule mining algorithm and constructed a disease comorbidity network. Results We constructed a disease comorbidity network with 1059 disease nodes and 12,608 edges using association rule mining of FAERS (14,157 rules). We evaluated the performance of comorbidity mining from FAERS using known disease comorbidities of multiple sclerosis (MS), psoriasis and obesity, which represent rare, moderate and common diseases respectively. Comorbidities of MS, obesity and psoriasis obtained from our network achieved precisions of 58.6%, 73.7%, 56.2% and recalls of 87.5%, 69.2% and 72.7% respectively. We performed comparative analysis of the disease comorbidity network with a disease semantic network, a disease genetic network and a disease treatment network. We showed that (1) disease comorbidity clusters exhibit significantly higher semantic similarity than a random network (0.18 vs 0.10); (2) disease comorbidity clusters share significantly more genes (0.46 vs 0.06); and (3) disease comorbidity clusters share significantly more drugs (0.64 vs 0.17). Finally, we demonstrated that the disease comorbidity network has potential in uncovering novel disease relationships using asthma as a case study. Conclusions Our study presented the first comprehensive attempt to build a disease comorbidity network from the FDA Adverse Event Reporting System. This network correlates well with disease semantic similarity, disease genetics and disease treatment, and has great potential in disease genetics prediction and drug discovery.
Affiliation(s)
- Chunlei Zheng
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, 2103 Cornell Road, Cleveland, 44106, OH, USA
- Rong Xu
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, 2103 Cornell Road, Cleveland, 44106, OH, USA.
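The association rule mining that the Zheng and Xu abstract above applies to FAERS can be illustrated with a minimal sketch. The abstract does not state the algorithm variant or the thresholds, so everything here is an assumption: the function name `pairwise_rules`, the thresholds, and the toy case reports are made up for illustration, and only pairwise rules (one disease implying another) are considered.

```python
from collections import Counter
from itertools import combinations


def pairwise_rules(reports, min_support=0.3, min_confidence=0.6):
    """Mine rules 'disease x -> disease y' from sets of co-reported diseases.

    Returns sorted tuples (x, y, support, confidence, lift).
    The thresholds are illustrative assumptions, not values from the paper.
    """
    n = len(reports)
    item_counts = Counter()
    pair_counts = Counter()
    for report in reports:
        diseases = sorted(set(report))
        item_counts.update(diseases)
        pair_counts.update(combinations(diseases, 2))

    rules = []
    for (a, b), count in pair_counts.items():
        support = count / n  # fraction of reports containing both diseases
        if support < min_support:
            continue
        for x, y in ((a, b), (b, a)):
            confidence = count / item_counts[x]        # estimate of P(y | x)
            lift = confidence / (item_counts[y] / n)   # P(y | x) / P(y)
            if confidence >= min_confidence:
                rules.append((x, y, support, confidence, lift))
    return sorted(rules)


# Hypothetical case reports, invented purely for this example.
toy_reports = [
    {"asthma", "rhinitis"},
    {"asthma", "rhinitis", "eczema"},
    {"asthma", "obesity"},
    {"rhinitis", "eczema"},
    {"obesity", "diabetes"},
]
rules = pairwise_rules(toy_reports)
for x, y, support, confidence, lift in rules:
    print(f"{x} -> {y}: support={support:.2f} confidence={confidence:.2f} lift={lift:.2f}")
```

Each surviving rule could then become an edge between two disease nodes, which is one plausible way to read "constructed a disease comorbidity network" in the abstract.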
|
16
|
Zheng C, Xu R. The Alzheimer's comorbidity phenome: mining from a large patient database and phenome-driven genetics prediction. JAMIA Open 2018; 2:131-138. [PMID: 30944912 PMCID: PMC6434979 DOI: 10.1093/jamiaopen/ooy050] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Received: 07/24/2018] [Revised: 10/23/2018] [Accepted: 12/05/2018] [Indexed: 01/08/2023] Open
Abstract
Objective Alzheimer’s disease (AD) is a severe neurodegenerative disorder and has become a global public health problem. Intensive research has been conducted for AD. But the pathophysiology of AD is still not elucidated. Disease comorbidity often associates diseases with overlapping patterns of genetic markers. This may inform a common etiology and suggest essential protein targets. US Food and Drug Administration (FDA) Adverse Event Reporting System (FAERS) collects large-scale postmarketing surveillance data that provide a unique opportunity to investigate disease co-occurrence pattern. We aim to construct a heterogeneous network that integrates disease comorbidity network (DCN) from FAERS with protein–protein interaction (PPI) to prioritize the AD risk genes using network-based ranking algorithm. Materials and Methods We built a DCN based on indication data from FAERS using association rule mining. DCN was further integrated with PPI network. We used random walk with restart ranking algorithm to prioritize AD risk genes. Results We evaluated the performance of our approach using AD risk genes curated from genetic association studies. Our approach achieved an area under a receiver operating characteristic curve of 0.770. Top 500 ranked genes achieved 5.53-fold enrichment for known AD risk genes as compared to random expectation. Pathway enrichment analysis using top-ranked genes revealed that two novel pathways, ERBB and coagulation pathways, might be involved in AD pathogenesis. Conclusion We innovatively leveraged FAERS, a comprehensive data resource for FDA postmarket drug safety surveillance, for large-scale AD comorbidity mining. This exploratory study demonstrated the potential of disease-comorbidities mining from FAERS in AD genetics discovery.
Affiliation(s)
- Chunlei Zheng
- Department of Population and Quantitative Health Sciences, Institute of Computational Biology, School of Medicine, Case Western Reserve University, Cleveland, Ohio, USA
- Rong Xu
- Department of Population and Quantitative Health Sciences, Institute of Computational Biology, School of Medicine, Case Western Reserve University, Cleveland, Ohio, USA
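The random walk with restart ranking mentioned in the Zheng and Xu abstract above can be sketched on a toy network. This is a generic formulation, not the authors' implementation: the restart probability, the convergence tolerance, and the five-node graph (one disease node `AD` plus hypothetical genes `geneA` to `geneD`) are all assumptions chosen for illustration.

```python
def random_walk_with_restart(graph, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for every node in `graph`.

    graph: dict mapping node -> dict of neighbour -> edge weight. Edges are
           undirected, so each appears in both directions; every node needs
           at least one edge so that probability mass is conserved.
    seeds: set of nodes the walker jumps back to with probability `restart`.
    """
    nodes = list(graph)
    strength = {u: sum(graph[u].values()) for u in nodes}
    p0 = {u: (1.0 / len(seeds) if u in seeds else 0.0) for u in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart term, then spread the remaining mass along weighted edges.
        nxt = {u: restart * p0[u] for u in nodes}
        for u in nodes:
            for v, w in graph[u].items():
                nxt[v] += (1.0 - restart) * p[u] * w / strength[u]
        change = sum(abs(nxt[u] - p[u]) for u in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Hypothetical heterogeneous network: a disease node linked to gene nodes.
toy_graph = {
    "AD": {"geneA": 1.0, "geneB": 1.0},
    "geneA": {"AD": 1.0, "geneB": 1.0},
    "geneB": {"AD": 1.0, "geneA": 1.0, "geneC": 1.0},
    "geneC": {"geneB": 1.0, "geneD": 1.0},
    "geneD": {"geneC": 1.0},
}
scores = random_walk_with_restart(toy_graph, seeds={"AD"})
gene_ranking = sorted((n for n in scores if n != "AD"), key=scores.get, reverse=True)
print(gene_ranking)
```

Genes closer to the seed, or better connected to it, receive higher scores, which is the intuition behind using the ranking to prioritize candidate risk genes.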
|
17
|
Wang Q, Xu R. Disease comorbidity-guided drug repositioning: a case study in schizophrenia. AMIA Annu Symp Proc 2018; 2018:1300-1309. [PMID: 30815174 PMCID: PMC6371343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
The key to any computational drug repositioning is the availability of relevant data in machine-understandable format. While large amounts of genetic, genomic and chemical data are publicly available, large-scale higher-level disease and drug phenotypic data are limited. We recently constructed a large-scale disease-comorbidity relationship knowledge base (dCombKB) and a comprehensive drug-treatment relationship knowledge base (TreatKB) from 21 million biomedical research articles and other resources. In this study, we demonstrated the potential of dCombKB and TreatKB in drug repositioning for schizophrenia (SCZ), one of the top ten illnesses contributing to the global burden of disease. dCombKB contains 121,359 unique disease-disease comorbidity pairs for 23,041 diseases. TreatKB contains 208,330 unique drug-disease treatment pairs for 2,484 drugs and 24,511 diseases. We constructed a phenotypic comorbidity disease network (PDN) of 14,645 disease nodes and 101,275 edges based on dCombKB. We applied a standard network-based ranking algorithm to find diseases that are phenotypically related to SCZ. We developed a drug prioritization system, PhenoPredict-CDN, to systematically reposition drugs for SCZ from diseases phenotypically related to SCZ. PhenoPredict-CDN found all 18 FDA-approved SCZ drugs and ranked them highly as tested in a de-novo validation setting (recall: 1.0, mean ranking: top 6.05%, median ranking: top 1.65%). When compared to PREDICT, a comprehensive drug repositioning system, for novel predictions, PhenoPredict-CDN outperformed PREDICT in Precision-Recall (PR) curves across three different evaluation datasets. Compared to PREDICT, PhenoPredict-CDN showed significant improvements of 110.0-230.0% in mean average precision. In summary, large-scale higher-level disease-comorbidity relationship data extracted from the biomedical literature have potential in drug discovery for SCZ, a complex disease with unknown pathophysiological mechanisms.
All the data are publicly available: dCombKB at http://nlp.case.edu/public/data/dCombKB, TreatKB at http://nlp.case.edu/public/data/treatKB/, and predictions for SCZ at http://nlp.case.edu/public/data/SCZ_CDN/.
Affiliation(s)
- QuanQiu Wang
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, Cleveland OH 44106
- Rong Xu
- Department of Population and Quantitative Health Sciences, School of Medicine, Case Western Reserve University, Cleveland OH 44106
|
18
|
Hankosky ER, Bush HM, Dwoskin LP, Harris DR, Henderson DW, Zhang GQ, Freeman PR, Talbert JC. Retrospective analysis of health claims to evaluate pharmacotherapies with potential for repurposing: Association of bupropion and stimulant use disorder remission. AMIA Annu Symp Proc 2018; 2018:1292-1299. [PMID: 30815171 PMCID: PMC6371318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
Drug repurposing is the identification of novel indication(s) for existing medications. Health claims data provide a burgeoning resource to evaluate pharmacotherapies with repurposing potential. To demonstrate a workflow for drug repurposing using claims data, we assessed the association between prescription of bupropion and stimulant use disorder (StUD) remission. Using the Truven Marketscan database, 96,156 individuals with a StUD were identified. Logistic regression was used to model the association between new bupropion prescriptions and remission while controlling for age, sex, region, StUD severity, antidepressant co-prescriptions, and comorbid mood and attention disorders. Prescription of bupropion within 30 days of first documented StUD diagnosis increased odds of a subsequent remission diagnosis by 2.1 times (99% confidence interval: 1.09-3.89) in individuals with an amphetamine use disorder, but not those with a cocaine use disorder. This work provides a framework for reverse-translational drug repurposing, which may be applied to many other medical conditions.
|
19
|
Jia J, Wang R, An Z, Guo Y, Ni X, Shi T. RDAD: A Machine Learning System to Support Phenotype-Based Rare Disease Diagnosis. Front Genet 2018; 9:587. [PMID: 30564269 PMCID: PMC6288202 DOI: 10.3389/fgene.2018.00587] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Received: 08/12/2018] [Accepted: 11/15/2018] [Indexed: 01/21/2023] Open
Abstract
DNA sequencing has allowed for the discovery of the genetic cause for a considerable number of diseases, paving the way for new disease diagnostics. However, due to the lack of clinical samples and records, the molecular cause for rare diseases is always hard to identify, significantly limiting the number of rare Mendelian diseases diagnosed through sequencing technologies. Clinical phenotype information therefore becomes a major resource to diagnose rare diseases. In this article, we adopted both a phenotypic similarity method and a machine learning method to build four diagnostic models to support rare disease diagnosis. All the diagnostic models were validated using the real medical records from RAMEDIS. Each model provides a list of the top 10 candidate diseases as the prediction outcome and the results showed that all models had a high diagnostic precision (≥98%) with the highest recall reaching up to 95% while the models with machine learning methods showed the best performance. To promote effective diagnosis for rare disease in clinical application, we developed the phenotype-based Rare Disease Auxiliary Diagnosis system (RDAD) to assist clinicians in diagnosing rare diseases with the above four diagnostic models. The system is freely accessible through http://www.unimd.org/RDAD/.
Affiliation(s)
- Jinmeng Jia
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, The Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Ruiyuan Wang
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, The Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Zhongxin An
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, The Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- Yongli Guo
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, The Ministry of Education Key Laboratory of Major Diseases in Children, Beijing Pediatric Research Institute, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing, China
- Xi Ni
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, The Ministry of Education Key Laboratory of Major Diseases in Children, Beijing Pediatric Research Institute, Beijing Children's Hospital, Capital Medical University, National Center for Children's Health, Beijing, China
- Tieliu Shi
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, The Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, China
- National Center for International Research of Biological Targeting Diagnosis and Therapy/Guangxi Key Laboratory of Biological Targeting Diagnosis and Therapy Research/Collaborative Innovation Center for Targeting Tumor Diagnosis and Therapy, Guangxi Medical University, Nanning, Guangxi, China
|
20
|
Wang Q, Xu R. Drug repositioning for prostate cancer: using a data-driven approach to gain new insights. AMIA Annu Symp Proc 2018; 2017:1724-1733. [PMID: 29854243 PMCID: PMC5977574] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
Prostate cancer (PC) is the most common cancer and the third leading cause of cancer death in men worldwide. Despite its high incidence and mortality, the likelihood of a cure is low for late-stage PC. There is an unmet need for more effective agents for treating PC. Here, we present a drug repositioning system, GenoPredict, for finding innovative drug candidates for treating PC. GenoPredict leverages a large amount of disease genomics data and a large-scale drug treatment knowledge base (TreatKB) that we recently constructed. We first constructed a genetic disease network (GDN) comprising 882 nodes and 200,758 edges and applied a network-based ranking algorithm to find diseases from GDN that are genetically related to PC. We developed a drug prioritization algorithm to reposition drugs from PC-related diseases to treat PC. When evaluated in a de-novo prediction setting using 27 FDA-approved PC drugs, GenoPredict found 25 of 27 FDA-approved PC drugs and ranked them highly (recall: 0.925, mean ranking: 27.3%, median ranking: 15.6%). When compared to PREDICT, a comprehensive drug repositioning system, in novel predictions, GenoPredict performed better than PREDICT across two evaluation datasets. GenoPredict achieved a mean average precision (MAP) of 0.447 when evaluated with 172 PC drugs extracted from 172,888 clinical trial reports, representing a 164.5% improvement as compared to a MAP of 0.169 for PREDICT. When evaluated with 72 PC drugs extracted from 43,811 ongoing clinical trial reports, GenoPredict achieved a MAP of 0.278, representing a 231.1% improvement as compared to a MAP of 0.084 for PREDICT. The data are publicly available at: http://nlp.case.edu/public/data/PC_GenoPredict and http://nlp.case.edu/public/data/treatKB.
Affiliation(s)
- Rong Xu
- Department of Epidemiology and Biostatistics, School of Medicine, Case Western Reserve University, Cleveland OH 44106
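The mean average precision (MAP) figures quoted in the Wang and Xu abstract above can be made concrete with a short sketch. The abstract does not say exactly how MAP was computed, so the normalisation by the number of relevant drugs is an assumption, the helper names are hypothetical, and the ranked list is invented; only the two MAP values 0.447 and 0.169 come from the abstract.

```python
def average_precision(ranked, relevant):
    """Average of precision@k taken at each rank k that holds a relevant item."""
    hits = 0
    total = 0.0
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / rank
    return total / len(relevant) if relevant else 0.0


def mean_average_precision(queries):
    """Mean of average precision over (ranked_list, relevant_set) pairs."""
    return sum(average_precision(r, rel) for r, rel in queries) / len(queries)


def relative_improvement(new, baseline):
    """Percentage gain of `new` over `baseline`."""
    return 100.0 * (new - baseline) / baseline


# Hypothetical ranking: relevant drugs sit at ranks 1, 3 and 5.
ranked = ["drugA", "drugB", "drugC", "drugD", "drugE"]
relevant = {"drugA", "drugC", "drugE"}
ap = average_precision(ranked, relevant)  # (1/1 + 2/3 + 3/5) / 3

# MAP values reported in the abstract for the first evaluation dataset.
gain = relative_improvement(0.447, 0.169)
print(f"AP = {ap:.3f}, improvement = {gain:.1f}%")
```

With the reported values, the relative gain works out to roughly 164.5%, which matches the improvement stated for the first evaluation dataset.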
|
21
|
Jia J, An Z, Ming Y, Guo Y, Li W, Li X, Liang Y, Guo D, Tai J, Chen G, Jin Y, Liu Z, Ni X, Shi T. PedAM: a database for Pediatric Disease Annotation and Medicine. Nucleic Acids Res 2018; 46:D977-D983. [PMID: 29126123 PMCID: PMC5753298 DOI: 10.1093/nar/gkx1049] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Received: 08/15/2017] [Revised: 10/04/2017] [Accepted: 10/24/2017] [Indexed: 12/14/2022] Open
Abstract
A significant number of children around the world suffer from the consequences of misdiagnosis and ineffective treatment for various diseases. To facilitate precision medicine in pediatrics, a database named Pediatric Disease Annotations & Medicines (PedAM) has been built to standardize and classify pediatric diseases. PedAM integrates both biomedical resources and clinical data from Electronic Medical Records to support the development of computational tools, which enables robust data analysis and integration. It also uses disease-manifestation (D-M) associations integrated from existing biomedical ontologies as prior knowledge to automatically recognize text-mined, D-M-specific syntactic patterns from 774 514 full-text articles and 8 848 796 abstracts in MEDLINE. Additionally, disease connections based on phenotypes or genes can be visualized on the web page of PedAM. Currently, PedAM contains 8528 standardized pediatric disease terms (4542 unique disease concepts and 3986 synonyms) with eight annotation fields for each disease, including definition, synonyms, gene, symptom, cross-reference (Xref), human phenotypes and the corresponding phenotypes in the mouse. PedAM is freely accessible at http://www.unimd.org/pedam/.
Affiliation(s)
- Jinmeng Jia
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Zhongxin An
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Yue Ming
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Yongli Guo
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, the Ministry of Education Key Laboratory of Major Diseases in Children, Beijing Pediatric Research Institute, Beijing Children’s Hospital, Capital Medical University, National Center for Children’s Health, Beijing 100045, China
- Wei Li
- Beijing Key Laboratory for Genetics of Birth Defects, The Ministry of Education Key Laboratory of Major Diseases in Children, Center for Medical Genetics, Beijing Pediatric Research Institute, Beijing Children’s Hospital, Capital Medical University, National Center for Children’s Health, Beijing 100045, China
- Xin Li
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Yunxiang Liang
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Dongming Guo
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Jun Tai
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, the Ministry of Education Key Laboratory of Major Diseases in Children, Beijing Pediatric Research Institute, Beijing Children’s Hospital, Capital Medical University, National Center for Children’s Health, Beijing 100045, China
- Geng Chen
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Yaqiong Jin
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, the Ministry of Education Key Laboratory of Major Diseases in Children, Beijing Pediatric Research Institute, Beijing Children’s Hospital, Capital Medical University, National Center for Children’s Health, Beijing 100045, China
- Zhimei Liu
- Beijing Key Laboratory for Pediatric Diseases of Otolaryngology, Head and Neck Surgery, the Ministry of Education Key Laboratory of Major Diseases in Children, Beijing Pediatric Research Institute, Beijing Children’s Hospital, Capital Medical University, National Center for Children’s Health, Beijing 100045, China
- Xin Ni
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
- Tieliu Shi
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai 200241, China
|
22
|
Jia J, Shi T. Towards efficiency in rare disease research: what is distinctive and important? SCIENCE CHINA-LIFE SCIENCES 2017. [PMID: 28639105 DOI: 10.1007/s11427-017-9099-3] [Citation(s) in RCA: 24] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/13/2022]
Abstract
Characterized by their low prevalence, rare diseases are often chronically debilitating or life threatening. Despite their low prevalence, the aggregate number of individuals suffering from a rare disease is estimated to be nearly 400 million worldwide. Over the past decades, efforts from researchers, clinicians, and pharmaceutical industries have been focused on both the diagnosis and therapy of rare diseases. However, because of the lack of data and medical records for individual rare diseases and the high cost of orphan drug development, only limited progress has been achieved. In recent years, the rapid development of next-generation sequencing (NGS)-based technologies, as well as the popularity of precision medicine has facilitated a better understanding of rare diseases and their molecular etiology. As a result, molecular subclassification can be identified within each disease more clearly, significantly improving diagnostic accuracy. However, providing appropriate care for patients with rare diseases is still an enormous challenge. In this review, we provide a brief introduction to the challenges of rare disease research and make suggestions on where and how our efforts should be focused.
Affiliation(s)
- Jinmeng Jia
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China
- Tieliu Shi
- The Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China.
|
23
|
Chen Y, Xu R. Context-sensitive network-based disease genetics prediction and its implications in drug discovery. Bioinformatics 2017; 33:1031-1039. [PMID: 28062449 DOI: 10.1093/bioinformatics/btw737] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/13/2016] [Accepted: 11/19/2016] [Indexed: 01/05/2023] Open
Abstract
Motivation Disease phenotype networks play an important role in computational approaches to identifying new disease-gene associations. Current disease phenotype networks often model disease relationships based on pairwise similarities and therefore ignore the specific context in which two diseases are connected. In this study, we propose a new strategy to model disease associations using context-sensitive networks (CSNs). We developed a CSN-based phenome-driven approach for disease genetics prediction, and investigated the translational potential of the predicted genes in drug discovery. Results We constructed CSNs by directly connecting diseases with associated phenotypes. Here, we constructed two CSNs using different data sources; the two networks contain 26 790 and 13 822 nodes respectively. We integrated the CSNs with a genetic functional relationship network and predicted disease genes using a network-based ranking algorithm. For comparison, we built Similarity-Based disease Networks (SBN) using the same disease phenotype data. In a de novo cross validation for 3324 diseases, the CSN-based approach significantly improved the average rank of all tested genes from the top 12.6% to the top 8.8% compared with the SBN-based approach ( p<e-22 ). The area under the receiver operating characteristic curve for the CSN approach was also significantly higher than for the SBN approach (0.91 versus 0.87, p<e-3 ). In addition, we predicted genes for Parkinson's disease (PD) using CSNs, and demonstrated that the top-ranked genes are highly relevant to PD pathogenesis. We pinpointed a top-ranked drug target gene for PD, and found its association with neurodegeneration supported by the literature. In summary, CSNs significantly improve disease genetics prediction compared with SBNs and provide leads for potential drug targets. Availability and Implementation nlp.case.edu/public/data/. Contact rxx@case.edu.
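The abstract above does not say which network-based ranking algorithm was used. Purely as an illustration, the following sketch ranks genes with random walk with restart, a common choice for this kind of prioritization, on a tiny network in which a disease is linked to phenotypes and phenotypes to genes. Every node name, edge, and parameter value here is a hypothetical assumption and is not taken from the paper.

```python
def build_adjacency(edges):
    """Build an undirected adjacency map from a list of (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-10, max_iter=1000):
    """Return steady-state visiting probabilities for a walk restarting at `seeds`."""
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        # Restart mass goes back to the seeds; the rest spreads to neighbours.
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            share = (1.0 - restart) * p[u] / len(adj[u])
            for v in adj[u]:
                nxt[v] += share
        delta = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if delta < tol:
            break
    return p


# Hypothetical toy network: disease -> phenotypes -> genes, plus one gene-gene link.
edges = [
    ("disease_D", "phen_tremor"),
    ("disease_D", "phen_rigidity"),
    ("phen_tremor", "gene_A"),
    ("phen_rigidity", "gene_A"),
    ("phen_rigidity", "gene_B"),
    ("gene_B", "gene_C"),
]
adj = build_adjacency(edges)
scores = random_walk_with_restart(adj, seeds={"disease_D"})
genes = [n for n in adj if n.startswith("gene_")]
ranked_genes = sorted(genes, key=lambda g: scores[g], reverse=True)
```

In this toy network the gene attached to both phenotypes of the seed disease ends up ranked first, which is the intuition behind connecting diseases to genes through shared phenotypes rather than through a single pairwise similarity score.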
|
24
|
Silberberg Y, Kupiec M, Sharan R. GLADIATOR: a global approach for elucidating disease modules. Genome Med 2017; 9:48. [PMID: 28549478 PMCID: PMC5446740 DOI: 10.1186/s13073-017-0435-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2016] [Accepted: 05/04/2017] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Understanding the genetic basis of disease is an important challenge in biology and medicine. The observation that disease-related proteins often interact with one another has motivated numerous network-based approaches for deciphering disease mechanisms. In particular, protein-protein interaction networks were successfully used to illuminate disease modules, i.e., interacting proteins working in concert to drive a disease. The identification of these modules can further our understanding of disease mechanisms. METHODS We devised a global method for the prediction of multiple disease modules simultaneously named GLADIATOR (GLobal Approach for DIsease AssociaTed mOdule Reconstruction). GLADIATOR relies on a gold-standard disease phenotypic similarity to obtain a pan-disease view of the underlying modules. To traverse the search space of potential disease modules, we applied a simulated annealing algorithm aimed at maximizing the correlation between module similarity and the gold-standard phenotypic similarity. Importantly, this optimization is employed over hundreds of diseases simultaneously. RESULTS GLADIATOR's predicted modules highly agree with current knowledge about disease-related proteins. Furthermore, the modules exhibit high coherence with respect to functional annotations and are highly enriched with known curated pathways, outperforming previous methods. Examination of the predicted proteins shared by similar diseases demonstrates the diverse role of these proteins in mediating related processes across similar diseases. Last, we provide a detailed analysis of the suggested molecular mechanism predicted by GLADIATOR for hyperinsulinism, suggesting novel proteins involved in its pathology. CONCLUSIONS GLADIATOR predicts disease modules by integrating knowledge of disease-related proteins and phenotypes across multiple diseases. 
The predicted modules are functionally coherent and are more in line with current biological knowledge compared to modules obtained using previous disease-centric methods. The source code for GLADIATOR can be downloaded from http://www.cs.tau.ac.il/~roded/GLADIATOR.zip .
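The optimization described in this abstract, simulated annealing that maximizes the correlation between module similarity and a gold-standard phenotypic similarity, can be illustrated with a toy sketch. This is not the GLADIATOR code: it ignores the protein-protein interaction network, uses Jaccard overlap as an assumed module similarity, and all disease names, protein names, similarity values, and annealing parameters are hypothetical.

```python
import math
import random


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either vector is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def objective(modules, gold):
    """Correlation between pairwise module similarity and gold similarity."""
    pairs = sorted(gold)
    module_sims = [jaccard(modules[a], modules[b]) for a, b in pairs]
    return pearson(module_sims, [gold[p] for p in pairs])


def anneal(seeds, candidates, gold, steps=5000, t_start=0.1, t_end=0.001, rng=None):
    """Toggle proteins in and out of modules, accepting worse moves with a
    probability that shrinks as the temperature cools (all diseases at once)."""
    rng = rng or random.Random(0)
    modules = {d: set(s) for d, s in seeds.items()}
    score = objective(modules, gold)
    best = {d: set(m) for d, m in modules.items()}
    best_score = score
    diseases = sorted(modules)
    for step in range(steps):
        temp = t_start * (t_end / t_start) ** (step / steps)  # geometric cooling
        d = rng.choice(diseases)
        protein = rng.choice(candidates)
        if protein in seeds[d]:
            continue  # seed proteins always stay in their module
        modules[d] ^= {protein}  # toggle membership
        new_score = objective(modules, gold)
        if new_score >= score or rng.random() < math.exp((new_score - score) / temp):
            score = new_score
            if score > best_score:
                best = {k: set(m) for k, m in modules.items()}
                best_score = score
        else:
            modules[d] ^= {protein}  # reject: undo the toggle
    return best, best_score


# Hypothetical inputs: one seed protein per disease and a made-up gold standard.
seeds = {"d1": {"P1"}, "d2": {"P2"}, "d3": {"P3"}}
candidates = ["P1", "P2", "P3", "P4", "P5", "P6"]
gold = {("d1", "d2"): 0.8, ("d1", "d3"): 0.1, ("d2", "d3"): 0.2}
best_modules, best_score = anneal(seeds, candidates, gold)
```

The point of the sketch is the global objective: a single score is computed over all disease pairs, so a change to one module is judged by its effect on the agreement across every disease rather than on that disease alone.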
Affiliation(s)
- Yael Silberberg
- Department of Molecular Microbiology and Biotechnology, Tel Aviv University, Tel Aviv, Israel
- Martin Kupiec
- Department of Molecular Microbiology and Biotechnology, Tel Aviv University, Tel Aviv, Israel
- Roded Sharan
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel.
|
25
|
Wooden B, Goossens N, Hoshida Y, Friedman SL. Using Big Data to Discover Diagnostics and Therapeutics for Gastrointestinal and Liver Diseases. Gastroenterology 2017; 152:53-67.e3. [PMID: 27773806 PMCID: PMC5193106 DOI: 10.1053/j.gastro.2016.09.065] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 06/20/2016] [Revised: 09/15/2016] [Accepted: 09/25/2016] [Indexed: 12/13/2022]
Abstract
Technologies such as genome sequencing, gene expression profiling, proteomic and metabolomic analyses, electronic medical records, and patient-reported health information have produced large amounts of data from various populations, cell types, and disorders (big data). However, these data must be integrated and analyzed if they are to produce models or concepts about physiological function or mechanisms of pathogenesis. Many of these data are available to the public, allowing researchers anywhere to search for markers of specific biological processes or therapeutic targets for specific diseases or patient types. We review recent advances in the fields of computational and systems biology and highlight opportunities for researchers to use big data sets in the fields of gastroenterology and hepatology to complement traditional means of diagnostic and therapeutic discovery.
Affiliation(s)
- Benjamin Wooden
- Division of Liver Diseases, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York
- Nicolas Goossens
- Division of Liver Diseases, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York; Division of Gastroenterology and Hepatology, Department of Medical Specialties, Geneva University Hospital, Geneva, Switzerland
- Yujin Hoshida
- Division of Liver Diseases, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York.
- Scott L Friedman
- Division of Liver Diseases, Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, New York
|
26
|
Chen Y, Xu R. Drug repurposing for glioblastoma based on molecular subtypes. J Biomed Inform 2016; 64:131-138. [PMID: 27697594 PMCID: PMC6146394 DOI: 10.1016/j.jbi.2016.09.019] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2016] [Revised: 08/23/2016] [Accepted: 09/27/2016] [Indexed: 01/12/2023]
Abstract
A recent multi-platform analysis by The Cancer Genome Atlas identified four distinct molecular subtypes for glioblastoma (GBM) and demonstrated that the subtypes correlate with clinical phenotypes and treatment responses. In this study, we developed a computational drug repurposing approach to predict GBM drugs based on the molecular subtypes. Our approach leverages the genomic signature for each GBM subtype, and integrates the human cancer genomics with mouse phenotype data to identify the opportunity of reusing the FDA-approved agents to treat specific GBM subtypes. Specifically, we first constructed the phenotype profile for each GBM subtype using their genomic signatures. For each approved drug, we also constructed a phenotype profile using the drug target genes. Then we developed an algorithm to match and prioritize drugs based on their phenotypic similarities to the GBM subtypes. Our approach is highly generalizable for other disorders if provided with a list of disorder-specific genes. We first evaluated the approach in predicting drugs for the whole GBM. For a combined set of approved, potential and off-label GBM drugs, we achieved a median rank of 9.3%, which is significantly higher (p
Affiliation(s)
- Yang Chen
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106, United States
- Rong Xu
- Department of Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH 44106, United States.
|
27
|
Xu R, Wang Q. A genomics-based systems approach towards drug repositioning for rheumatoid arthritis. BMC Genomics 2016; 17 Suppl 7:518. [PMID: 27557330 PMCID: PMC5001200 DOI: 10.1186/s12864-016-2910-0] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
Background Rheumatoid arthritis (RA) is a chronic autoimmune disease characterized by inflammation and destruction of synovial joints. RA affects up to 1 % of the population worldwide. Currently, there are no drugs that can cure RA or achieve sustained remission. The unknown cause of the disease represents a significant challenge in the drug development. In this study, we address this challenge by proposing an alternative drug discovery approach that integrates and reasons over genetic interrelationships between RA and other genetic diseases as well as a large amount of higher-level drug treatment data. We first constructed a genetic disease network using disease genetics data from Genome-Wide Association Studies (GWAS). We developed a network-based ranking algorithm to prioritize diseases genetically-related to RA (RA-related diseases). We then developed a drug prioritization algorithm to reposition drugs from RA-related diseases to treat RA. Results Our algorithm found 74 of the 80 FDA-approved RA drugs and ranked them highly (recall: 0.925, median ranking: 8.93 %), demonstrating the validity of our strategy. When compared to a study that used GWAS data to directly connect RA-associated genes to drug targets (“direct genetics-based” approach), our algorithm (“indirect genetics-based”) achieved a comparable overall performance, but complementary precision and recall in retrospective validation (precision: 0.22, recall: 0.36; F1: 0.27 vs. precision: 0.74, recall: 0.16; F1: 0.28). Our approach performed significantly better in novel predictions when evaluated using 165 not-yet-FDA-approved RA drugs (precision: 0.46, recall: 0.50; F1: 0.47 vs. precision: 0.40, recall: 0.006; F1: 0.01). 
Conclusions In summary, although the fundamental pathophysiological mechanisms remain uncharacterized, our proposed computation-based drug discovery approach to analyzing genetic and treatment interrelationships among thousands of diseases and drugs can facilitate the discovery of innovative drugs for treating RA.
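The precision, recall, and F1 figures quoted in this abstract are related by the standard harmonic-mean formula. The short sketch below recomputes a few of them from the rounded values printed above; the function and variable names are illustrative only.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Recall for the approved drugs: 74 of 80 found.
recall_approved = 74 / 80

# Retrospective validation, "indirect genetics-based" approach.
f1_indirect = f1_score(0.22, 0.36)

# Novel predictions, "direct genetics-based" approach.
f1_direct_novel = f1_score(0.40, 0.006)
```

Recomputing from the rounded precision and recall will not always match the reported F1 to the second decimal; for example, 0.74 and 0.16 give about 0.26 against a reported 0.28, presumably because the published values were derived from unrounded figures.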
Affiliation(s)
- Rong Xu
- Department of Epidemiology and Biostatistics, Institute of Computational Biology, School of Medicine, Case Western Reserve University, 2103 Cornell Road, Cleveland, 44106, OH, USA.
|
28
|
Cai X, Chen Y, Gao Z, Xu R. Explore Small Molecule-induced Genome-wide Transcriptional Profiles for Novel Inflammatory Bowel Disease Drug. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2016; 2016:22-31. [PMID: 27570643 PMCID: PMC5001780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/06/2023]
Abstract
Inflammatory Bowel Disease (IBD) is a chronic and relapsing disorder that affects millions of people worldwide. Current drug options cannot cure the disease and may cause severe side effects. We developed a systematic framework to identify novel IBD drugs exploiting millions of genomic signatures for chemical compounds. Specifically, we searched all FDA-approved drugs for candidates that share similar genomic profiles with IBD. In the evaluation experiments, our approach ranked approved IBD drugs on average within the top 26% among 858 candidates, significantly outperforming a state-of-the-art genomics-based drug repositioning method (p-value < e-8). Our approach also achieved significantly higher average precision than the state-of-the-art approach in predicting potential IBD drugs from clinical trials (0.072 vs. 0.043, p<0.1) and off-label IBD drugs (0.198 vs. 0.138, p<0.1). Furthermore, we found evidence supporting the therapeutic potential of the top-ranked drugs, such as Naloxone, in the literature and through analysis of target genes and pathways.
Affiliation(s)
- Xiaoshu Cai
- Department of Electrical Engineering and Computer Science, School of Engineering, Case Western Reserve University, Cleveland, Ohio, USA
- Yang Chen
- Department of Epidemiology & Biostatistics, School of Medicine, Case Western Reserve University, Cleveland, Ohio, USA
- Zhen Gao
- Department of Epidemiology & Biostatistics, School of Medicine, Case Western Reserve University, Cleveland, Ohio, USA
- Rong Xu
- Department of Epidemiology & Biostatistics, School of Medicine, Case Western Reserve University, Cleveland, Ohio, USA
|
29
|
Bai T, Gong L, Wang Y, Wang Y, Kulikowski CA, Huang L. A method for exploring implicit concept relatedness in biomedical knowledge network. BMC Bioinformatics 2016; 17 Suppl 9:265. [PMID: 27454167 PMCID: PMC4959351 DOI: 10.1186/s12859-016-1131-5] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Biomedical information and knowledge, structural and non-structural, stored in different repositories can be semantically connected to form a hybrid knowledge network. How to compute relatedness between concepts and discover valuable but implicit information or knowledge from it effectively and efficiently is of paramount importance for precision medicine, and a major challenge facing the biomedical research community. RESULTS In this study, a hybrid biomedical knowledge network is constructed by linking concepts across multiple biomedical ontologies as well as non-structural biomedical knowledge sources. To discover implicit relatedness between concepts in ontologies for which potentially valuable relationships (implicit knowledge) may exist, we developed a Multi-Ontology Relatedness Model (MORM) within the knowledge network, for which a relatedness network (RN) is defined and computed across multiple ontologies using a formal inference mechanism of set-theoretic operations. Semantic constraints are designed and implemented to prune the search space of the relatedness network. CONCLUSIONS Experiments to test examples of several biomedical applications have been carried out, and the evaluation of the results showed an encouraging potential of the proposed approach to biomedical knowledge discovery.
Affiliation(s)
- Tian Bai
- College of Computer Science and Technology, Jilin University, 2699 Qianjin St, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, 2699 Qianjin St, Changchun, China
- Leiguang Gong
- College of Computer Science and Technology, Jilin University, 2699 Qianjin St, Changchun, China
- Yantai Intelligent Information Technologies Ltd., 2699 Qianjin St, Yantai, China
- Ye Wang
- College of Computer Science and Technology, Jilin University, 2699 Qianjin St, Changchun, China
- Yan Wang
- College of Computer Science and Technology, Jilin University, 2699 Qianjin St, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, 2699 Qianjin St, Changchun, China
- Casimir A. Kulikowski
- Department of Computer Science, Rutgers, The State University of New Jersey, 2699 Qianjin St, Piscataway, NJ USA
- Lan Huang
- College of Computer Science and Technology, Jilin University, 2699 Qianjin St, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, 2699 Qianjin St, Changchun, China
|
30
|
Chen Y, Cai X, Xu R. Combining Human Disease Genetics and Mouse Model Phenotypes towards Drug Repositioning for Parkinson's disease. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2015; 2015:1851-60. [PMID: 26958284 PMCID: PMC4765695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Parkinson's disease (PD) is a severe neurodegenerative disorder without effective treatments. Here, we present a novel drug repositioning approach to predict new drugs for PD leveraging both disease genetics and large amounts of mouse model phenotypes. First, we identified PD-specific mouse phenotypes using well-studied human disease genes. Then we searched all FDA-approved drugs for candidates that share similar mouse phenotype profiles with PD. We demonstrated the validity of our approach using drugs that have been approved for PD: 10 approved PD drugs were ranked within top 10% among 1197 candidates. In predicting novel PD drugs, our approach achieved a mean average precision of 0.24, which is significantly higher (p
Affiliation(s)
- Yang Chen
- Department of Electrical Engineering and Computer Science, School of Engineering, Case Western Reserve University, Cleveland, Ohio, USA
- Xiaoshu Cai
- Department of Electrical Engineering and Computer Science, School of Engineering, Case Western Reserve University, Cleveland, Ohio, USA
- Rong Xu
- Department of Epidemiology and Biostatistics, School of Medicine, Case Western Reserve University, Cleveland, Ohio, USA
|
31
|
Wang Q, Xu R. DenguePredict: An Integrated Drug Repositioning Approach towards Drug Discovery for Dengue. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2015; 2015:1279-88. [PMID: 26958268 PMCID: PMC4765554] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Dengue is a viral disease of expanding global incidence without cures. Here we present a drug repositioning system (DenguePredict) leveraging upon a unique drug treatment database and vast amounts of disease- and drug-related data. We first constructed a large-scale genetic disease network with enriched dengue genetics data curated from biomedical literature. We applied a network-based ranking algorithm to find dengue-related diseases from the disease network. We then developed a novel algorithm to prioritize FDA-approved drugs from dengue-related diseases to treat dengue. When tested in a de-novo validation setting, DenguePredict found the only two drugs tested in clinical trials for treating dengue and ranked them highly: chloroquine ranked at top 0.96% and ivermectin at top 22.75%. We showed that drugs targeting immune systems and arachidonic acid metabolism-related apoptotic pathways might represent innovative drugs to treat dengue. In summary, DenguePredict, by combining comprehensive disease- and drug-related data and novel algorithms, may greatly facilitate drug discovery for dengue.
Affiliation(s)
- Rong Xu
- Department of Epidemiology and Biostatistics, Institute of Computational Biology, School of Medicine, Case Western Reserve University, Cleveland, OH
|
32
|
Cheng L, Li J, Hu Y, Jiang Y, Liu Y, Chu Y, Wang Z, Wang Y. Using Semantic Association to Extend and Infer Literature-Oriented Relativity Between Terms. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2015; 12:1219-1226. [PMID: 26684460 DOI: 10.1109/tcbb.2015.2430289] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/05/2023]
Abstract
Relative terms often appear together in the literature. Methods have been presented for weighting relativity of pairwise terms by their co-occurring literature and inferring new relationship. Terms in the literature are also in the directed acyclic graph of ontologies, such as Gene Ontology and Disease Ontology. Therefore, semantic association between terms may help for establishing relativities between terms in literature. However, current methods do not use these associations. In this paper, an adjusted R-scaled score (ARSS) based on information content (ARSSIC) method is introduced to infer new relationship between terms. First, set inclusion relationship between terms of ontology was exploited to extend relationships between these terms and literature. Next, the ARSS method was presented to measure relativity between terms across ontologies according to these extensional relationships. Then, the ARSSIC method using ratios of information shared of term's ancestors was designed to infer new relationship between terms across ontologies. The result of the experiment shows that ARSS identified more pairs of statistically significant terms based on corresponding gene sets than other methods. And the high average area under the receiver operating characteristic curve (0.9293) shows that ARSSIC achieved a high true positive rate and a low false positive rate. Data is available at http://mlg.hit.edu.cn/ARSSIC/.
|
33
|
Xu R, Wang Q. PhenoPredict: A disease phenome-wide drug repositioning approach towards schizophrenia drug discovery. J Biomed Inform 2015; 56:348-55. [PMID: 26151312 PMCID: PMC4589865 DOI: 10.1016/j.jbi.2015.06.027] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2014] [Revised: 06/26/2015] [Accepted: 06/29/2015] [Indexed: 01/26/2023]
Abstract
Schizophrenia (SCZ) is a common complex disorder with poorly understood mechanisms and no effective drug treatments. Despite the high prevalence and vast unmet medical need represented by the disease, many drug companies have moved away from the development of drugs for SCZ. Therefore, alternative strategies are needed for the discovery of truly innovative drug treatments for SCZ. Here, we present a disease phenome-driven computational drug repositioning approach for SCZ. We developed a novel drug repositioning system, PhenoPredict, by inferring drug treatments for SCZ from diseases that are phenotypically related to SCZ. The key to PhenoPredict is the availability of a comprehensive drug treatment knowledge base that we recently constructed. PhenoPredict retrieved all 18 FDA-approved SCZ drugs and ranked them highly (recall=1.0, and average ranking of 8.49%). When compared to PREDICT, one of the most comprehensive drug repositioning systems currently available, in novel predictions, PhenoPredict represented clear improvements over PREDICT in Precision-Recall (PR) curves, with a significant 98.8% improvement in the area under curve (AUC) of the PR curves. In addition, we discovered many drug candidates with mechanisms of action fundamentally different from traditional antipsychotics, some of which had published literature evidence indicating their treatment benefits in SCZ patients. In summary, although the fundamental pathophysiological mechanisms of SCZ remain unknown, integrated systems approaches to studying phenotypic connections among diseases may facilitate the discovery of innovative SCZ drugs.
Affiliation(s)
- Rong Xu
- Department of Epidemiology and Biostatistics, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States.
- QuanQiu Wang
- ThinTek, LLC, Palo Alto, CA 94306, United States.
|
34
|
Abstract
The growing body of transcriptomic, proteomic, metabolomic and genomic data generated from disease states provides a great opportunity to improve our current understanding of the molecular mechanisms driving diseases and shared between diseases. The use of both clinical and molecular phenotypes will lead to better disease understanding and classification. In this study, we set out to gain novel insights into diseases and their relationships by utilising knowledge gained from system-level molecular data. We integrated different types of biological data including genome-wide association studies data, disease-chemical associations, biological pathways and Gene Ontology annotations into an Integrated Disease Network (IDN), a heterogeneous network where nodes are bio-entities and edges between nodes represent their associations. We also introduced a novel disease similarity measure to infer disease-disease associations from the IDN. Our predicted associations were systemically evaluated against the Medical Subject Heading classification and a statistical measure of disease co-occurrence in PubMed. The strong correlation between our predictions and co-occurrence associations indicated the ability of our approach to recover known disease associations. Furthermore, we presented a case study of Crohn's disease. We demonstrated that our approach not only identified well-established connections between Crohn's disease and other diseases, but also revealed new, interesting connections consistent with emerging literature. Our approach also enabled ready access to the knowledge supporting these new connections, making this a powerful approach for exploring connections between diseases.
Affiliation(s)
- Kai Sun
- Department of Computing, Imperial College London, London, SW7 2AZ, UK.
|
35
|
Alnazzawi N, Thompson P, Batista-Navarro R, Ananiadou S. Using text mining techniques to extract phenotypic information from the PhenoCHF corpus. BMC Med Inform Decis Mak 2015; 15 Suppl 2:S3. [PMID: 26099853 PMCID: PMC4474585 DOI: 10.1186/1472-6947-15-s2-s3] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
Background Phenotypic information locked away in unstructured narrative text presents significant barriers to information accessibility, both for clinical practitioners and for computerised applications used for clinical research purposes. Text mining (TM) techniques have previously been applied successfully to extract different types of information from text in the biomedical domain. They have the potential to be extended to allow the extraction of information relating to phenotypes from free text. Methods To stimulate the development of TM systems that are able to extract phenotypic information from text, we have created a new corpus (PhenoCHF) that is annotated by domain experts with several types of phenotypic information relating to congestive heart failure. To ensure that systems developed using the corpus are robust to multiple text types, it integrates text from heterogeneous sources, i.e., electronic health records (EHRs) and scientific articles from the literature. We have developed several different phenotype extraction methods to demonstrate the utility of the corpus, and tested these methods on a further corpus, i.e., ShARe/CLEF 2013. Results Evaluation of our automated methods showed that PhenoCHF can facilitate the training of reliable phenotype extraction systems, which are robust to variations in text type. These results have been reinforced by evaluating our trained systems on the ShARe/CLEF corpus, which contains clinical records of various types. Like other studies within the biomedical domain, we found that solutions based on conditional random fields produced the best results, when coupled with a rich feature set. Conclusions PhenoCHF is the first annotated corpus aimed at encoding detailed phenotypic information. The unique heterogeneous composition of the corpus has been shown to be advantageous in the training of systems that can accurately extract phenotypic information from a range of different text types. 
Although the scope of our annotation is currently limited to a single disease, the promising results achieved can stimulate further work into the extraction of phenotypic information for other diseases. The PhenoCHF annotation guidelines and annotations are publicly available at https://code.google.com/p/phenochf-corpus.
|
36
|
Analysis of the human diseasome using phenotype similarity between common, genetic, and infectious diseases. Sci Rep 2015; 5:10888. [PMID: 26051359 PMCID: PMC4458913 DOI: 10.1038/srep10888] [Citation(s) in RCA: 72] [Impact Index Per Article: 7.2] [Received: 10/27/2014] [Accepted: 04/22/2015] [Indexed: 01/29/2023] Open
Abstract
Phenotypes are the observable characteristics of an organism arising from its response to the environment. Phenotypes associated with engineered and natural genetic variation are widely recorded using phenotype ontologies in model organisms, as are signs and symptoms of human Mendelian diseases in databases such as OMIM and Orphanet. Exploiting these resources, several computational methods have been developed for integration and analysis of phenotype data to identify the genetic etiology of diseases or suggest plausible interventions. A similar resource would be highly useful not only for rare and Mendelian diseases, but also for common, complex and infectious diseases. We apply a semantic text-mining approach to identify the phenotypes (signs and symptoms) associated with over 6,000 diseases. We evaluate our text-mined phenotypes by demonstrating that they can correctly identify known disease-associated genes in mice and humans with high accuracy. Using a phenotypic similarity measure, we generate a human disease network in which diseases that have similar signs and symptoms cluster together, and we use this network to identify closely related diseases based on common etiological, anatomical as well as physiological underpinnings.
|
37
|
Xu R, Wang Q. Large-scale automatic extraction of side effects associated with targeted anticancer drugs from full-text oncological articles. J Biomed Inform 2015; 55:64-72. [PMID: 25817969 DOI: 10.1016/j.jbi.2015.03.009] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 09/29/2014] [Revised: 02/12/2015] [Accepted: 03/20/2015] [Indexed: 10/23/2022]
Abstract
Targeted anticancer drugs such as imatinib, trastuzumab and erlotinib have dramatically improved treatment outcomes in cancer patients; however, these innovative agents are often associated with unexpected side effects. The pathophysiological mechanisms underlying these side effects are not well understood. The availability of a comprehensive knowledge base of side effects associated with targeted anticancer drugs has the potential to illuminate complex pathways underlying toxicities induced by these innovative drugs. While side effect association knowledge for targeted drugs exists in multiple heterogeneous data sources, published full-text oncological articles represent an important source of pivotal, investigational, and even failed trials in a variety of patient populations. In this study, we present an automatic process to extract targeted anticancer drug-associated side effects (drug-SE pairs) from a large number of high-profile full-text oncological articles. We downloaded 13,855 full-text articles from the Journal of Clinical Oncology (JCO) published between 1983 and 2013. We developed text classification, relationship extraction, signaling filtering, and signal prioritization algorithms to extract drug-SE pairs from downloaded articles. We extracted a total of 26,264 drug-SE pairs with an average precision of 0.405, a recall of 0.899, and an F1 score of 0.465. We show that side effect knowledge from JCO articles is largely complementary to that from the US Food and Drug Administration (FDA) drug labels. Through integrative correlation analysis, we show that targeted drug-associated side effects positively correlate with their gene targets and disease indications. In conclusion, this unique database that we built from a large number of high-profile oncological articles could facilitate the development of computational models to understand toxic effects associated with targeted anticancer drugs.
Affiliation(s)
- Rong Xu: Medical Informatics Program, Center for Clinical Investigation, Case Western Reserve University, Cleveland, OH 44106, United States.
- QuanQiu Wang: ThinTek, LLC, Palo Alto, CA 94306, United States.
|
38
|
Xu R, Wang Q. Comparing a knowledge-driven approach to a supervised machine learning approach in large-scale extraction of drug-side effect relationships from free-text biomedical literature. BMC Bioinformatics 2015; 16 Suppl 5:S6. [PMID: 25860223 PMCID: PMC4402591 DOI: 10.1186/1471-2105-16-s5-s6] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Indexed: 12/16/2022] Open
Abstract
BACKGROUND Systems approaches to studying drug-side-effect (drug-SE) associations are emerging as an active research area for both drug target discovery and drug repositioning. However, a comprehensive drug-SE association knowledge base does not exist. In this study, we present a novel knowledge-driven (KD) approach to effectively extract a large number of drug-SE pairs from published biomedical literature. DATA AND METHODS For the text corpus, we used 21,354,075 MEDLINE records (119,085,682 sentences). First, we used known drug-SE associations derived from FDA drug labels as prior knowledge to automatically find SE-related sentences and abstracts. We then extracted a total of 49,575 drug-SE pairs from MEDLINE sentences and 180,454 pairs from abstracts. RESULTS On average, the KD approach achieved a precision of 0.335, a recall of 0.509, and an F1 of 0.392, which is significantly better than an SVM-based machine learning approach (precision: 0.135, recall: 0.900, F1: 0.233), with a 73.0% increase in F1 score. Through integrative analysis, we demonstrate that the higher-level phenotypic drug-SE relationships reflect lower-level genetic, genomic, and chemical drug mechanisms. In addition, we show that the extracted drug-SE pairs can be directly used in drug repositioning. CONCLUSION In summary, we automatically constructed a large-scale, higher-level drug phenotype relationship knowledge base, which can have great potential in computational drug discovery.
|
39
|
Shyr C, Tarailo-Graovac M, Gottlieb M, Lee JJY, van Karnebeek C, Wasserman WW. FLAGS, frequently mutated genes in public exomes. BMC Med Genomics 2014; 7:64. [PMID: 25466818 PMCID: PMC4267152 DOI: 10.1186/s12920-014-0064-y] [Citation(s) in RCA: 102] [Impact Index Per Article: 9.3] [Received: 06/16/2014] [Accepted: 10/24/2014] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Dramatic improvements in DNA-sequencing technologies and computational analyses have led to wide use of whole exome sequencing (WES) to identify the genetic basis of Mendelian disorders. More than 180 novel rare-disease-causing genes with Mendelian inheritance patterns have been discovered through sequencing the exomes of just a few unrelated individuals or family members. As rare/novel genetic variants continue to be uncovered, there is a major challenge in distinguishing true pathogenic variants from rare benign mutations. METHODS We used publicly available exome cohorts, together with the dbSNP database, to derive a list of genes (n = 100) that most frequently exhibit rare (<1%) non-synonymous/splice-site variants in general populations. We termed these genes FLAGS for FrequentLy mutAted GeneS and analyzed their properties. RESULTS Analysis of FLAGS revealed that these genes have significantly longer protein coding sequences, a greater number of paralogs, and display less evolutionary selective pressure than expected. FLAGS are more frequently reported in PubMed clinical literature and more frequently associated with diseased phenotypes compared to the set of human protein-coding genes. We demonstrated an overlap between FLAGS and the rare-disease-causing genes recently discovered through WES studies (n = 10) and the need for replication studies and rigorous statistical and biological analyses when associating FLAGS with rare disease. Finally, we showed how FLAGS are applied in a disease-causing variant prioritization approach on exome data from a family affected by an unknown rare genetic disorder. CONCLUSIONS We showed that some genes are frequently affected by rare, likely functional variants in the general population, and are frequently observed in WES studies analyzing diverse rare phenotypes. We found that the rate at which genes accumulate rare mutations is beneficial information for prioritizing candidates. We provided a ranking system based on the mutation accumulation rates for prioritizing exome-captured human genes, and propose that clinical reports associating any disease/phenotype with FLAGS be evaluated with extra caution.
Affiliation(s)
- Casper Shyr: Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Vancouver, BC, Canada; Treatable Intellectual Disability Endeavour in British Columbia, Vancouver, Canada; Bioinformatics Graduate Program, University of British Columbia, Vancouver, BC, Canada.
- Maja Tarailo-Graovac: Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Vancouver, BC, Canada; Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada; Treatable Intellectual Disability Endeavour in British Columbia, Vancouver, Canada.
- Michael Gottlieb: Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Vancouver, BC, Canada.
- Jessica J Y Lee: Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Vancouver, BC, Canada; Genome Science and Technology Graduate Program, University of British Columbia, Vancouver, BC, Canada.
- Clara van Karnebeek: Treatable Intellectual Disability Endeavour in British Columbia, Vancouver, Canada; Division of Biochemical Diseases, BC Children's Hospital, Vancouver, BC, Canada; Department of Pediatrics, University of British Columbia, Vancouver, BC, Canada.
- Wyeth W Wasserman: Centre for Molecular Medicine and Therapeutics, Child and Family Research Institute, Vancouver, BC, Canada; Department of Medical Genetics, University of British Columbia, Vancouver, BC, Canada; Treatable Intellectual Disability Endeavour in British Columbia, Vancouver, Canada.
|
40
|
Chen Y, Zhang X, Zhang GQ, Xu R. Comparative analysis of a novel disease phenotype network based on clinical manifestations. J Biomed Inform 2014; 53:113-20. [PMID: 25277758 DOI: 10.1016/j.jbi.2014.09.007] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Received: 06/16/2014] [Revised: 08/18/2014] [Accepted: 09/21/2014] [Indexed: 12/21/2022]
Abstract
Systems approaches to analyzing disease phenotype networks in combination with protein functional interaction networks have great potential in illuminating disease pathophysiological mechanisms. While many genetic networks are readily available, disease phenotype networks remain largely incomplete. In this study, we built a large-scale Disease Manifestation Network (DMN) from 50,543 highly accurate disease-manifestation semantic relationships in the Unified Medical Language System (UMLS). Our new phenotype network contains 2305 nodes and 373,527 weighted edges to represent the disease phenotypic similarities. We first compared DMN with the networks representing genetic relationships among diseases, and demonstrated that the phenotype clustering in DMN reflects common disease genetics. Then we compared DMN with a widely used disease phenotype network in previous gene discovery studies, called mimMiner, which was extracted from the textual descriptions in Online Mendelian Inheritance in Man (OMIM). We demonstrated that DMN contains different knowledge from the existing phenotype data source. Finally, a case study on Marfan syndrome further showed that DMN contains useful information and can provide leads to discover unknown disease causes. Integrating DMN in systems approaches with mimMiner and other data offers opportunities to predict novel disease genetics. We made DMN publicly available at nlp.case.edu/public/data/DMN.
Affiliation(s)
- Yang Chen: Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, United States; Division of Medical Informatics, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States.
- Xiang Zhang: Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, United States.
- Guo-Qiang Zhang: Department of Electrical Engineering and Computer Science, Case Western Reserve University, Cleveland, OH 44106, United States; Division of Medical Informatics, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States.
- Rong Xu: Division of Medical Informatics, School of Medicine, Case Western Reserve University, Cleveland, OH 44106, United States.
|
41
|
Xu R, Wang Q. Automatic construction of a large-scale and accurate drug-side-effect association knowledge base from biomedical literature. J Biomed Inform 2014; 51:191-9. [PMID: 24928448 DOI: 10.1016/j.jbi.2014.05.013] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.3] [Received: 02/09/2014] [Revised: 05/28/2014] [Accepted: 05/30/2014] [Indexed: 12/13/2022]
Abstract
Systems approaches to studying drug-side-effect (drug-SE) associations are emerging as an active research area for drug target discovery, drug repositioning, and drug toxicity prediction. However, currently available drug-SE association databases are far from complete. Herein, in an effort to increase the data completeness of current drug-SE relationship resources, we present an automatic learning approach to accurately extract drug-SE pairs from the vast amount of published biomedical literature, a rich knowledge source of side effect information for commercial, experimental, and even failed drugs. For the text corpus, we used 119,085,682 MEDLINE sentences and their parse trees. We used known drug-SE associations derived from US Food and Drug Administration (FDA) drug labels as prior knowledge to find relevant sentences and parse trees. We extracted syntactic patterns associated with drug-SE pairs from the resulting set of parse trees. We developed pattern-ranking algorithms to prioritize drug-SE-specific patterns. We then selected a set of patterns with both high precision and high recall in order to extract drug-SE pairs from the entire MEDLINE. In total, we extracted 38,871 drug-SE pairs from MEDLINE using the learned patterns, the majority of which have not been captured in FDA drug labels to date. On average, our knowledge-driven pattern-learning approach to extracting drug-SE pairs from MEDLINE achieved a precision of 0.833, a recall of 0.407, and an F1 of 0.545. We compared our approach to a support vector machine (SVM)-based machine learning approach and a co-occurrence statistics-based approach. We show that the pattern-learning approach is largely complementary to the SVM- and co-occurrence-based approaches, with significantly higher precision and F1 but lower recall. We demonstrated by correlation analysis that the extracted drug side effects correlate positively with drug targets, metabolism, and indications.
Affiliation(s)
- Rong Xu: Medical Informatics Program, Center for Clinical Investigation, Case Western Reserve University, Cleveland, OH 44106, United States.
- QuanQiu Wang: ThinTek, LLC, Palo Alto, CA 94306, United States.
|
42
|
Xu R, Li L, Wang Q. dRiskKB: a large-scale disease-disease risk relationship knowledge base constructed from biomedical text. BMC Bioinformatics 2014; 15:105. [PMID: 24725842 PMCID: PMC3998061 DOI: 10.1186/1471-2105-15-105] [Citation(s) in RCA: 35] [Impact Index Per Article: 3.2] [Received: 10/22/2013] [Accepted: 04/07/2014] [Indexed: 11/14/2022] Open
Abstract
BACKGROUND Discerning the genetic contributions to complex human diseases is a challenging mandate that demands new types of data and calls for new avenues for advancing the state of the art in computational approaches to uncovering disease etiology. Systems approaches to studying observable phenotypic relationships among diseases are emerging as an active area of research for both novel disease gene discovery and drug repositioning. Currently, systematic study of disease relationships on a phenome-wide scale is limited due to the lack of large-scale machine-understandable disease phenotype relationship knowledge bases. Our study introduces a semi-supervised iterative pattern learning approach that is used to build a precise, large-scale disease-disease risk relationship (D1 → D2) knowledge base (dRiskKB) from a vast corpus of free-text published biomedical literature. RESULTS 21,354,075 MEDLINE records comprised the text corpus under study. First, we used one typical disease risk-specific syntactic pattern (i.e. "D1 due to D2") as a seed to automatically discover other patterns specifying similar semantic relationships among diseases. We then extracted D1 → D2 risk pairs from MEDLINE using the learned patterns. We manually evaluated the precision of the learned patterns and extracted pairs. Finally, we analyzed the correlations between disease-disease risk pairs and their associated genes and drugs. The newly created dRiskKB consists of a total of 34,448 unique D1 → D2 pairs, representing the risk-specific semantic relationships among 12,981 diseases with each disease linked to its associated genes and drugs. The identified patterns are highly precise (average precision of 0.99) in specifying the risk-specific relationships among diseases. The precision of extracted pairs is 0.919 for those that are exactly matched and 0.988 for those that are partially matched. By comparing the iterative pattern approach starting from different seeds, we demonstrated that our algorithm is robust in terms of seed choice. We show that diseases and their risk diseases as well as diseases with similar risk profiles tend to share both genes and drugs. CONCLUSIONS This unique dRiskKB, when combined with existing phenotypic, genetic, and genomic datasets, can have profound implications for our deeper understanding of disease etiology and for drug repositioning.
Affiliation(s)
- Rong Xu: Medical Informatics Division, Case Western Reserve University, Cleveland, OH, USA.
- Li Li: Departments of Family Medicine and Community Health, Epidemiology and Biostatistics, Case Western Reserve University, Cleveland, OH, USA.
|