1
|
Stock M, Van Criekinge W, Boeckaerts D, Taelman S, Van Haeverbeke M, Dewulf P, De Baets B. Hyperdimensional computing: A fast, robust, and interpretable paradigm for biological data. PLoS Comput Biol 2024; 20:e1012426. [PMID: 39316621 PMCID: PMC11421772 DOI: 10.1371/journal.pcbi.1012426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/26/2024] Open
Abstract
Advances in bioinformatics are primarily due to new algorithms for processing diverse biological data sources. While sophisticated alignment algorithms have been pivotal in analyzing biological sequences, deep learning has substantially transformed bioinformatics, addressing sequence, structure, and functional analyses. However, these methods are incredibly data-hungry, compute-intensive, and hard to interpret. Hyperdimensional computing (HDC) has recently emerged as an exciting alternative. The key idea is that random vectors of high dimensionality can represent concepts such as sequence identity or phylogeny. These vectors can then be combined using simple operators for learning, reasoning, or querying by exploiting the peculiar properties of high-dimensional spaces. Our work reviews and explores HDC's potential for bioinformatics, emphasizing its efficiency, interpretability, and adeptness in handling multimodal and structured data. HDC holds great potential for various omics data searching, biosignal analysis, and health applications.
Collapse
Affiliation(s)
- Michiel Stock
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| | - Wim Van Criekinge
- Biobix Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| | - Dimitri Boeckaerts
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Laboratory of Applied Biotechnology, Department of Biotechnology, Ghent University, Ghent, Belgium
| | - Steff Taelman
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Biobix Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- BioLizard nv, Ghent, Belgium
| | - Maxime Van Haeverbeke
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| | - Pieter Dewulf
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| | - Bernard De Baets
- KERMIT Research Unit, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
| |
Collapse
|
2
|
Malec SA, Taneja SB, Albert SM, Elizabeth Shaaban C, Karim HT, Levine AS, Munro P, Callahan TJ, Boyce RD. Causal feature selection using a knowledge graph combining structured knowledge from the biomedical literature and ontologies: A use case studying depression as a risk factor for Alzheimer's disease. J Biomed Inform 2023; 142:104368. [PMID: 37086959 PMCID: PMC10355339 DOI: 10.1016/j.jbi.2023.104368] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/19/2022] [Revised: 03/03/2023] [Accepted: 04/17/2023] [Indexed: 04/24/2023]
Abstract
BACKGROUND Causal feature selection is essential for estimating effects from observational data. Identifying confounders is a crucial step in this process. Traditionally, researchers employ content-matter expertise and literature review to identify confounders. Uncontrolled confounding from unidentified confounders threatens validity, conditioning on intermediate variables (mediators) weakens estimates, and conditioning on common effects (colliders) induces bias. Additionally, without special treatment, erroneous conditioning on variables combining roles introduces bias. However, the vast literature is growing exponentially, making it infeasible to assimilate this knowledge. To address these challenges, we introduce a novel knowledge graph (KG) application enabling causal feature selection by combining computable literature-derived knowledge with biomedical ontologies. We present a use case of our approach specifying a causal model for estimating the total causal effect of depression on the risk of developing Alzheimer's disease (AD) from observational data. METHODS We extracted computable knowledge from a literature corpus using three machine reading systems and inferred missing knowledge using logical closure operations. Using a KG framework, we mapped the output to target terminologies and combined it with ontology-grounded resources. We translated epidemiological definitions of confounder, collider, and mediator into queries for searching the KG and summarized the roles played by the identified variables. We compared the results with output from a complementary method and published observational studies and examined a selection of confounding and combined role variables in-depth. RESULTS Our search identified 128 confounders, including 58 phenotypes, 47 drugs, 35 genes, 23 collider, and 16 mediator phenotypes. However, only 31 of the 58 confounder phenotypes were found to behave exclusively as confounders, while the remaining 27 phenotypes played other roles. Obstructive sleep apnea emerged as a potential novel confounder for depression and AD. Anemia exemplified a variable playing combined roles. CONCLUSION Our findings suggest combining machine reading and KG could augment human expertise for causal feature selection. However, the complexity of causal feature selection for depression with AD highlights the need for standardized field-specific databases of causal variables. Further work is needed to optimize KG search and transform the output for human consumption.
Collapse
Affiliation(s)
- Scott A Malec
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
| | - Sanya B Taneja
- Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
| | - Steven M Albert
- Department of Behavioral and Community Health Sciences, School of Public Health, University of Pittsburgh, Pittsburgh, PA, USA
| | - C Elizabeth Shaaban
- Department of Epidemiology, School of Public Health, University of Pittsburgh, Pittsburgh, PA, USA
| | - Helmet T Karim
- Department of Psychiatry, University of Pittsburgh, Pittsburgh, PA, USA; Department of Bioengineering, University of Pittsburgh, Pittsburgh, PA, USA
| | - Arthur S Levine
- Department of Neurobiology, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA; The Brain Institute, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA
| | - Paul Munro
- School of Computing and Information, University of Pittsburgh, Pittsburgh, PA, USA
| | - Tiffany J Callahan
- Department of Biomedical Informatics, Columbia University, New York, NY, USA
| | - Richard D Boyce
- Department of Biomedical Informatics, School of Medicine, University of Pittsburgh, Pittsburgh, PA, USA; Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA, USA
| |
Collapse
|
3
|
Kumar S, Nanelia A, Mariappan R, Rajagopal A, Rajan V. Patient Representation Learning From Heterogeneous Data Sources and Knowledge Graphs Using Deep Collective Matrix Factorization: Evaluation Study. JMIR Med Inform 2022; 10:e28842. [PMID: 35049514 PMCID: PMC8814927 DOI: 10.2196/28842] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2021] [Revised: 11/07/2021] [Accepted: 11/14/2021] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND Patient representation learning aims to learn features, also called representations, from input sources automatically, often in an unsupervised manner, for use in predictive models. This obviates the need for cumbersome, time- and resource-intensive manual feature engineering, especially from unstructured data such as text, images, or graphs. Most previous techniques have used neural network-based autoencoders to learn patient representations, primarily from clinical notes in electronic medical records (EMRs). Knowledge graphs (KGs), with clinical entities as nodes and their relations as edges, can be extracted automatically from biomedical literature and provide complementary information to EMR data that have been found to provide valuable predictive signals. OBJECTIVE This study aims to evaluate the efficacy of collective matrix factorization (CMF), both the classical variant and a recent neural architecture called deep CMF (DCMF), in integrating heterogeneous data sources from EMR and KG to obtain patient representations for clinical decision support tasks. METHODS Using a recent formulation for obtaining graph representations through matrix factorization within the context of CMF, we infused auxiliary information during patient representation learning. We also extended the DCMF architecture to create a task-specific end-to-end model that learns to simultaneously find effective patient representations and predictions. We compared the efficacy of such a model to that of first learning unsupervised representations and then independently learning a predictive model. We evaluated patient representation learning using CMF-based methods and autoencoders for 2 clinical decision support tasks on a large EMR data set. RESULTS Our experiments show that DCMF provides a seamless way for integrating multiple sources of data to obtain patient representations, both in unsupervised and supervised settings. Its performance in single-source settings is comparable with that of previous autoencoder-based representation learning methods. When DCMF is used to obtain representations from a combination of EMR and KG, where most previous autoencoder-based methods cannot be used directly, its performance is superior to that of previous nonneural methods for CMF. Infusing information from KGs into patient representations using DCMF was found to improve downstream predictive performance. CONCLUSIONS Our experiments indicate that DCMF is a versatile model that can be used to obtain representations from single and multiple data sources and combine information from EMR data and KGs. Furthermore, DCMF can be used to learn representations in both supervised and unsupervised settings. Thus, DCMF offers an effective way of integrating heterogeneous data sources and infusing auxiliary knowledge into patient representations.
Collapse
Affiliation(s)
| | - Alicia Nanelia
- Department of Information Systems and Analytics, National University of Singapore, Singapore, Singapore
| | - Ragunathan Mariappan
- Department of Information Systems and Analytics, National University of Singapore, Singapore, Singapore
| | | | - Vaibhav Rajan
- Department of Information Systems and Analytics, National University of Singapore, Singapore, Singapore
| |
Collapse
|
4
|
Dasgupta S, Jayagopal A, Jun Hong AL, Mariappan R, Rajan V. Adverse Drug Event Prediction Using Noisy Literature-Derived Knowledge Graphs: Algorithm Development and Validation. JMIR Med Inform 2021; 9:e32730. [PMID: 34694230 PMCID: PMC8576589 DOI: 10.2196/32730] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2021] [Revised: 09/07/2021] [Accepted: 09/18/2021] [Indexed: 01/19/2023] Open
Abstract
BACKGROUND Adverse drug events (ADEs) are unintended side effects of drugs that cause substantial clinical and economic burdens globally. Not all ADEs are discovered during clinical trials; therefore, postmarketing surveillance, called pharmacovigilance, is routinely conducted to find unknown ADEs. A wealth of information, which facilitates ADE discovery, lies in the growing body of biomedical literature. Knowledge graphs (KGs) encode information from the literature, where the vertices and the edges represent clinical concepts and their relations, respectively. The scale and unstructured form of the literature necessitates the use of natural language processing (NLP) to automatically create such KGs. Previous studies have demonstrated the utility of such literature-derived KGs in ADE prediction. Through unsupervised learning of the representations (features) of clinical concepts from the KG, which are used in machine learning models, state-of-the-art results for ADE prediction were obtained on benchmark data sets. OBJECTIVE Due to the use of NLP to infer literature-derived KGs, there is noise in the form of false positive (erroneous) and false negative (absent) nodes and edges. Previous representation learning methods do not account for such inaccuracies in the graph. NLP algorithms can quantify the confidence in their inference of extracted concepts and relations from the literature. Our hypothesis, which motivates this work, is that by using such confidence scores during representation learning, the learned embeddings would yield better features for ADE prediction models. METHODS We developed methods to use these confidence scores on two well-known representation learning methods-DeepWalk and Translating Embeddings for Modeling Multi-relational Data (TransE)-to develop their weighted versions: Weighted DeepWalk and Weighted TransE. These methods were used to learn representations from a large literature-derived KG, the Semantic MEDLINE Database, which contains more than 93 million clinical relations. They were compared with Embedding of Semantic Predications, which, to our knowledge, is the best reported representation learning method using the Semantic MEDLINE Database with state-of-the-art results for ADE prediction. Representations learned from different methods were used (separately) as features of drugs and diseases to build classification models for ADE prediction using benchmark data sets. The methods were compared rigorously over multiple cross-validation settings. RESULTS The weighted versions we designed were able to learn representations that yielded more accurate predictive models than the corresponding unweighted versions of both DeepWalk and TransE, as well as Embedding of Semantic Predications, in our experiments. There were performance improvements of up to 5.75% in the F1-score and 8.4% in the area under the receiver operating characteristic curve value, thus advancing the state of the art in ADE prediction from literature-derived KGs. CONCLUSIONS Our classification models can be used to aid pharmacovigilance teams in detecting potentially new ADEs. Our experiments demonstrate the importance of modeling inaccuracies in the inferred KGs for representation learning.
Collapse
Affiliation(s)
| | | | - Abel Lim Jun Hong
- School of Computing, National University of Singapore, Singapore, Singapore
| | | | - Vaibhav Rajan
- Department of Information Systems and Analytics, National University of Singapore, Singapore, Singapore
| |
Collapse
|
5
|
Malec SA, Wei P, Bernstam EV, Boyce RD, Cohen T. Using computable knowledge mined from the literature to elucidate confounders for EHR-based pharmacovigilance. J Biomed Inform 2021; 117:103719. [PMID: 33716168 PMCID: PMC8559730 DOI: 10.1016/j.jbi.2021.103719] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2020] [Revised: 12/31/2020] [Accepted: 01/04/2021] [Indexed: 10/21/2022]
Abstract
INTRODUCTION Drug safety research asks causal questions but relies on observational data. Confounding bias threatens the reliability of studies using such data. The successful control of confounding requires knowledge of variables called confounders affecting both the exposure and outcome of interest. However, causal knowledge of dynamic biological systems is complex and challenging. Fortunately, computable knowledge mined from the literature may hold clues about confounders. In this paper, we tested the hypothesis that incorporating literature-derived confounders can improve causal inference from observational data. METHODS We introduce two methods (semantic vector-based and string-based confounder search) that query literature-derived information for confounder candidates to control, using SemMedDB, a database of computable knowledge mined from the biomedical literature. These methods search SemMedDB for confounders by applying semantic constraint search for indications treated by the drug (exposure) and that are also known to cause the adverse event (outcome). We then include the literature-derived confounder candidates in statistical and causal models derived from free-text clinical notes. For evaluation, we use a reference dataset widely used in drug safety containing labeled pairwise relationships between drugs and adverse events and attempt to rediscover these relationships from a corpus of 2.2 M NLP-processed free-text clinical notes. We employ standard adjustment and causal inference procedures to predict and estimate causal effects by informing the models with varying numbers of literature-derived confounders and instantiating the exposure, outcome, and confounder variables in the models with dichotomous EHR-derived data. Finally, we compare the results from applying these procedures with naive measures of association (χ2 and reporting odds ratio) and with each other. RESULTS AND CONCLUSIONS We found semantic vector-based search to be superior to string-based search at reducing confounding bias. However, the effect of including more rather than fewer literature-derived confounders was inconclusive. We recommend using targeted learning estimation methods that can address treatment-confounder feedback, where confounders also behave as intermediate variables, and engaging subject-matter experts to adjudicate the handling of problematic covariates.
Collapse
Affiliation(s)
- Scott A Malec
- University of Pittsburgh School of Medicine, Department of Biomedical Informatics, Pittsburgh, PA, United States.
| | - Peng Wei
- The University of Texas MD Anderson Cancer Center, Department of Biostatistics, Houston, TX, United States
| | - Elmer V Bernstam
- University of Texas Health Science Center at Houston, School of Biomedical Informatics, Houston, TX, United States
| | - Richard D Boyce
- University of Pittsburgh School of Medicine, Department of Biomedical Informatics, Pittsburgh, PA, United States
| | - Trevor Cohen
- University of Washington, Department of Biomedical Informatics and Medical Education, Seattle, WA, United States
| |
Collapse
|
6
|
Zhang R, Hristovski D, Schutte D, Kastrin A, Fiszman M, Kilicoglu H. Drug repurposing for COVID-19 via knowledge graph completion. J Biomed Inform 2021; 115:103696. [PMID: 33571675 PMCID: PMC7869625 DOI: 10.1016/j.jbi.2021.103696] [Citation(s) in RCA: 73] [Impact Index Per Article: 18.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2020] [Revised: 12/23/2020] [Accepted: 02/01/2021] [Indexed: 02/07/2023]
Abstract
OBJECTIVE To discover candidate drugs to repurpose for COVID-19 using literature-derived knowledge and knowledge graph completion methods. METHODS We propose a novel, integrative, and neural network-based literature-based discovery (LBD) approach to identify drug candidates from PubMed and other COVID-19-focused research literature. Our approach relies on semantic triples extracted using SemRep (via SemMedDB). We identified an informative and accurate subset of semantic triples using filtering rules and an accuracy classifier developed on a BERT variant. We used this subset to construct a knowledge graph, and applied five state-of-the-art, neural knowledge graph completion algorithms (i.e., TransE, RotatE, DistMult, ComplEx, and STELP) to predict drug repurposing candidates. The models were trained and assessed using a time slicing approach and the predicted drugs were compared with a list of drugs reported in the literature and evaluated in clinical trials. These models were complemented by a discovery pattern-based approach. RESULTS Accuracy classifier based on PubMedBERT achieved the best performance (F1 = 0.854) in identifying accurate semantic predications. Among five knowledge graph completion models, TransE outperformed others (MR = 0.923, Hits@1 = 0.417). Some known drugs linked to COVID-19 in the literature were identified, as well as others that have not yet been studied. Discovery patterns enabled identification of additional candidate drugs and generation of plausible hypotheses regarding the links between the candidate drugs and COVID-19. Among them, five highly ranked and novel drugs (i.e., paclitaxel, SB 203580, alpha 2-antiplasmin, metoclopramide, and oxymatrine) and the mechanistic explanations for their potential use are further discussed. CONCLUSION We showed that a LBD approach can be feasible not only for discovering drug candidates for COVID-19, but also for generating mechanistic explanations. Our approach can be generalized to other diseases as well as to other clinical questions. Source code and data are available at https://github.com/kilicogluh/lbd-covid.
Collapse
Affiliation(s)
- Rui Zhang
- Institute for Health Informatics and Department of Pharmaceutical Care & Health Systems, University of Minnesota, MN, USA.
| | - Dimitar Hristovski
- Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
| | - Dalton Schutte
- Institute for Health Informatics and Department of Pharmaceutical Care & Health Systems, University of Minnesota, MN, USA
| | - Andrej Kastrin
- Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
| | - Marcelo Fiszman
- NITES - Núcleo de Inovação e Tecnologia Em Saúde, Pontifical Catholic University of Rio de Janeiro, Brazil
| | - Halil Kilicoglu
- School of Information Sciences, University of Illinois at Urbana-Champaign, Champaign, IL, USA
| |
Collapse
|
7
|
|
8
|
Malec SA, Boyce RD. Exploring Novel Computable Knowledge in Structured Drug Product Labels. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE PROCEEDINGS. AMIA JOINT SUMMITS ON TRANSLATIONAL SCIENCE 2020; 2020:403-412. [PMID: 32477661 PMCID: PMC7233092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
This paper introduces a database derived from Structured Product Labels (SPLs). SPLs are legally mandated snapshots containing information on all drugs released to market in the United States. Since publication is not required for pre-trial findings, we hypothesize that SPLs may contain knowledge absent in the literature, and hence "novel." SemMedDB is an existing database of computable knowledge derived from the literature. If SPL content could be similarly transformed, novel clinically relevant assertions in the SPLs could be identified through comparison with SemMedDB. After we derive a database (containing 4,297,481 assertions), we compare the extracted content with SemMedDB for recent FDA drug approvals. We find that novelty between the SPLs and the literature is nuanced, due to the redundancy of SPLs. Highlighting areas for improvement and future work, we conclude that SPLs contain a wealth of novel knowledge relevant to research and complementary to the literature.
Collapse
Affiliation(s)
- Scott A Malec
- University of Pittsburgh Department of Biomedical Informatics, Pittsburgh, PA
| | - Richard D Boyce
- University of Pittsburgh Department of Biomedical Informatics, Pittsburgh, PA
| |
Collapse
|
9
|
Kilicoglu H, Rosemblat G, Fiszman M, Shin D. Broad-coverage biomedical relation extraction with SemRep. BMC Bioinformatics 2020; 21:188. [PMID: 32410573 PMCID: PMC7222583 DOI: 10.1186/s12859-020-3517-7] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2020] [Accepted: 04/29/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep's performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships. RESULTS A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F 1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F 1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F 1 score. The recall and the F 1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level. CONCLUSIONS SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis.
Collapse
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
- University of Illinois at Urbana-Champaign, School of Information Sciences, 501 E Daniel Street, Champaign, 61820 IL USA
| | - Graciela Rosemblat
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
| | | | - Dongwook Shin
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
| |
Collapse
|
10
|
Burkhardt HA, Subramanian D, Mower J, Cohen T. Predicting Adverse Drug-Drug Interactions with Neural Embedding of Semantic Predications. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2020; 2019:992-1001. [PMID: 32308896 PMCID: PMC7153048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
The identification of drug-drug interactions (DDIs) is important for patient safety; yet, compared to other pharmacovigilance work, a limited amount of research has been conducted in this space. Recent work has successfully applied a method of deriving distributed vector representations from structured biomedical knowledge, known as Embedding of Semantic Predications (ESP), to the problem of predicting individual drug side effects. In the current paper we extend this work by applying ESP to the problem of predicting polypharmacy side-effects for particular drug combinations, building on a recent reconceptualization of this problem as a network of drug nodes connected by side effect edges. We evaluate ESP embeddings derived from the resulting graph on a side-effect prediction task against a previously reported graph convolutional neural network approach, using the same data and evaluation methods. We demonstrate that ESP models perform better, while being faster to train, more re-usable, and significantly simpler.
Collapse
|
11
|
Mower J, Cohen T, Subramanian D. Complementing Observational Signals with Literature-Derived Distributed Representations for Post-Marketing Drug Surveillance. Drug Saf 2019; 43:67-77. [PMID: 31646442 DOI: 10.1007/s40264-019-00872-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
INTRODUCTION As a result of the well documented limitations of data collected by spontaneous reporting systems (SRS), such as bias and under-reporting, a number of authors have evaluated the utility of other data sources for the purpose of pharmacovigilance, including the biomedical literature. Previous work has demonstrated the utility of literature-derived distributed representations (concept embeddings) with machine learning for the purpose of drug side-effect prediction. In terms of data sources, these methods are complementary, observing drug safety from two different perspectives (knowledge extracted from the literature and statistics from SRS data). However, the combined utility of these pharmacovigilance methods has yet to be evaluated. OBJECTIVE This research investigates the utility of directly or indirectly combining an observational signal from SRS with literature-derived distributed representations into a single feature vector or in an ensemble approach for downstream machine learning (logistic regression). METHODS Leveraging a recently developed representation scheme, concept embeddings were generated from relational connections extracted from the literature and composed to represent drug and associated adverse reactions, as defined by two reference standards of positive (likely causal) and negative (no causal evidence) pairs. Embeddings were presented with and without common measures of observational signal from SRS sources to logistic regressors, and performance was evaluated with the receiver operating characteristic (ROC) area under the curve (AUC) metric. RESULTS ROC AUC performance with these composite models improves up to ≈ 20% over SRS-based disproportionality metrics alone and exceeds the best prior results reported in the literature when models leverage both sources of information. CONCLUSIONS Results from this study support the hypothesis that knowledge extracted from the literature can enhance the performance of SRS-based methods (and vice versa). Across reference sets, using literature and SRS information together performed better than using either source alone, providing strong support for the complementary nature of these approaches to post-marketing drug surveillance.
Collapse
Affiliation(s)
- Justin Mower
- Department of Computer Science, Rice University, Houston, TX, 77018, USA.
| | - Trevor Cohen
- University of Washington, Biomedical Informatics and Medical Education, Seattle, WA, 98195, USA
| | - Devika Subramanian
- Department of Computer Science, Rice University, Houston, TX, 77018, USA
| |
Collapse
|
12
|
Maldonado A, Klubička F, Kelleher J. Size Matters: The Impact of Training Size in Taxonomically-Enriched Word Embeddings. OPEN COMPUTER SCIENCE 2019. [DOI: 10.1515/comp-2019-0009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
AbstractWord embeddings trained on natural corpora (e.g., newspaper collections, Wikipedia or the Web) excel in capturing thematic similarity (“topical relatedness”) on word pairs such as ‘coffee’ and ‘cup’ or ’bus’ and ‘road’. However, they are less successful on pairs showing taxonomic similarity, like ‘cup’ and ‘mug’ (near synonyms) or ‘bus’ and ‘train’ (types of public transport). Moreover, purely taxonomy-based embeddings (e.g. those trained on a random-walk of WordNet’s structure) outperform natural-corpus embeddings in taxonomic similarity but underperform them in thematic similarity. Previous work suggests that performance gains in both types of similarity can be achieved by enriching natural-corpus embeddings with taxonomic information from taxonomies like Word-Net. This taxonomic enrichment can be done by combining natural-corpus embeddings with taxonomic embeddings (e.g. those trained on a random-walk of WordNet’s structure). This paper conducts a deep analysis of this assumption and shows that both the size of the natural corpus and of the random-walk coverage of the WordNet structure play a crucial role in the performance of combined (enriched) vectors in both similarity tasks. Specifically, we show that embeddings trained on medium-sized natural corpora benefit the most from taxonomic enrichment whilst embeddings trained on large natural corpora only benefit from this enrichment when evaluated on taxonomic similarity tasks. The implication of this is that care has to be taken in controlling the size of the natural corpus and the size of the random-walk used to train vectors. In addition, we find that, whilst the WordNet structure is finite and it is possible to fully traverse it in a single pass, the repetition of well-connected WordNet concepts in extended random-walks effectively reinforces taxonomic relations in the learned embeddings.
Collapse
Affiliation(s)
| | - Filip Klubička
- ADAPT Centre at Technological University Dublin, Dublin, Ireland
| | - John Kelleher
- ADAPT Centre at Technological University Dublin, Dublin, Ireland
| |
Collapse
|
13
|
Natsiavas P, Malousi A, Bousquet C, Jaulent MC, Koutkias V. Computational Advances in Drug Safety: Systematic and Mapping Review of Knowledge Engineering Based Approaches. Front Pharmacol 2019; 10:415. [PMID: 31156424 PMCID: PMC6533857 DOI: 10.3389/fphar.2019.00415] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2018] [Accepted: 04/02/2019] [Indexed: 12/12/2022] Open
Abstract
Drug Safety (DS) is a domain with significant public health and social impact. Knowledge Engineering (KE) is the Computer Science discipline elaborating on methods and tools for developing “knowledge-intensive” systems, depending on a conceptual “knowledge” schema and some kind of “reasoning” process. The present systematic and mapping review aims to investigate KE-based approaches employed for DS and highlight the introduced added value as well as trends and possible gaps in the domain. Journal articles published between 2006 and 2017 were retrieved from PubMed/MEDLINE and Web of Science® (873 in total) and filtered based on a comprehensive set of inclusion/exclusion criteria. The 80 finally selected articles were reviewed on full-text, while the mapping process relied on a set of concrete criteria (concerning specific KE and DS core activities, special DS topics, employed data sources, reference ontologies/terminologies, and computational methods, etc.). The analysis results are publicly available as online interactive analytics graphs. The review clearly depicted increased use of KE approaches for DS. The collected data illustrate the use of KE for various DS aspects, such as Adverse Drug Event (ADE) information collection, detection, and assessment. Moreover, the quantified analysis of using KE for the respective DS core activities highlighted room for intensifying research on KE for ADE monitoring, prevention and reporting. Finally, the assessed use of the various data sources for DS special topics demonstrated extensive use of dominant data sources for DS surveillance, i.e., Spontaneous Reporting Systems, but also increasing interest in the use of emerging data sources, e.g., observational healthcare databases, biochemical/genetic databases, and social media. Various exemplar applications were identified with promising results, e.g., improvement in Adverse Drug Reaction (ADR) prediction, detection of drug interactions, and novel ADE profiles related with specific mechanisms of action, etc. Nevertheless, since the reviewed studies mostly concerned proof-of-concept implementations, more intense research is required to increase the maturity level that is necessary for KE approaches to reach routine DS practice. In conclusion, we argue that efficiently addressing DS data analytics and management challenges requires the introduction of high-throughput KE-based methods for effective knowledge discovery and management, resulting ultimately, in the establishment of a continuous learning DS system.
Collapse
Affiliation(s)
- Pantelis Natsiavas
- Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece.,Sorbonne Université, INSERM, Univ Paris 13, Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances pour la e-Santé, LIMICS, Paris, France
| | - Andigoni Malousi
- Laboratory of Biological Chemistry, Department of Medicine, Aristotle University of Thessaloniki, Thessaloniki, Greece
| | - Cédric Bousquet
- Sorbonne Université, INSERM, Univ Paris 13, Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances pour la e-Santé, LIMICS, Paris, France.,Public Health and Medical Information Unit, University Hospital of Saint-Etienne, Saint-Étienne, France
| | - Marie-Christine Jaulent
- Sorbonne Université, INSERM, Univ Paris 13, Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances pour la e-Santé, LIMICS, Paris, France
| | - Vassilis Koutkias
- Institute of Applied Biosciences, Centre for Research and Technology Hellas, Thessaloniki, Greece
| |
Collapse
|
14
|
Gopalakrishnan V, Jha K, Jin W, Zhang A. A survey on literature based discovery approaches in biomedical domain. J Biomed Inform 2019; 93:103141. [PMID: 30857950 DOI: 10.1016/j.jbi.2019.103141] [Citation(s) in RCA: 21] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2018] [Revised: 02/17/2019] [Accepted: 02/19/2019] [Indexed: 02/06/2023]
Abstract
Literature Based Discovery (LBD) refers to the problem of inferring new and interesting knowledge by logically connecting independent fragments of information units through explicit or implicit means. This area of research, which incorporates techniques from Natural Language Processing (NLP), Information Retrieval and Artificial Intelligence, has significant potential to reduce discovery time in biomedical research fields. Formally introduced in 1986, LBD has grown to be a significant and a core task for text mining practitioners in the biomedical domain. Together with its inter-disciplinary nature, this has led researchers across domains to contribute in advancing this field of study. This survey attempts to consolidate and present the evolution of techniques in this area. We cover a variety of techniques and provide a detailed description of the problem setting, the intuition, the advantages and limitations of various influential papers. We also list the current bottlenecks in this field and provide a general direction of research activities for the future. In an effort to be comprehensive and for ease of reference for off-the-shelf users, we also list many publicly available tools for LBD. We hope this survey will act as a guide to both academic and industry (bio)-informaticians, introduce the various methodologies currently employed and also the challenges yet to be tackled.
Collapse
Affiliation(s)
| | | | - Wei Jin
- University of North Texas at Denton, TX, United States.
| | | |
Collapse
|
15
|
Fathiamini S, Johnson AM, Zeng J, Holla V, Sanchez NS, Meric-Bernstam F, Bernstam EV, Cohen T. Rapamycin - mTOR + BRAF = ? Using relational similarity to find therapeutically relevant drug-gene relationships in unstructured text. J Biomed Inform 2019; 90:103094. [PMID: 30615938 PMCID: PMC6386529 DOI: 10.1016/j.jbi.2019.103094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/04/2018] [Revised: 11/30/2018] [Accepted: 12/27/2018] [Indexed: 11/17/2022]
Affiliation(s)
- Safa Fathiamini
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, United States.
| | - Amber M Johnson
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
| | - Jia Zeng
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
| | - Vijaykumar Holla
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
| | - Nora S Sanchez
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
| | - Funda Meric-Bernstam
- Sheikh Khalifa Al Nahyan Ben Zayed Institute for Personalized Cancer Therapy, The University of Texas MD Anderson Cancer Center, Houston, TX, United States; Department of Investigational Cancer Therapeutics, The University of Texas MD Anderson Cancer Center, Houston, TX, United States; Department of Surgical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
| | - Elmer V Bernstam
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, TX, United States; Division of General Internal Medicine, Department of Internal Medicine, The University of Texas Health Science Center at Houston, TX, United States.
| | - Trevor Cohen
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, United States.
| |
Collapse
|
16
|
Mower J, Subramanian D, Cohen T. Learning predictive models of drug side-effect relationships from distributed representations of literature-derived semantic predications. J Am Med Inform Assoc 2018; 25:1339-1350. [PMID: 30010902 PMCID: PMC6454491 DOI: 10.1093/jamia/ocy077] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2018] [Revised: 04/23/2018] [Accepted: 06/05/2018] [Indexed: 02/01/2023] Open
Abstract
Objective The aim of this work is to leverage relational information extracted from biomedical literature using a novel synthesis of unsupervised pretraining, representational composition, and supervised machine learning for drug safety monitoring. Methods Using ≈80 million concept-relationship-concept triples extracted from the literature using the SemRep Natural Language Processing system, distributed vector representations (embeddings) were generated for concepts as functions of their relationships utilizing two unsupervised representational approaches. Embeddings for drugs and side effects of interest from two widely used reference standards were then composed to generate embeddings of drug/side-effect pairs, which were used as input for supervised machine learning. This methodology was developed and evaluated using cross-validation strategies and compared to contemporary approaches. To qualitatively assess generalization, models trained on the Observational Medical Outcomes Partnership (OMOP) drug/side-effect reference set were evaluated against a list of ≈1100 drugs from an online database. Results The employed method improved performance over previous approaches. Cross-validation results advance the state of the art (AUC 0.96; F1 0.90 and AUC 0.95; F1 0.84 across the two sets), outperforming methods utilizing literature and/or spontaneous reporting system data. Examination of predictions for unseen drug/side-effect pairs indicates the ability of these methods to generalize, with over tenfold label support enrichment in the top 100 predictions versus the bottom 100 predictions. Discussion and Conclusion Our methods can assist the pharmacovigilance process using information from the biomedical literature. Unsupervised pretraining generates a rich relationship-based representational foundation for machine learning techniques to classify drugs in the context of a putative side effect, given known examples.
Collapse
Affiliation(s)
- Justin Mower
- Baylor College of Medicine, Quantitative and Computational Biosciences, Houston, Texas, USA
| | | | - Trevor Cohen
- School of Biomedical Informatics, University of Texas Health Science Center Houston, Texas, USA
| |
Collapse
|
17
|
Zhang W, Yue X, Lin W, Wu W, Liu R, Huang F, Liu F. Predicting drug-disease associations by using similarity constrained matrix factorization. BMC Bioinformatics 2018; 19:233. [PMID: 29914348 PMCID: PMC6006580 DOI: 10.1186/s12859-018-2220-4] [Citation(s) in RCA: 154] [Impact Index Per Article: 22.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/11/2017] [Accepted: 05/28/2018] [Indexed: 02/06/2023] Open
Abstract
Background Drug-disease associations provide important information for the drug discovery. Wet experiments that identify drug-disease associations are time-consuming and expensive. However, many drug-disease associations are still unobserved or unknown. The development of computational methods for predicting unobserved drug-disease associations is an important and urgent task. Results In this paper, we proposed a similarity constrained matrix factorization method for the drug-disease association prediction (SCMFDD), which makes use of known drug-disease associations, drug features and disease semantic information. SCMFDD projects the drug-disease association relationship into two low-rank spaces, which uncover latent features for drugs and diseases, and then introduces drug feature-based similarities and disease semantic similarity as constraints for drugs and diseases in low-rank spaces. Different from the classic matrix factorization technique, SCMFDD takes the biological context of the problem into account. In computational experiments, the proposed method can produce high-accuracy performances on benchmark datasets, and outperform existing state-of-the-art prediction methods when evaluated by five-fold cross validation and independent testing. Conclusion We developed a user-friendly web server by using known associations collected from the CTD database, available at http://www.bioinfotech.cn/SCMFDD/. The case studies show that the server can find out novel associations, which are not included in the CTD database.
Collapse
Affiliation(s)
- Wen Zhang
- School of Computer Science, Wuhan University, Wuhan, 430072, China.
| | - Xiang Yue
- School of Computer Science, Wuhan University, Wuhan, 430072, China
| | - Weiran Lin
- School of Computer Science, Wuhan University, Wuhan, 430072, China
| | - Wenjian Wu
- School of Electronic Information, Wuhan University, Wuhan, 430072, China
| | - Ruoqi Liu
- School of Computer Science, Wuhan University, Wuhan, 430072, China
| | - Feng Huang
- School of Computer Science, Wuhan University, Wuhan, 430072, China
| | - Feng Liu
- School of Computer Science, Wuhan University, Wuhan, 430072, China.
| |
Collapse
|
18
|
Smalheiser NR. Rediscovering Don Swanson: the Past, Present and Future of Literature-Based Discovery. JOURNAL OF DATA AND INFORMATION SCIENCE 2017; 2:43-64. [PMID: 29355246 PMCID: PMC5771422 DOI: 10.1515/jdis-2017-0019] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/25/2023] Open
Abstract
The late Don R. Swanson was well appreciated during his lifetime as Dean of the Graduate Library School at University of Chicago, as winner of the American Society for Information Science Award of Merit for 2000, and as author of many seminal articles. In this informal essay, I will give my personal perspective on Don's contributions to science, and outline some current and future directions in literature-based discovery that are rooted in concepts that he developed.
Collapse
Affiliation(s)
- Neil R Smalheiser
- Department of Psychiatry, University of Illinois at Chicago, Chicago, IL 60612 USA, +1 312-413-4581
| |
Collapse
|