1
|
Li D, Wu P, Dong Y, Gu J, Qian L, Zhou G. Joint learning-based causal relation extraction from biomedical literature. J Biomed Inform 2023; 139:104318. [PMID: 36781035 DOI: 10.1016/j.jbi.2023.104318] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/30/2022] [Revised: 02/03/2023] [Accepted: 02/08/2023] [Indexed: 02/13/2023]
Abstract
Causal relation extraction of biomedical entities is one of the most complex tasks in biomedical text mining, which involves two kinds of information: entity relations and entity functions. One feasible approach is to take relation extraction and function detection as two independent sub-tasks. However, this separate learning method ignores the intrinsic correlation between them and leads to unsatisfactory performance. In this paper, we propose a joint learning model, which combines entity relation extraction and entity function detection to exploit their commonality and capture their inter-relationship, so as to improve the performance of biomedical causal relation extraction. Experimental results on the BioCreative-V Track 4 corpus show that our joint learning model outperforms the separate models in BEL statement extraction, achieving the F1 scores of 57.0% and 37.3% on the test set in Stage 2 and Stage 1 evaluations, respectively. This demonstrates that our joint learning system reaches the state-of-the-art performance in Stage 2 compared with other systems.
Collapse
Affiliation(s)
- Dongling Li
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province 215006, China.
| | - Pengchao Wu
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province 215006, China.
| | - Yuehu Dong
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province 215006, China.
| | - Jinghang Gu
- Department of Chinese and Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong 999077, China.
| | - Longhua Qian
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province 215006, China.
| | - Guodong Zhou
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province 215006, China.
| |
Collapse
|
2
|
Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. A reproducible experimental survey on biomedical sentence similarity: A string-based method sets the state of the art. PLoS One 2022; 17:e0276539. [PMID: 36409715 PMCID: PMC9678326 DOI: 10.1371/journal.pone.0276539] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2022] [Accepted: 10/08/2022] [Indexed: 11/22/2022] Open
Abstract
This registered report introduces the largest, and for the first time, reproducible experimental survey on biomedical sentence similarity with the following aims: (1) to elucidate the state of the art of the problem; (2) to solve some reproducibility problems preventing the evaluation of most current methods; (3) to evaluate several unexplored sentence similarity methods; (4) to evaluate for the first time an unexplored benchmark, called Corpus-Transcriptional-Regulation (CTR); (5) to carry out a study on the impact of the pre-processing stages and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (6) to bridge the lack of software and data reproducibility resources for methods and experiments in this line of research. Our reproducible experimental survey is based on a single software platform, which is provided with a detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results. In addition, we introduce a new aggregated string-based sentence similarity method, called LiBlock, together with eight variants of current ontology-based methods, and a new pre-trained word embedding model trained on the full-text articles in the PMC-BioC corpus. Our experiments show that our novel string-based measure establishes the new state of the art in sentence similarity analysis in the biomedical domain and significantly outperforms all the methods evaluated herein, with the only exception of one ontology-based method. Likewise, our experiments confirm that the pre-processing stages, and the choice of the NER tool for ontology-based methods, have a very significant impact on the performance of the sentence similarity methods. We also detail some drawbacks and limitations of current methods, and highlight the need to refine the current benchmarks. Finally, a notable finding is that our new string-based method significantly outperforms all state-of-the-art Machine Learning (ML) models evaluated herein.
Collapse
Affiliation(s)
- Alicia Lara-Clares
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| | - Juan J. Lastra-Díaz
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| | - Ana Garcia-Serrano
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| |
Collapse
|
3
|
Martínez-García M, Hernández-Lemus E. Data Integration Challenges for Machine Learning in Precision Medicine. Front Med (Lausanne) 2022; 8:784455. [PMID: 35145977 PMCID: PMC8821900 DOI: 10.3389/fmed.2021.784455] [Citation(s) in RCA: 43] [Impact Index Per Article: 14.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2021] [Accepted: 12/28/2021] [Indexed: 12/19/2022] Open
Abstract
A main goal of Precision Medicine is that of incorporating and integrating the vast corpora on different databases about the molecular and environmental origins of disease, into analytic frameworks, allowing the development of individualized, context-dependent diagnostics, and therapeutic approaches. In this regard, artificial intelligence and machine learning approaches can be used to build analytical models of complex disease aimed at prediction of personalized health conditions and outcomes. Such models must handle the wide heterogeneity of individuals in both their genetic predisposition and their social and environmental determinants. Computational approaches to medicine need to be able to efficiently manage, visualize and integrate, large datasets combining structure, and unstructured formats. This needs to be done while constrained by different levels of confidentiality, ideally doing so within a unified analytical architecture. Efficient data integration and management is key to the successful application of computational intelligence approaches to medicine. A number of challenges arise in the design of successful designs to medical data analytics under currently demanding conditions of performance in personalized medicine, while also subject to time, computational power, and bioethical constraints. Here, we will review some of these constraints and discuss possible avenues to overcome current challenges.
Collapse
Affiliation(s)
- Mireya Martínez-García
- Clinical Research Division, National Institute of Cardiology ‘Ignacio Chávez’, Mexico City, Mexico
| | - Enrique Hernández-Lemus
- Computational Genomics Division, National Institute of Genomic Medicine (INMEGEN), Mexico City, Mexico
- Center for Complexity Sciences, Universidad Nacional Autnoma de Mexico, Mexico City, Mexico
| |
Collapse
|
4
|
Pourreza Shahri M, Kahanda I. Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes. BMC Bioinformatics 2021; 22:500. [PMID: 34656098 PMCID: PMC8520253 DOI: 10.1186/s12859-021-04421-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2021] [Accepted: 10/04/2021] [Indexed: 11/13/2022] Open
Abstract
Background Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing due to its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitive, automated tools capable of accurately extracting these associations from the biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions required for training such models is highly resource-consuming, extracting millions of unlabeled co-mentions is straightforward. Results In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised, and ensemble learning for classifying human protein-phenotype co-mentions with the help of unlabeled data. This framework allows the ability to incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes with a small labeled dataset to enhance overall performance. We develop PPPredSS, a prototype of our proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach provides a new state-of-the-art performance in classifying human protein-phenotype co-mentions by outperforming other supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists. Conclusions This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.
Collapse
Affiliation(s)
| | - Indika Kahanda
- School of Computing, University of North Florida, Jacksonville, USA.
| |
Collapse
|
5
|
Yang X, Wu C, Nenadic G, Wang W, Lu K. Mining a stroke knowledge graph from literature. BMC Bioinformatics 2021; 22:387. [PMID: 34325669 PMCID: PMC8319697 DOI: 10.1186/s12859-021-04292-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/13/2021] [Accepted: 07/06/2021] [Indexed: 01/01/2023] Open
Abstract
BACKGROUND Stroke has an acute onset and a high mortality rate, making it one of the most fatal diseases worldwide. Its underlying biology and treatments have been widely studied both in the "Western" biomedicine and the Traditional Chinese Medicine (TCM). However, these two approaches are often studied and reported in insolation, both in the literature and associated databases. RESULTS To aid research in finding effective prevention methods and treatments, we integrated knowledge from the literature and a number of databases (e.g. CID, TCMID, ETCM). We employed a suite of biomedical text mining (i.e. named-entity) approaches to identify mentions of genes, diseases, drugs, chemicals, symptoms, Chinese herbs and patent medicines, etc. in a large set of stroke papers from both biomedical and TCM domains. Then, using a combination of a rule-based approach with a pre-trained BioBERT model, we extracted and classified links and relationships among stroke-related entities as expressed in the literature. We construct StrokeKG, a knowledge graph includes almost 46 k nodes of nine types, and 157 k links of 30 types, connecting diseases, genes, symptoms, drugs, pathways, herbs, chemical, ingredients and patent medicine. CONCLUSIONS Our Stroke-KG can provide practical and reliable stroke-related knowledge to help with stroke-related research like exploring new directions for stroke research and ideas for drug repurposing and discovery. We make StrokeKG freely available at http://114.115.208.144:7474/browser/ (Please click "Connect" directly) and the source structured data for stroke at https://github.com/yangxi1016/Stroke.
Collapse
Affiliation(s)
- Xi Yang
- College of Computer, National University of Defence Technology, Changsha, 410073 China
- State Key Laboratory of High-Performance Computing, National University of Defence Technology, Changsha, 410073 China
- Department of Computer Science, University of Manchester, Manchester, M13 9PL UK
| | - Chengkun Wu
- State Key Laboratory of High-Performance Computing, National University of Defence Technology, Changsha, 410073 China
| | - Goran Nenadic
- Department of Computer Science, University of Manchester, Manchester, M13 9PL UK
| | - Wei Wang
- College of Computer, National University of Defence Technology, Changsha, 410073 China
| | - Kai Lu
- College of Computer, National University of Defence Technology, Changsha, 410073 China
| |
Collapse
|
6
|
Lara-Clares A, Lastra-Díaz JJ, Garcia-Serrano A. Protocol for a reproducible experimental survey on biomedical sentence similarity. PLoS One 2021; 16:e0248663. [PMID: 33760855 PMCID: PMC7990182 DOI: 10.1371/journal.pone.0248663] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2020] [Accepted: 03/02/2021] [Indexed: 11/28/2022] Open
Abstract
Measuring semantic similarity between sentences is a significant task in the fields of Natural Language Processing (NLP), Information Retrieval (IR), and biomedical text mining. For this reason, the proposal of sentence similarity methods for the biomedical domain has attracted a lot of attention in recent years. However, most sentence similarity methods and experimental results reported in the biomedical domain cannot be reproduced for multiple reasons as follows: the copying of previous results without confirmation, the lack of source code and data to replicate both methods and experiments, and the lack of a detailed definition of the experimental setup, among others. As a consequence of this reproducibility gap, the state of the problem can be neither elucidated nor new lines of research be soundly set. On the other hand, there are other significant gaps in the literature on biomedical sentence similarity as follows: (1) the evaluation of several unexplored sentence similarity methods which deserve to be studied; (2) the evaluation of an unexplored benchmark on biomedical sentence similarity, called Corpus-Transcriptional-Regulation (CTR); (3) a study on the impact of the pre-processing stage and Named Entity Recognition (NER) tools on the performance of the sentence similarity methods; and finally, (4) the lack of software and data resources for the reproducibility of methods and experiments in this line of research. Identified these open problems, this registered report introduces a detailed experimental setup, together with a categorization of the literature, to develop the largest, updated, and for the first time, reproducible experimental survey on biomedical sentence similarity. Our aforementioned experimental survey will be based on our own software replication and the evaluation of all methods being studied on the same software platform, which will be specially developed for this work, and it will become the first publicly available software library for biomedical sentence similarity. Finally, we will provide a very detailed reproducibility protocol and dataset as supplementary material to allow the exact replication of all our experiments and results.
Collapse
Affiliation(s)
- Alicia Lara-Clares
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| | - Juan J. Lastra-Díaz
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| | - Ana Garcia-Serrano
- NLP & IR Research Group, E.T.S.I. Informática, Universidad Nacional de Educación a Distancia (UNED), Madrid, Spain
| |
Collapse
|
7
|
Shao Y, Li H, Gu J, Qian L, Zhou G. Extraction of causal relations based on SBEL and BERT model. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2021; 2021:6133143. [PMID: 33570092 PMCID: PMC7904051 DOI: 10.1093/database/baab005] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/30/2020] [Revised: 01/19/2021] [Accepted: 01/26/2021] [Indexed: 11/15/2022]
Abstract
Extraction of causal relations between biomedical entities in the form of Biological Expression Language (BEL) poses a new challenge to the community of biomedical text mining due to the complexity of BEL statements. We propose a simplified form of BEL statements [Simplified Biological Expression Language (SBEL)] to facilitate BEL extraction and employ BERT (Bidirectional Encoder Representation from Transformers) to improve the performance of causal relation extraction (RE). On the one hand, BEL statement extraction is transformed into the extraction of an intermediate form—SBEL statement, which is then further decomposed into two subtasks: entity RE and entity function detection. On the other hand, we use a powerful pretrained BERT model to both extract entity relations and detect entity functions, aiming to improve the performance of two subtasks. Entity relations and functions are then combined into SBEL statements and finally merged into BEL statements. Experimental results on the BioCreative-V Track 4 corpus demonstrate that our method achieves the state-of-the-art performance in BEL statement extraction with F1 scores of 54.8% in Stage 2 evaluation and of 30.1% in Stage 1 evaluation, respectively. Database URL: https://github.com/grapeff/SBEL_datasets
Collapse
Affiliation(s)
- Yifan Shao
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
| | - Haoru Li
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
| | - Jinghang Gu
- Department of Chinese & Bilingual Studies, The Hong Kong Polytechnic University, Hong Kong, China, 999077
| | - Longhua Qian
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
| | - Guodong Zhou
- School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu Province, China, 215006
| |
Collapse
|
8
|
Nicholson DN, Greene CS. Constructing knowledge graphs and their biomedical applications. Comput Struct Biotechnol J 2020; 18:1414-1428. [PMID: 32637040 PMCID: PMC7327409 DOI: 10.1016/j.csbj.2020.05.017] [Citation(s) in RCA: 97] [Impact Index Per Article: 19.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2020] [Revised: 05/22/2020] [Accepted: 05/23/2020] [Indexed: 12/31/2022] Open
Abstract
Knowledge graphs can support many biomedical applications. These graphs represent biomedical concepts and relationships in the form of nodes and edges. In this review, we discuss how these graphs are constructed and applied with a particular focus on how machine learning approaches are changing these processes. Biomedical knowledge graphs have often been constructed by integrating databases that were populated by experts via manual curation, but we are now seeing a more robust use of automated systems. A number of techniques are used to represent knowledge graphs, but often machine learning methods are used to construct a low-dimensional representation that can support many different applications. This representation is designed to preserve a knowledge graph's local and/or global structure. Additional machine learning methods can be applied to this representation to make predictions within genomic, pharmaceutical, and clinical domains. We frame our discussion first around knowledge graph construction and then around unifying representational learning techniques and unifying applications. Advances in machine learning for biomedicine are creating new opportunities across many domains, and we note potential avenues for future work with knowledge graphs that appear particularly promising.
Collapse
Affiliation(s)
- David N. Nicholson
- Department of Systems Pharmacology and Translational Therapeutics, University of Pennsylvania, United States
| | - Casey S. Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Childhood Cancer Data Lab, Alex’s Lemonade Stand Foundation, United States
| |
Collapse
|
9
|
Chen Q, Du J, Kim S, Wilbur WJ, Lu Z. Deep learning with sentence embeddings pre-trained on biomedical corpora improves the performance of finding similar sentences in electronic medical records. BMC Med Inform Decis Mak 2020; 20:73. [PMID: 32349758 PMCID: PMC7191680 DOI: 10.1186/s12911-020-1044-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
Background Capturing sentence semantics plays a vital role in a range of text mining applications. Despite continuous efforts on the development of related datasets and models in the general domain, both datasets and models are limited in biomedical and clinical domains. The BioCreative/OHNLP2018 organizers have made the first attempt to annotate 1068 sentence pairs from clinical notes and have called for a community effort to tackle the Semantic Textual Similarity (BioCreative/OHNLP STS) challenge. Methods We developed models using traditional machine learning and deep learning approaches. For the post challenge, we focused on two models: the Random Forest and the Encoder Network. We applied sentence embeddings pre-trained on PubMed abstracts and MIMIC-III clinical notes and updated the Random Forest and the Encoder Network accordingly. Results The official results demonstrated our best submission was the ensemble of eight models. It achieved a Person correlation coefficient of 0.8328 – the highest performance among 13 submissions from 4 teams. For the post challenge, the performance of both Random Forest and the Encoder Network was improved; in particular, the correlation of the Encoder Network was improved by ~ 13%. During the challenge task, no end-to-end deep learning models had better performance than machine learning models that take manually-crafted features. In contrast, with the sentence embeddings pre-trained on biomedical corpora, the Encoder Network now achieves a correlation of ~ 0.84, which is higher than the original best model. The ensembled model taking the improved versions of the Random Forest and Encoder Network as inputs further increased performance to 0.8528. Conclusions Deep learning models with sentence embeddings pre-trained on biomedical corpora achieve the highest performance on the test set. Through error analysis, we find that end-to-end deep learning models and traditional machine learning models with manually-crafted features complement each other by finding different types of sentences. We suggest a combination of these models can better find similar sentences in practice.
Collapse
Affiliation(s)
- Qingyu Chen
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
| | - Jingcheng Du
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.,School of Biomedical Informatics, UTHealth, Houston, USA
| | - Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, USA.
| |
Collapse
|
10
|
Farahmand S, Riley T, Zarringhalam K. ModEx: A text mining system for extracting mode of regulation of transcription factor-gene regulatory interaction. J Biomed Inform 2019; 102:103353. [PMID: 31857203 DOI: 10.1016/j.jbi.2019.103353] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2019] [Revised: 11/22/2019] [Accepted: 12/10/2019] [Indexed: 10/25/2022]
Abstract
BACKGROUND Transcription factors (TFs) are proteins that are fundamental to transcription and regulation of gene expression. Each TF may regulate multiple genes and each gene may be regulated by multiple TFs. TFs can act as either activator or repressor of gene expression. This complex network of interactions between TFs and genes underlies many developmental and biological processes and is implicated in several human diseases such as cancer. Hence deciphering the network of TF-gene interactions with information on mode of regulation (activation vs. repression) is an important step toward understanding the regulatory pathways that underlie complex traits. There are many experimental, computational, and manually curated databases of TF-gene interactions. In particular, high-throughput ChIP-Seq datasets provide a large-scale map or transcriptional regulatory interactions. However, these interactions are not annotated with information on context and mode of regulation. Such information is crucial to gain a global picture of gene regulatory mechanisms and can aid in developing machine learning models for applications such as biomarker discovery, prediction of response to therapy, and precision medicine. METHODS In this work, we introduce a text-mining system to annotate ChIP-Seq derived interaction with such meta data through mining PubMed articles. We evaluate the performance of our system using gold standard small scale manually curated databases. RESULTS Our results show that the method is able to accurately extract mode of regulation with F-score 0.77 on TRRUST curated interaction and F-score 0.96 on intersection of TRUSST and ChIP-network. We provide a HTTP REST API for our code to facilitate usage. Availibility: Source code and datasets are available for download on GitHub: https://github.com/samanfrm/modex.
Collapse
Affiliation(s)
- Saman Farahmand
- Computational Sciences PhD program, University of Massachusetts Boston, Boston, USA; Department of Biology, University of Massachusetts Boston, Boston, USA
| | - Todd Riley
- Department of Biology, University of Massachusetts Boston, Boston, USA
| | | |
Collapse
|
11
|
Lai PT, Lu WL, Kuo TR, Chung CR, Han JC, Tsai RTH, Horng JT. Using a Large Margin Context-Aware Convolutional Neural Network to Automatically Extract Disease-Disease Association from Literature: Comparative Analytic Study. JMIR Med Inform 2019; 7:e14502. [PMID: 31769759 PMCID: PMC6913619 DOI: 10.2196/14502] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2019] [Revised: 07/26/2019] [Accepted: 08/11/2019] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Research on disease-disease association (DDA), like comorbidity and complication, provides important insights into disease treatment and drug discovery, and a large body of the literature has been published in the field. However, using current search tools, it is not easy for researchers to retrieve information on the latest DDA findings. First, comorbidity and complication keywords pull up large numbers of PubMed studies. Second, disease is not highlighted in search results. Finally, DDA is not identified, as currently no disease-disease association extraction (DDAE) dataset or tools are available. OBJECTIVE As there are no available DDAE datasets or tools, this study aimed to develop (1) a DDAE dataset and (2) a neural network model for extracting DDA from the literature. METHODS In this study, we formulated DDAE as a supervised machine learning classification problem. To develop the system, we first built a DDAE dataset. We then employed two machine learning models, support vector machine and convolutional neural network, to extract DDA. Furthermore, we evaluated the effect of using the output layer as features of the support vector machine-based model. Finally, we implemented large margin context-aware convolutional neural network architecture to integrate context features and convolutional neural networks through the large margin function. RESULTS Our DDAE dataset consisted of 521 PubMed abstracts. Experiment results showed that the support vector machine-based approach achieved an F1 measure of 80.32%, which is higher than the convolutional neural network-based approach (73.32%). Using the output layer of convolutional neural network as a feature for the support vector machine does not further improve the performance of support vector machine. However, our large margin context-aware-convolutional neural network achieved the highest F1 measure of 84.18% and demonstrated that combining the hinge loss function of support vector machine with a convolutional neural network into a single neural network architecture outperforms other approaches. CONCLUSIONS To facilitate the development of text-mining research for DDAE, we developed the first publicly available DDAE dataset consisting of disease mentions, Medical Subject Heading IDs, and relation annotations. We developed different conventional machine learning models and neural network architectures and evaluated their effects on our DDAE dataset. To further improve DDAE performance, we propose an large margin context-aware-convolutional neural network model for DDAE that outperforms other approaches.
Collapse
Affiliation(s)
- Po-Ting Lai
- Department of Computer Science National Tsing Hua University, Hsinchu, Province of China Taiwan
| | - Wei-Liang Lu
- Department of Computer Science & Information Engineering, National Central University, Taoyuan, Province of China Taiwan
| | - Ting-Rung Kuo
- Department of Computer Science & Information Engineering, National Central University, Taoyuan, Province of China Taiwan
| | - Chia-Ru Chung
- Department of Computer Science & Information Engineering, National Central University, Taoyuan, Province of China Taiwan
| | - Jen-Chieh Han
- Department of Computer Science & Information Engineering, National Central University, Taoyuan, Province of China Taiwan
| | - Richard Tzong-Han Tsai
- Department of Computer Science & Information Engineering, National Central University, Taoyuan, Province of China Taiwan
| | - Jorng-Tzong Horng
- Department of Computer Science & Information Engineering, National Central University, Taoyuan, Province of China Taiwan
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Province of China Taiwan
| |
Collapse
|
12
|
Liu S, Shao Y, Qian L, Zhou G. Hierarchical sequence labeling for extracting BEL statements from biomedical literature. BMC Med Inform Decis Mak 2019; 19:63. [PMID: 30961584 PMCID: PMC6454591 DOI: 10.1186/s12911-019-0758-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/02/2022] Open
Abstract
Background Extracting relations between bio-entities from biomedical literature is often a challenging task and also an essential step towards biomedical knowledge expansion. The BioCreative community has organized a shared task to evaluate the robustness of the causal relationship extraction algorithms in Biological Expression Language (BEL) from biomedical literature. Method We first map the sentence-level BEL statements in the BC-V training corpus to the corresponding text segments, thus generating hierarchically tagged training instances. A hierarchical sequence labeling model was afterwards induced from these training instances and applied to the test sentences in order to construct the BEL statements. Results The experimental results on extracting BEL statements from BioCreative V Track 4 test corpus show that our method achieves promising performance with an overall F-measure of 31.6%. Furthermore, it has the potential to be enhanced by adopting more advanced machine learning approaches. Conclusion We propose a framework for hierarchical relation extraction using hierarchical sequence labeling on the instance-level training corpus derived from the original sentence-level corpus via word alignment. Its main advantage is that we can make full use of the original training corpus to induce the sequence labelers and then apply them to the test corpus.
Collapse
Affiliation(s)
- Suwen Liu
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Yifan Shao
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Longhua Qian
- School of Computer Science and Technology, Soochow University, Suzhou, China.
| | - Guodong Zhou
- School of Computer Science and Technology, Soochow University, Suzhou, China
| |
Collapse
|
13
|
Saqi M, Lysenko A, Guo YK, Tsunoda T, Auffray C. Navigating the disease landscape: knowledge representations for contextualizing molecular signatures. Brief Bioinform 2019; 20:609-623. [PMID: 29684165 PMCID: PMC6556902 DOI: 10.1093/bib/bby025] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/09/2017] [Revised: 02/05/2018] [Indexed: 12/14/2022] Open
Abstract
Large amounts of data emerging from experiments in molecular medicine are leading to the identification of molecular signatures associated with disease subtypes. The contextualization of these patterns is important for obtaining mechanistic insight into the aberrant processes associated with a disease, and this typically involves the integration of multiple heterogeneous types of data. In this review, we discuss knowledge representations that can be useful to explore the biological context of molecular signatures, in particular three main approaches, namely, pathway mapping approaches, molecular network centric approaches and approaches that represent biological statements as knowledge graphs. We discuss the utility of each of these paradigms, illustrate how they can be leveraged with selected practical examples and identify ongoing challenges for this field of research.
Collapse
Affiliation(s)
- Mansoor Saqi
- Mansoor Saqi Data Science Institute, Imperial College London, UK
| | - Artem Lysenko
- Artem Lysenko Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
| | - Yi-Ke Guo
- Yi-Ke Guo Data Science Institute, Imperial College London, UK
| | - Tatsuhiko Tsunoda
- Tatsuhiko Tsunoda Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan CREST, JST, Tokyo, Japan Department of Medical Science Mathematics, Medical Research Institute, Tokyo Medical and Dental University, Tokyo, Japan
| | - Charles Auffray
- Charles Auffray European Institute for Systems Biology and Medicine, Lyon, France
| |
Collapse
|
14
|
Liu S, Cheng W, Qian L, Zhou G. Combining relation extraction with function detection for BEL statement extraction. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2019; 2019:5277249. [PMID: 30624649 PMCID: PMC6323300 DOI: 10.1093/database/bay133] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/01/2018] [Accepted: 11/26/2018] [Indexed: 11/29/2022]
Abstract
The BioCreative-V community proposed a challenging task of automatic extraction of causal relation network in Biological Expression Language (BEL) from the biomedical literature. Previous studies on this task largely used models induced from other related tasks and then transformed intermediate structures to BEL statements, which left the given training corpus unexplored. To make full use of the BEL training corpus, in this work, we propose a deep learning-based approach to extract BEL statements. Specifically, we decompose the problem into two subtasks: entity relation extraction and entity function detection. First, two attention-based bidirectional long short-term memory networks models are used to extract entity relation and entity function, respectively. Then entity relation and their functions are combined into a BEL statement. In order to boost the overall performance, a strategy of threshold filtering is applied to improve the precision of identified entity functions. We evaluate our approach on the BioCreative-V Track 4 corpus with or without gold entities. The experimental results show that our method achieves the state-of-the-art performance with an overall F1-measure of 46.9% in stage 2 and 21.3% in stage 1, respectively.
Collapse
Affiliation(s)
- Suwen Liu
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Wei Cheng
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Longhua Qian
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Guodong Zhou
- School of Computer Science and Technology, Soochow University, Suzhou, China
| |
Collapse
|
15
|
Abstract
BACKGROUND Systems biology is an important field for understanding whole biological mechanisms composed of interactions between biological components. One approach for understanding complex and diverse mechanisms is to analyze biological pathways. However, because these pathways consist of important interactions and information on these interactions is disseminated in a large number of biomedical reports, text-mining techniques are essential for extracting these relationships automatically. RESULTS In this study, we applied node2vec, an algorithmic framework for feature learning in networks, for relationship extraction. To this end, we extracted genes from paper abstracts using pkde4j, a text-mining tool for detecting entities and relationships. Using the extracted genes, a co-occurrence network was constructed and node2vec was used with the network to generate a latent representation. To demonstrate the efficacy of node2vec in extracting relationships between genes, performance was evaluated for gene-gene interactions involved in a type 2 diabetes pathway. Moreover, we compared the results of node2vec to those of baseline methods such as co-occurrence and DeepWalk. CONCLUSIONS Node2vec outperformed existing methods in detecting relationships in the type 2 diabetes pathway, demonstrating that this method is appropriate for capturing the relatedness between pairs of biological entities involved in biological pathways. The results demonstrated that node2vec is useful for automatic pathway construction.
Collapse
Affiliation(s)
- Munui Kim
- Department of Library and Information Science, Yonsei University, 50, Yonsei-ro, Seodaemun-gu, Seoul, Republic of Korea
| | - Seung Han Baek
- Department of Library and Information Science, Yonsei University, 50, Yonsei-ro, Seodaemun-gu, Seoul, Republic of Korea
| | - Min Song
- Department of Library and Information Science, Yonsei University, 50, Yonsei-ro, Seodaemun-gu, Seoul, Republic of Korea.
| |
Collapse
|
16
|
Hoyt CT, Domingo-Fernández D, Hofmann-Apitius M. BEL Commons: an environment for exploration and analysis of networks encoded in Biological Expression Language. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018; 2018:5255171. [PMID: 30576488 PMCID: PMC6301338 DOI: 10.1093/database/bay126] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/03/2018] [Accepted: 11/05/2018] [Indexed: 12/19/2022]
Abstract
The rapid accumulation of knowledge in the field of systems and networks biology during recent years requires complex, but user-friendly and accessible web applications that allow from visualization to complex algorithmic analysis. While several web applications exist with various focuses on creation, revision, curation, storage, integration, collaboration, exploration, visualization and analysis, many of these services remain disjoint and have yet to be packaged into a cohesive environment. Here, we present BEL Commons: an integrative knowledge discovery environment for networks encoded in the Biological Expression Language (BEL). Users can upload files in BEL to be parsed, validated, compiled and stored with fine granular permissions. After, users can summarize, explore and optionally shared their networks with the scientific community. We have implemented a query builder wizard to help users find the relevant portions of increasingly large and complex networks and a visualization interface that allows them to explore their resulting networks. Finally, we have included a dedicated analytical service for performing data-driven analysis of knowledge networks to support hypothesis generation.
Collapse
Affiliation(s)
- Charles Tapley Hoyt
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin, Germany.,Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
| | - Daniel Domingo-Fernández
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin, Germany.,Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
| | - Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing, Sankt Augustin, Germany.,Bonn-Aachen International Center for IT, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany
| |
Collapse
|
17
|
Wang Y, Rastegar-Mojarad M, Komandur-Elayavilli R, Liu H. Leveraging word embeddings and medical entity extraction for biomedical dataset retrieval using unstructured texts. Database (Oxford) 2017; 2017:bax091. [PMID: 31725862 PMCID: PMC7243926 DOI: 10.1093/database/bax091] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2017] [Revised: 10/17/2017] [Accepted: 11/14/2017] [Indexed: 11/16/2022]
Abstract
The recent movement towards open data in the biomedical domain has generated a large number of datasets that are publicly accessible. The Big Data to Knowledge data indexing project, biomedical and healthCAre Data Discovery Index Ecosystem (bioCADDIE), has gathered these datasets in a one-stop portal aiming at facilitating their reuse for accelerating scientific advances. However, as the number of biomedical datasets stored and indexed increases, it becomes more and more challenging to retrieve the relevant datasets according to researchers' queries. In this article, we propose an information retrieval (IR) system to tackle this problem and implement it for the bioCADDIE Dataset Retrieval Challenge. The system leverages the unstructured texts of each dataset including the title and description for the dataset, and utilizes a state-of-the-art IR model, medical named entity extraction techniques, query expansion with deep learning-based word embeddings and a re-ranking strategy to enhance the retrieval performance. In empirical experiments, we compared the proposed system with 11 baseline systems using the bioCADDIE Dataset Retrieval Challenge datasets. The experimental results show that the proposed system outperforms other systems in terms of inference Average Precision and inference normalized Discounted Cumulative Gain, implying that the proposed system is a viable option for biomedical dataset retrieval. Database URL: https://github.com/yanshanwang/biocaddie2016mayodata.
Collapse
Affiliation(s)
- Yanshan Wang
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55901, USA
| | | | | | - Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55901, USA
| |
Collapse
|
18
|
Rastegar-Mojarad M, Komandur Elayavilli R, Liu H. BELTracker: evidence sentence retrieval for BEL statements. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2016; 2016:baw079. [PMID: 27173525 PMCID: PMC4865361 DOI: 10.1093/database/baw079] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/04/2015] [Accepted: 04/22/2016] [Indexed: 01/09/2023]
Abstract
Biological expression language (BEL) is one of the main formal representation models of biological networks. The primary source of information for curating biological networks in BEL representation has been literature. It remains a challenge to identify relevant articles and the corresponding evidence statements for curating and validating BEL statements. In this paper, we describe BELTracker, a tool used to retrieve and rank evidence sentences from PubMed abstracts and full-text articles for a given BEL statement (per the 2015 task requirements of BioCreative V BEL Task). The system is comprised of three main components, (i) translation of a given BEL statement to an information retrieval (IR) query, (ii) retrieval of relevant PubMed citations and (iii) finding and ranking the evidence sentences in those citations. BELTracker uses a combination of multiple approaches based on traditional IR, machine learning, and heuristics to accomplish the task. The system identified and ranked at least one fully relevant evidence sentence in the top 10 retrieved sentences for 72 out of 97 BEL statements in the test set. BELTracker achieved a precision of 0.392, 0.532 and 0.615 when evaluated with three criteria, namely full, relaxed and context criteria, respectively, by the task organizers. Our team at Mayo Clinic was the only participant in this task. BELTracker is available as a RESTful API and is available for public use. Database URL:http://www.openbionlp.org:8080/BelTracker/finder/Given_BEL_Statement
Collapse
Affiliation(s)
- Majid Rastegar-Mojarad
- Department of Health Sciences Research, Mayo Clinic, USA University of Wisconsin-Milwaukee, Milwaukee, WI, USA
| | | | - Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, USA
| |
Collapse
|