1
|
Cai L, Li J, Lv H, Liu W, Niu H, Wang Z. Integrating domain knowledge for biomedical text analysis into deep learning: A survey. J Biomed Inform 2023; 143:104418. [PMID: 37290540 DOI: 10.1016/j.jbi.2023.104418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2022] [Revised: 04/24/2023] [Accepted: 05/31/2023] [Indexed: 06/10/2023]
Abstract
The past decade has witnessed an explosion of textual information in the biomedical field. Biomedical texts provide a basis for healthcare delivery, knowledge discovery, and decision-making. Over the same period, deep learning has achieved remarkable performance in biomedical natural language processing, however, its development has been limited by well-annotated datasets and interpretability. To solve this, researchers have considered combining domain knowledge (such as biomedical knowledge graph) with biomedical data, which has become a promising means of introducing more information into biomedical datasets and following evidence-based medicine. This paper comprehensively reviews more than 150 recent literature studies on incorporating domain knowledge into deep learning models to facilitate typical biomedical text analysis tasks, including information extraction, text classification, and text generation. We eventually discuss various challenges and future directions.
Collapse
Affiliation(s)
- Linkun Cai
- School of Biological Science and Medical Engineering, Beihang University, 100191 Beijing, China
| | - Jia Li
- Department of Radiology, Beijing Friendship Hospital, Capital Medical University, 100050 Beijing, China
| | - Han Lv
- Department of Radiology, Beijing Friendship Hospital, Capital Medical University, 100050 Beijing, China
| | - Wenjuan Liu
- Aerospace Center Hospital, 100049 Beijing, China
| | - Haijun Niu
- School of Biological Science and Medical Engineering, Beihang University, 100191 Beijing, China
| | - Zhenchang Wang
- School of Biological Science and Medical Engineering, Beihang University, 100191 Beijing, China; Department of Radiology, Beijing Friendship Hospital, Capital Medical University, 100050 Beijing, China.
| |
Collapse
|
2
|
Badr H, Wanas N, Fayek M. Unsupervised domain adaptation with post-adaptation labeled domain performance preservation. MACHINE LEARNING WITH APPLICATIONS 2022. [DOI: 10.1016/j.mlwa.2022.100439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
|
3
|
Mahbub M, Srinivasan S, Begoli E, Peterson GD. BioADAPT-MRC: adversarial learning-based domain adaptation improves biomedical machine reading comprehension task. Bioinformatics 2022; 38:4369-4379. [PMID: 35876792 PMCID: PMC9477526 DOI: 10.1093/bioinformatics/btac508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2022] [Revised: 05/31/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Biomedical machine reading comprehension (biomedical-MRC) aims to comprehend complex biomedical narratives and assist healthcare professionals in retrieving information from them. The high performance of modern neural network-based MRC systems depends on high-quality, large-scale, human-annotated training datasets. In the biomedical domain, a crucial challenge in creating such datasets is the requirement for domain knowledge, inducing the scarcity of labeled data and the need for transfer learning from the labeled general-purpose (source) domain to the biomedical (target) domain. However, there is a discrepancy in marginal distributions between the general-purpose and biomedical domains due to the variances in topics. Therefore, direct-transferring of learned representations from a model trained on a general-purpose domain to the biomedical domain can hurt the model's performance. RESULTS We present an adversarial learning-based domain adaptation framework for the biomedical machine reading comprehension task (BioADAPT-MRC), a neural network-based method to address the discrepancies in the marginal distributions between the general and biomedical domain datasets. BioADAPT-MRC relaxes the need for generating pseudo labels for training a well-performing biomedical-MRC model. We extensively evaluate the performance of BioADAPT-MRC by comparing it with the best existing methods on three widely used benchmark biomedical-MRC datasets-BioASQ-7b, BioASQ-8b and BioASQ-9b. Our results suggest that without using any synthetic or human-annotated data from the biomedical domain, BioADAPT-MRC can achieve state-of-the-art performance on these datasets. AVAILABILITY AND IMPLEMENTATION BioADAPT-MRC is freely available as an open-source project at https://github.com/mmahbub/BioADAPT-MRC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Sudarshan Srinivasan
- Cyber Resilience and Intelligence Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
| | - Edmon Begoli
- Cyber Resilience and Intelligence Division, Oak Ridge National Laboratory, Oak Ridge, TN 37830, USA
| | - Gregory D Peterson
- Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, TN 37996, USA
| |
Collapse
|
4
|
Liu Y, Li R, Luo J, Zhang Z. Inferring RNA-binding protein target preferences using adversarial domain adaptation. PLoS Comput Biol 2022; 18:e1009863. [PMID: 35202389 PMCID: PMC8870515 DOI: 10.1371/journal.pcbi.1009863] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2021] [Accepted: 01/25/2022] [Indexed: 11/18/2022] Open
Abstract
Precise identification of target sites of RNA-binding proteins (RBP) is important to understand their biochemical and cellular functions. A large amount of experimental data is generated by in vivo and in vitro approaches. The binding preferences determined from these platforms share similar patterns but there are discernable differences between these datasets. Computational methods trained on one dataset do not always work well on another dataset. To address this problem which resembles the classic "domain shift" in deep learning, we adopted the adversarial domain adaptation (ADDA) technique and developed a framework (RBP-ADDA) that can extract RBP binding preferences from an integration of in vivo and vitro datasets. Compared with conventional methods, ADDA has the advantage of working with two input datasets, as it trains the initial neural network for each dataset individually, projects the two datasets onto a feature space, and uses an adversarial framework to derive an optimal network that achieves an optimal discriminative predictive power. In the first step, for each RBP, we include only the in vitro data to pre-train a source network and a task predictor. Next, for the same RBP, we initiate the target network by using the source network and use adversarial domain adaptation to update the target network using both in vitro and in vivo data. These two steps help leverage the in vitro data to improve the prediction on in vivo data, which is typically challenging with a lower signal-to-noise ratio. Finally, to further take the advantage of the fused source and target data, we fine-tune the task predictor using both data. We showed that RBP-ADDA achieved better performance in modeling in vivo RBP binding data than other existing methods as judged by Pearson correlations. It also improved predictive performance on in vitro datasets. We further applied augmentation operations on RBPs with less in vivo data to expand the input data and showed that it can improve prediction performances. Lastly, we explored the predictive interpretability of RBP-ADDA, where we quantified the contribution of the input features by Integrated Gradients and identified nucleotide positions that are important for RBP recognition.
Collapse
Affiliation(s)
- Ying Liu
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Ruihui Li
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
| | - Jiawei Luo
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
| | - Zhaolei Zhang
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| |
Collapse
|
5
|
Pourreza Shahri M, Kahanda I. Deep semi-supervised learning ensemble framework for classifying co-mentions of human proteins and phenotypes. BMC Bioinformatics 2021; 22:500. [PMID: 34656098 PMCID: PMC8520253 DOI: 10.1186/s12859-021-04421-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2021] [Accepted: 10/04/2021] [Indexed: 11/13/2022] Open
Abstract
Background Identifying human protein-phenotype relationships has attracted researchers in bioinformatics and biomedical natural language processing due to its importance in uncovering rare and complex diseases. Since experimental validation of protein-phenotype associations is prohibitive, automated tools capable of accurately extracting these associations from the biomedical text are in high demand. However, while the manual annotation of protein-phenotype co-mentions required for training such models is highly resource-consuming, extracting millions of unlabeled co-mentions is straightforward. Results In this study, we propose a novel deep semi-supervised ensemble framework that combines deep neural networks, semi-supervised, and ensemble learning for classifying human protein-phenotype co-mentions with the help of unlabeled data. This framework allows the ability to incorporate an extensive collection of unlabeled sentence-level co-mentions of human proteins and phenotypes with a small labeled dataset to enhance overall performance. We develop PPPredSS, a prototype of our proposed semi-supervised framework that combines sophisticated language models, convolutional networks, and recurrent networks. Our experimental results demonstrate that the proposed approach provides a new state-of-the-art performance in classifying human protein-phenotype co-mentions by outperforming other supervised and semi-supervised counterparts. Furthermore, we highlight the utility of PPPredSS in powering a curation assistant system through case studies involving a group of biologists. Conclusions This article presents a novel approach for human protein-phenotype co-mention classification based on deep, semi-supervised, and ensemble learning. The insights and findings from this work have implications for biomedical researchers, biocurators, and the text mining community working on biomedical relationship extraction.
Collapse
Affiliation(s)
| | - Indika Kahanda
- School of Computing, University of North Florida, Jacksonville, USA.
| |
Collapse
|
6
|
Dai S, Ding Y, Zhang Z, Zuo W, Huang X, Zhu S. GrantExtractor: Accurate Grant Support Information Extraction from Biomedical Fulltext Based on Bi-LSTM-CRF. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:205-215. [PMID: 31494556 DOI: 10.1109/tcbb.2019.2939128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Grant support (GS) in the MEDLINE database refers to funding agencies and contract numbers. It is important for funding organizations to track their funding outcomes from the GS information. As such, how to accurately and automatically extract funding information from biomedical literature is challenging. In this paper, we present a pipeline system called GrantExtractor that is able to accurately extract GS information from fulltext biomedical literature. GrantExtractor effectively integrates several advanced machine learning techniques. In particular, we use a sentence classifier to identify funding sentences from articles first. A bi-directional LSTM and the CRF layer (BiLSTM-CRF), and pattern matching are then used to extract entities of grant numbers and agencies from these identified funding sentences. After removing noisy numbers by a multi-class model, we finally match each grant number with its corresponding agency. Experimental results on benchmark datasets have demonstrated that GrantExtractor clearly outperforms all baseline methods. It is further evident that GrantExtractor won the first place in Task 5C of 2017 BioASQ challenge, with achieving the Micro-recall of 0.9526 for 22,610 articles. Moreover, GrantExtractor has achieved the Micro F-measure score as high as 0.90 in extracting grant pairs.
Collapse
|
7
|
Wang Z, Yan B, Wu C, Wu B, Wang X, Zheng K. Graph Adaptation Network with Domain-Specific Word Alignment for Cross-Domain Relation Extraction. SENSORS 2020; 20:s20247180. [PMID: 33333844 PMCID: PMC7765263 DOI: 10.3390/s20247180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/06/2020] [Revised: 12/05/2020] [Accepted: 12/08/2020] [Indexed: 11/29/2022]
Abstract
Cross-domain relation extraction has become an essential approach when target domain lacking labeled data. Most existing works adapted relation extraction models from the source domain to target domain through aligning sequential features, but failed to transfer non-local and non-sequential features such as word co-occurrence which are also critical for cross-domain relation extraction. To address this issue, in this paper, we propose a novel tripartite graph architecture to adapt non-local features when there is no labeled data in the target domain. The graph uses domain words as nodes to model the co-occurrence relation between domain-specific words and domain-independent words. Through graph convolutions on the tripartite graph, the information of domain-specific words is propagated so that the word representation can be fine-tuned to align domain-specific features. In addition, unlike the traditional graph structure, the weights of edges innovatively combine fixed weight and dynamic weight, to capture the global non-local features and avoid introducing noise to word representation. Experiments on three domains of ACE2005 datasets show that our method outperforms the state-of-the-art models by a big margin.
Collapse
Affiliation(s)
- Zhe Wang
- School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China; (Z.W.); (C.W.); (B.W.); (K.Z.)
| | - Bo Yan
- School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
- Correspondence: ; Tel.: +86-176-1124-2518
| | - Chunhua Wu
- School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China; (Z.W.); (C.W.); (B.W.); (K.Z.)
| | - Bin Wu
- School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China; (Z.W.); (C.W.); (B.W.); (K.Z.)
| | - Xiujuan Wang
- School of Computer Science, Beijing University of Technology, Beijing 100124, China;
| | - Kangfeng Zheng
- School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China; (Z.W.); (C.W.); (B.W.); (K.Z.)
| |
Collapse
|
8
|
Noh J, Kavuluru R. Literature Retrieval for Precision Medicine with Neural Matching and Faceted Summarization. PROCEEDINGS OF THE CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING. CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING 2020; 2020:3389-3399. [PMID: 34541588 PMCID: PMC8444997 DOI: 10.18653/v1/2020.findings-emnlp.304] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/26/2022]
Abstract
Information retrieval (IR) for precision medicine (PM) often involves looking for multiple pieces of evidence that characterize a patient case. This typically includes at least the name of a condition and a genetic variation that applies to the patient. Other factors such as demographic attributes, comorbidities, and social determinants may also be pertinent. As such, the retrieval problem is often formulated as ad hoc search but with multiple facets (e.g., disease, mutation) that may need to be incorporated. In this paper, we present a document reranking approach that combines neural query-document matching and text summarization toward such retrieval scenarios. Our architecture builds on the basic BERT model with three specific components for reranking: (a). document-query matching (b). keyword extraction and (c). facet-conditioned abstractive summarization. The outcomes of (b) and (c) are used to essentially transform a candidate document into a concise summary that can be compared with the query at hand to compute a relevance score. Component (a) directly generates a matching score of a candidate document for a query. The full architecture benefits from the complementary potential of document-query matching and the novel document transformation approach based on summarization along PM facets. Evaluations using NIST's TREC-PM track datasets (2017-2019) show that our model achieves state-of-the-art performance. To foster reproducibility, our code is made available here: https://github.com/bionlproc/text-summ-for-doc-retrieval.
Collapse
Affiliation(s)
- Jiho Noh
- Department of Computer Science, University of Kentucky, Kentucky, USA
| | - Ramakanth Kavuluru
- Division of Biomedical Informatics, University of Kentucky, Kentucky, USA
| |
Collapse
|
9
|
Exploiting adversarial transfer learning for adverse drug reaction detection from texts. J Biomed Inform 2020; 106:103431. [DOI: 10.1016/j.jbi.2020.103431] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2019] [Revised: 04/13/2020] [Accepted: 04/20/2020] [Indexed: 11/18/2022]
|
10
|
Kilicoglu H, Rosemblat G, Fiszman M, Shin D. Broad-coverage biomedical relation extraction with SemRep. BMC Bioinformatics 2020; 21:188. [PMID: 32410573 PMCID: PMC7222583 DOI: 10.1186/s12859-020-3517-7] [Citation(s) in RCA: 38] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/18/2020] [Accepted: 04/29/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND In the era of information overload, natural language processing (NLP) techniques are increasingly needed to support advanced biomedical information management and discovery applications. In this paper, we present an in-depth description of SemRep, an NLP system that extracts semantic relations from PubMed abstracts using linguistic principles and UMLS domain knowledge. We also evaluate SemRep on two datasets. In one evaluation, we use a manually annotated test collection and perform a comprehensive error analysis. In another evaluation, we assess SemRep's performance on the CDR dataset, a standard benchmark corpus annotated with causal chemical-disease relationships. RESULTS A strict evaluation of SemRep on our manually annotated dataset yields 0.55 precision, 0.34 recall, and 0.42 F 1 score. A relaxed evaluation, which more accurately characterizes SemRep performance, yields 0.69 precision, 0.42 recall, and 0.52 F 1 score. An error analysis reveals named entity recognition/normalization as the largest source of errors (26.9%), followed by argument identification (14%) and trigger detection errors (12.5%). The evaluation on the CDR corpus yields 0.90 precision, 0.24 recall, and 0.38 F 1 score. The recall and the F 1 score increase to 0.35 and 0.50, respectively, when the evaluation on this corpus is limited to sentence-bound relationships, which represents a fairer evaluation, as SemRep operates at the sentence level. CONCLUSIONS SemRep is a broad-coverage, interpretable, strong baseline system for extracting semantic relations from biomedical text. It also underpins SemMedDB, a literature-scale knowledge graph based on semantic relations. Through SemMedDB, SemRep has had significant impact in the scientific community, supporting a variety of clinical and translational applications, including clinical decision making, medical diagnosis, drug repurposing, literature-based discovery and hypothesis generation, and contributing to improved health outcomes. In ongoing development, we are redesigning SemRep to increase its modularity and flexibility, and addressing weaknesses identified in the error analysis.
Collapse
Affiliation(s)
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
- University of Illinois at Urbana-Champaign, School of Information Sciences, 501 E Daniel Street, Champaign, 61820 IL USA
| | - Graciela Rosemblat
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
| | | | - Dongwook Shin
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, 8600 Rockville Pike, Bethesda, 20894 MD USA
| |
Collapse
|
11
|
Xie L, He S, Zhang Z, Lin K, Bo X, Yang S, Feng B, Wan K, Yang K, Yang J, Ding Y. Domain-adversarial multi-task framework for novel therapeutic property prediction of compounds. Bioinformatics 2020; 36:2848-2855. [PMID: 31999334 DOI: 10.1093/bioinformatics/btaa063] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2019] [Revised: 12/23/2019] [Accepted: 01/23/2020] [Indexed: 12/11/2022] Open
Abstract
MOTIVATION With the rapid development of high-throughput technologies, parallel acquisition of large-scale drug-informatics data provides significant opportunities to improve pharmaceutical research and development. One important application is the purpose prediction of small-molecule compounds with the objective of specifying the therapeutic properties of extensive purpose-unknown compounds and repurposing the novel therapeutic properties of FDA-approved drugs. Such a problem is extremely challenging because compound attributes include heterogeneous data with various feature patterns, such as drug fingerprints, drug physicochemical properties and drug perturbation gene expressions. Moreover, there is a complex non-linear dependency among heterogeneous data. In this study, we propose a novel domain-adversarial multi-task framework for integrating shared knowledge from multiple domains. The framework first uses an adversarial strategy to learn target representations and then models non-linear dependency among several domains. RESULTS Experiments on two real-world datasets illustrate that our approach achieves an obvious improvement over competitive baselines. The novel therapeutic properties of purpose-unknown compounds that we predicted have been widely reported or brought to clinics. Furthermore, our framework can integrate various attributes beyond the three domains examined herein and can be applied in industry for screening significant numbers of small-molecule drug candidates. AVAILABILITY AND IMPLEMENTATION The source code and datasets are available at https://github.com/JohnnyY8/DAMT-Model. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Lingwei Xie
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Song He
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China
| | - Zhongnan Zhang
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Kunhui Lin
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Xiaochen Bo
- Department of Biotechnology, Beijing Institute of Radiation Medicine, Beijing 100850, China
| | - Shu Yang
- Department of Computer Science, UC Santa Barbara, Santa Barbara, CA 93106, USA
| | - Boyuan Feng
- Department of Computer Science, UC Santa Barbara, Santa Barbara, CA 93106, USA
| | - Kun Wan
- Department of Computer Science, UC Santa Barbara, Santa Barbara, CA 93106, USA
| | - Kang Yang
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Jie Yang
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Yufei Ding
- Department of Computer Science, UC Santa Barbara, Santa Barbara, CA 93106, USA
| |
Collapse
|
12
|
Zhang Y, Lin H, Yang Z, Wang J, Sun Y. Chemical-protein interaction extraction via contextualized word representations and multihead attention. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2020; 2019:5498050. [PMID: 31125403 PMCID: PMC6534182 DOI: 10.1093/database/baz054] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/12/2018] [Revised: 03/16/2019] [Accepted: 04/02/2019] [Indexed: 12/17/2022]
Abstract
A rich source of chemical–protein interactions (CPIs) is locked in the exponentially growing biomedical literature. Automatic extraction of CPIs is a crucial task in biomedical natural language processing (NLP), which has great benefits for pharmacological and clinical research. Deep context representation and multihead attention are recent developments in deep learning and have shown their potential in some NLP tasks. Unlike traditional word embedding, deep context representation has the ability to generate comprehensive sentence representation based on the sentence context. The multihead attention mechanism can effectively learn the important features from different heads and emphasize the relatively important features. Integrating deep context representation and multihead attention with a neural network-based model may improve CPI extraction. We present a deep neural model for CPI extraction based on deep context representation and multihead attention. Our model mainly consists of the following three parts: a deep context representation layer, a bidirectional long short-term memory networks (Bi-LSTMs) layer and a multihead attention layer. The deep context representation is employed to provide more comprehensive feature input for Bi-LSTMs. The multihead attention can effectively emphasize the important part of the Bi-LSTMs output. We evaluated our method on the public ChemProt corpus. These experimental results show that both deep context representation and multihead attention are helpful in CPI extraction. Our method can compete with other state-of-the-art methods on ChemProt corpus.
Collapse
Affiliation(s)
- Yijia Zhang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Hongfei Lin
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Zhihao Yang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Jian Wang
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Yuanyuan Sun
- College of Computer Science and Technology, Dalian University of Technology, Dalian, China
| |
Collapse
|
13
|
Junge A, Jensen LJ. CoCoScore: context-aware co-occurrence scoring for text mining applications using distant supervision. Bioinformatics 2020; 36:264-271. [PMID: 31199464 PMCID: PMC6956794 DOI: 10.1093/bioinformatics/btz490] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2018] [Revised: 05/30/2019] [Accepted: 06/10/2019] [Indexed: 12/18/2022] Open
Abstract
MOTIVATION Information extraction by mining the scientific literature is key to uncovering relations between biomedical entities. Most existing approaches based on natural language processing extract relations from single sentence-level co-mentions, ignoring co-occurrence statistics over the whole corpus. Existing approaches counting entity co-occurrences ignore the textual context of each co-occurrence. RESULTS We propose a novel corpus-wide co-occurrence scoring approach to relation extraction that takes the textual context of each co-mention into account. Our method, called CoCoScore, scores the certainty of stating an association for each sentence that co-mentions two entities. CoCoScore is trained using distant supervision based on a gold-standard set of associations between entities of interest. Instead of requiring a manually annotated training corpus, co-mentions are labeled as positives/negatives according to their presence/absence in the gold standard. We show that CoCoScore outperforms previous approaches in identifying human disease-gene and tissue-gene associations as well as in identifying physical and functional protein-protein associations in different species. CoCoScore is a versatile text mining tool to uncover pairwise associations via co-occurrence mining, within and beyond biomedical applications. AVAILABILITY AND IMPLEMENTATION CoCoScore is available at: https://github.com/JungeAlexander/cocoscore. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Alexander Junge
- Disease Systems Biology Program, Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen N 2200, Denmark
| | - Lars Juhl Jensen
- Disease Systems Biology Program, Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen N 2200, Denmark
| |
Collapse
|
14
|
Neural network-based approaches for biomedical relation classification: A review. J Biomed Inform 2019; 99:103294. [DOI: 10.1016/j.jbi.2019.103294] [Citation(s) in RCA: 34] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2019] [Revised: 06/02/2019] [Accepted: 09/21/2019] [Indexed: 12/14/2022]
|
15
|
Rios A, Durbin EB, Hands I, Arnold SM, Shah D, Schwartz SM, Goulart BHL, Kavuluru R. Cross-registry neural domain adaptation to extract mutational test results from pathology reports. J Biomed Inform 2019; 97:103267. [PMID: 31401235 PMCID: PMC6736690 DOI: 10.1016/j.jbi.2019.103267] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2019] [Revised: 07/30/2019] [Accepted: 08/05/2019] [Indexed: 10/26/2022]
Abstract
OBJECTIVE We study the performance of machine learning (ML) methods, including neural networks (NNs), to extract mutational test results from pathology reports collected by cancer registries. Given the lack of hand-labeled datasets for mutational test result extraction, we focus on the particular use-case of extracting Epidermal Growth Factor Receptor mutation results in non-small cell lung cancers. We explore the generalization of NNs across different registries where our goals are twofold: (1) to assess how well models trained on a registry's data port to test data from a different registry and (2) to assess whether and to what extent such models can be improved using state-of-the-art neural domain adaptation techniques under different assumptions about what is available (labeled vs unlabeled data) at the target registry site. MATERIALS AND METHODS We collected data from two registries: the Kentucky Cancer Registry (KCR) and the Fred Hutchinson Cancer Research Center (FH) Cancer Surveillance System. We combine NNs with adversarial domain adaptation to improve cross-registry performance. We compare to other classifiers in the standard supervised classification, unsupervised domain adaptation, and supervised domain adaptation scenarios. RESULTS The performance of ML methods varied between registries. To extract positive results, the basic convolutional neural network (CNN) had an F1 of 71.5% on the KCR dataset and 95.7% on the FH dataset. For the KCR dataset, the CNN F1 results were low when trained on FH data (Positive F1: 23%). Using our proposed adversarial CNN, without any labeled data, we match the F1 of the models trained directly on each target registry's data. The adversarial CNN F1 improved when trained on FH and applied to KCR dataset (Positive F1: 70.8%). We found similar performance improvements when we trained on KCR and tested on FH reports (Positive F1: 45% to 96%). CONCLUSION Adversarial domain adaptation improves the performance of NNs applied to pathology reports. In the unsupervised domain adaptation setting, we match the performance of models that are trained directly on target registry's data by using source registry's labeled data and unlabeled examples from the target registry.
Collapse
Affiliation(s)
- Anthony Rios
- Department of Information Systems and Cyber Security, University of Texas at San Antonio, USA
| | - Eric B Durbin
- Division of Biomedical Informatics, Dept. of Internal Medicine, University of Kentucky, USA; Kentucky Cancer Registry, Lexington, KY, USA
| | - Isaac Hands
- Kentucky Cancer Registry, Lexington, KY, USA
| | - Susanne M Arnold
- Markey Cancer Center, University of Kentucky, Lexington, KY, USA
| | - Darshil Shah
- Ironwood Cancer and Research Centers, Avondale, AZ, USA
| | | | | | - Ramakanth Kavuluru
- Division of Biomedical Informatics, Dept. of Internal Medicine, University of Kentucky, USA; Computer Science Department, University of Kentucky, USA.
| |
Collapse
|
16
|
Zhang Y, Lu Z. Exploring semi-supervised variational autoencoders for biomedical relation extraction. Methods 2019; 166:112-119. [PMID: 30822516 PMCID: PMC6708455 DOI: 10.1016/j.ymeth.2019.02.021] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2018] [Revised: 01/28/2019] [Accepted: 02/25/2019] [Indexed: 10/27/2022] Open
Abstract
The biomedical literature provides a rich source of knowledge such as protein-protein interactions (PPIs), drug-drug interactions (DDIs) and chemical-protein interactions (CPIs). Biomedical relation extraction aims to automatically extract biomedical relations from biomedical text for various biomedical research. State-of-the-art methods for biomedical relation extraction are primarily based on supervised machine learning and therefore depend on (sufficient) labeled data. However, creating large sets of training data is prohibitively expensive and labor-intensive, especially so in biomedicine as domain knowledge is required. In contrast, there is a large amount of unlabeled biomedical text available in PubMed. Hence, computational methods capable of employing unlabeled data to reduce the burden of manual annotation are of particular interest in biomedical relation extraction. We present a novel semi-supervised approach based on variational autoencoder (VAE) for biomedical relation extraction. Our model consists of the following three parts, a classifier, an encoder and a decoder. The classifier is implemented using multi-layer convolutional neural networks (CNNs), and the encoder and decoder are implemented using both bidirectional long short-term memory networks (Bi-LSTMs) and CNNs, respectively. The semi-supervised mechanism allows our model to learn features from both the labeled and unlabeled data. We evaluate our method on multiple public PPI, DDI and CPI corpora. Experimental results show that our method effectively exploits the unlabeled data to improve the performance and reduce the dependence on labeled data. To our best knowledge, this is the first semi-supervised VAE-based method for (biomedical) relation extraction. Our results suggest that exploiting such unlabeled data can be greatly beneficial to improved performance in various biomedical relation extraction, especially when only limited labeled data (e.g. 2000 samples or less) is available in such tasks.
Collapse
Affiliation(s)
- Yijia Zhang
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA; School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA.
| |
Collapse
|
17
|
Li F, Yu H. An investigation of single-domain and multidomain medication and adverse drug event relation extraction from electronic health record notes using advanced deep learning models. J Am Med Inform Assoc 2019; 26:646-654. [PMID: 30938761 PMCID: PMC6562161 DOI: 10.1093/jamia/ocz018] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2018] [Revised: 01/15/2019] [Accepted: 01/31/2019] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE We aim to evaluate the effectiveness of advanced deep learning models (eg, capsule network [CapNet], adversarial training [ADV]) for single-domain and multidomain relation extraction from electronic health record (EHR) notes. MATERIALS AND METHODS We built multiple deep learning models with increased complexity, namely a multilayer perceptron (MLP) model and a CapNet model for single-domain relation extraction and fully shared (FS), shared-private (SP), and adversarial training (ADV) modes for multidomain relation extraction. Our models were evaluated in 2 ways: first, we compared our models using our expert-annotated cancer (the MADE1.0 corpus) and cardio corpora; second, we compared our models with the systems in the MADE1.0 and i2b2 challenges. RESULTS Multidomain models outperform single-domain models by 0.7%-1.4% in F1 (t test P < .05), but the results of FS, SP, and ADV modes are mixed. Our results show that the MLP model generally outperforms the CapNet model by 0.1%-1.0% in F1. In the comparisons with other systems, the CapNet model achieves the state-of-the-art result (87.2% in F1) in the cancer corpus and the MLP model generally outperforms MedEx in the cancer, cardiovascular diseases, and i2b2 corpora. CONCLUSIONS Our MLP or CapNet model generally outperforms other state-of-the-art systems in medication and adverse drug event relation extraction. Multidomain models perform better than single-domain models. However, neither the SP nor the ADV mode can always outperform the FS mode significantly. Moreover, the CapNet model is not superior to the MLP model for our corpora.
Collapse
Affiliation(s)
- Fei Li
- Department of Computer Science, University of Massachusetts Lowell, Lowell, Massachusetts, USA
- Center for Healthcare Organization and Implementation Research, Bedford Veterans Affairs Medical Center, Bedford, Massachusetts, USA
| | - Hong Yu
- Department of Computer Science, University of Massachusetts Lowell, Lowell, Massachusetts, USA
- Center for Healthcare Organization and Implementation Research, Bedford Veterans Affairs Medical Center, Bedford, Massachusetts, USA
| |
Collapse
|
18
|
Zhou H, Liu Z, Ning S, Lang C, Lin Y, Du L. Knowledge-aware attention network for protein-protein interaction extraction. J Biomed Inform 2019; 96:103234. [PMID: 31202937 DOI: 10.1016/j.jbi.2019.103234] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/04/2018] [Revised: 06/06/2019] [Accepted: 06/13/2019] [Indexed: 11/19/2022]
Abstract
Protein-protein interaction (PPI) extraction from published scientific literature provides additional support for precision medicine efforts. However, many of the current PPI extraction methods need extensive feature engineering and cannot make full use of the prior knowledge in knowledge bases (KBs). KBs contain huge amounts of structured information about entities and relationships, therefore play a pivotal role in PPI extraction. This paper proposes a knowledge-aware attention network (KAN) to fuse prior knowledge about protein-protein pairs and context information for PPI extraction. The proposed model first adopts a diagonal-disabled multi-head attention mechanism to encode context sequence along with knowledge representations learned from KBs. Then a novel multi-dimensional attention mechanism is used to select the features that can best describe the encoded context. Experiment results on the BioCreative VI PPI dataset show that the proposed approach could acquire knowledge-aware dependencies between different words in a sequence and lead to a new state-of-the-art performance.
Collapse
Affiliation(s)
- Huiwei Zhou
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
| | - Zhuang Liu
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
| | - Shixian Ning
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
| | - Chengkun Lang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, Liaoning, China.
| | - Yingyu Lin
- School of Foreign Languages, Dalian University of Technology, Dalian 116024, Liaoning, China.
| | - Lei Du
- School of Mathematical Sciences, Dalian University of Technology, Dalian 116024, Liaoning, China.
| |
Collapse
|