1
|
Huang MS, Han JC, Lin PY, You YT, Tsai RTH, Hsu WL. Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource. Brief Bioinform 2024; 25:bbae132. [PMID: 38609331 PMCID: PMC11014787 DOI: 10.1093/bib/bbae132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2023] [Revised: 11/06/2023] [Accepted: 03/02/2023] [Indexed: 04/14/2024] Open
Abstract
Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models' performances on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.
Collapse
Affiliation(s)
- Ming-Siang Huang
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- National Institute of Cancer Research, National Health Research Institutes, Tainan, Taiwan
- Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
| | - Jen-Chieh Han
- Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
| | - Pei-Yen Lin
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
| | - Yu-Ting You
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
| | - Richard Tzong-Han Tsai
- Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Center for Geographic Information Science, Research Center for Humanities and Social Sciences, Academia Sinica, Taipei, Taiwan
| | - Wen-Lian Hsu
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
| |
Collapse
|
2
|
Li Z, Wei Q, Huang LC, Li J, Hu Y, Chuang YS, He J, Das A, Keloth VK, Yang Y, Diala CS, Roberts KE, Tao C, Jiang X, Zheng WJ, Xu H. Ensemble pretrained language models to extract biomedical knowledge from literature. J Am Med Inform Assoc 2024:ocae061. [PMID: 38520725 DOI: 10.1093/jamia/ocae061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2023] [Revised: 02/14/2024] [Accepted: 03/12/2024] [Indexed: 03/25/2024] Open
Abstract
OBJECTIVES The rapid expansion of biomedical literature necessitates automated techniques to discern relationships between biomedical concepts from extensive free text. Such techniques facilitate the development of detailed knowledge bases and highlight research deficiencies. The LitCoin Natural Language Processing (NLP) challenge, organized by the National Center for Advancing Translational Science, aims to evaluate such potential and provides a manually annotated corpus for methodology development and benchmarking. MATERIALS AND METHODS For the named entity recognition (NER) task, we utilized ensemble learning to merge predictions from three domain-specific models, namely BioBERT, PubMedBERT, and BioM-ELECTRA, devised a rule-driven detection method for cell line and taxonomy names and annotated 70 more abstracts as additional corpus. We further finetuned the T0pp model, with 11 billion parameters, to boost the performance on relation extraction and leveraged entites' location information (eg, title, background) to enhance novelty prediction performance in relation extraction (RE). RESULTS Our pioneering NLP system designed for this challenge secured first place in Phase I-NER and second place in Phase II-relation extraction and novelty prediction, outpacing over 200 teams. We tested OpenAI ChatGPT 3.5 and ChatGPT 4 in a Zero-Shot setting using the same test set, revealing that our finetuned model considerably surpasses these broad-spectrum large language models. DISCUSSION AND CONCLUSION Our outcomes depict a robust NLP system excelling in NER and RE across various biomedical entities, emphasizing that task-specific models remain superior to generic large ones. Such insights are valuable for endeavors like knowledge graph development and hypothesis formulation in biomedical research.
Collapse
Affiliation(s)
- Zhao Li
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Qiang Wei
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Liang-Chin Huang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Jianfu Li
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Yan Hu
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Yao-Shun Chuang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Jianping He
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Avisha Das
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Vipina Kuttichi Keloth
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States
| | - Yuntao Yang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Chiamaka S Diala
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Kirk E Roberts
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Cui Tao
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Xiaoqian Jiang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - W Jim Zheng
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Hua Xu
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States
| |
Collapse
|
3
|
Modersohn L, Hahn U. Influence of Context in Transformer-Based Medication Relation Extraction. Stud Health Technol Inform 2024; 310:669-673. [PMID: 38269893 DOI: 10.3233/shti231049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2024]
Abstract
The extraction of medication information from unstructured clinical documents has been a major application of clinical NLP in the past decade as evidenced by the conduct of two shared tasks under the I2B2 and N2C2 umbrella. We here propose a new methodological approach which has already shown a tremendous potential for increasing system performance for general NLP tasks, but has so far not been applied to medication extraction from EHR data, namely deep learning based on transformer models. We ran experiments on established clinical data sets for English (exploiting I2B2 and N2C2 corpora) and German (based on the 3000PA corpus, a German reference data set). Our results reveal that transformer models are on a par with current state-of-the-art results for English, but yield new ones for German data. We further address the influence of context on the overall performance of transformer-based medication relation extraction.
Collapse
Affiliation(s)
- Luise Modersohn
- Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany
- AIIM, TU München, München, Germany
| | - Udo Hahn
- Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany
| |
Collapse
|
4
|
Sugimoto K, Wada S, Konishi S, Okada K, Manabe S, Matsumura Y, Takeda T. Extracting Clinical Information From Japanese Radiology Reports Using a 2-Stage Deep Learning Approach: Algorithm Development and Validation. JMIR Med Inform 2023; 11:e49041. [PMID: 37991979 PMCID: PMC10686535 DOI: 10.2196/49041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2023] [Revised: 09/25/2023] [Accepted: 10/03/2023] [Indexed: 11/24/2023] Open
Abstract
Background Radiology reports are usually written in a free-text format, which makes it challenging to reuse the reports. Objective For secondary use, we developed a 2-stage deep learning system for extracting clinical information and converting it into a structured format. Methods Our system mainly consists of 2 deep learning modules: entity extraction and relation extraction. For each module, state-of-the-art deep learning models were applied. We trained and evaluated the models using 1040 in-house Japanese computed tomography (CT) reports annotated by medical experts. We also evaluated the performance of the entire pipeline of our system. In addition, the ratio of annotated entities in the reports was measured to validate the coverage of the clinical information with our information model. Results The microaveraged F1-scores of our best-performing model for entity extraction and relation extraction were 96.1% and 97.4%, respectively. The microaveraged F1-score of the 2-stage system, which is a measure of the performance of the entire pipeline of our system, was 91.9%. Our system showed encouraging results for the conversion of free-text radiology reports into a structured format. The coverage of clinical information in the reports was 96.2% (6595/6853). Conclusions Our 2-stage deep system can extract clinical information from chest and abdomen CT reports accurately and comprehensively.
Collapse
Affiliation(s)
- Kento Sugimoto
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
| | - Shoya Wada
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
- Department of Transformative System for Medical Information, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
| | - Shozo Konishi
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
| | - Katsuki Okada
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
| | - Shirou Manabe
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
- Department of Transformative System for Medical Information, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
| | - Yasushi Matsumura
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
- National Hospital Organization Osaka National Hospital, Osaka, Japan
| | - Toshihiro Takeda
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
| |
Collapse
|
5
|
Fang Z, Cong Y, Chai Y, Gao C, Chen X, Qiu J. Conserving Semantic Unit Information and Simplifying Syntactic Constituents to Improve Implicit Discourse Relation Recognition. Entropy (Basel) 2023; 25:1294. [PMID: 37761593 PMCID: PMC10529030 DOI: 10.3390/e25091294] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/18/2023] [Revised: 08/22/2023] [Accepted: 08/28/2023] [Indexed: 09/29/2023]
Abstract
Implicit discourse relation recognition (IDRR) has long been considered a challenging problem in shallow discourse parsing. The absence of connectives makes such relations implicit and requires much more effort to understand the semantics of the text. Thus, it is important to preserve the semantic completeness before any attempt to predict the discourse relation. However, word level embedding, widely used in existing works, may lead to a loss of semantics by splitting some phrases that should be treated as complete semantic units. In this article, we proposed three methods to segment a sentence into complete semantic units: a corpus-based method to serve as the baseline, a constituent parsing tree-based method, and a dependency parsing tree-based method to provide a more flexible and automatic way to divide the sentence. The segmented sentence will then be embedded at the level of semantic units so the embeddings could be fed into the IDRR networks and play the same role as word embeddings. We implemented our methods into one of the recent IDRR models to compare the performance with the original version using word level embeddings. Results show that proper embedding level better conserves the semantic information in the sentence and helps to enhance the performance of IDRR models.
Collapse
Affiliation(s)
| | | | | | | | | | - Jing Qiu
- The Cyberspace Institute of Advanced Technology, Guangzhou University, Guangzhou 510006, China; (Z.F.); (Y.C.); (Y.C.); (C.G.); (X.C.)
| |
Collapse
|
6
|
Peng C, Yang X, Yu Z, Bian J, Hogan WR, Wu Y. Clinical concept and relation extraction using prompt-based machine reading comprehension. J Am Med Inform Assoc 2023; 30:1486-1493. [PMID: 37316988 PMCID: PMC10436141 DOI: 10.1093/jamia/ocad107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/14/2023] [Revised: 05/08/2023] [Accepted: 06/05/2023] [Indexed: 06/16/2023] Open
Abstract
OBJECTIVE To develop a natural language processing system that solves both clinical concept extraction and relation extraction in a unified prompt-based machine reading comprehension (MRC) architecture with good generalizability for cross-institution applications. METHODS We formulate both clinical concept extraction and relation extraction using a unified prompt-based MRC architecture and explore state-of-the-art transformer models. We compare our MRC models with existing deep learning models for concept extraction and end-to-end relation extraction using 2 benchmark datasets developed by the 2018 National NLP Clinical Challenges (n2c2) challenge (medications and adverse drug events) and the 2022 n2c2 challenge (relations of social determinants of health [SDoH]). We also evaluate the transfer learning ability of the proposed MRC models in a cross-institution setting. We perform error analyses and examine how different prompting strategies affect the performance of MRC models. RESULTS AND CONCLUSION The proposed MRC models achieve state-of-the-art performance for clinical concept and relation extraction on the 2 benchmark datasets, outperforming previous non-MRC transformer models. GatorTron-MRC achieves the best strict and lenient F1-scores for concept extraction, outperforming previous deep learning models on the 2 datasets by 1%-3% and 0.7%-1.3%, respectively. For end-to-end relation extraction, GatorTron-MRC and BERT-MIMIC-MRC achieve the best F1-scores, outperforming previous deep learning models by 0.9%-2.4% and 10%-11%, respectively. For cross-institution evaluation, GatorTron-MRC outperforms traditional GatorTron by 6.4% and 16% for the 2 datasets, respectively. The proposed method is better at handling nested/overlapped concepts, extracting relations, and has good portability for cross-institute applications. Our clinical MRC package is publicly available at https://github.com/uf-hobi-informatics-lab/ClinicalTransformerMRC.
Collapse
Affiliation(s)
- Cheng Peng
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Xi Yang
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, Florida, USA
| | - Zehao Yu
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, Florida, USA
| | - William R Hogan
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Yonghui Wu
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, Florida, USA
| |
Collapse
|
7
|
Wei S, Liang Y, Li X, Weng X, Fu J, Han X. Chinese Few-Shot Named Entity Recognition and Knowledge Graph Construction in Managed Pressure Drilling Domain. Entropy (Basel) 2023; 25:1097. [PMID: 37510044 PMCID: PMC10378751 DOI: 10.3390/e25071097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/29/2023] [Revised: 07/14/2023] [Accepted: 07/20/2023] [Indexed: 07/30/2023]
Abstract
Managed pressure drilling (MPD) is the most effective means to ensure drilling safety, and MPD is able to avoid further deterioration of complex working conditions through precise control of the wellhead back pressure. The key to the success of MPD is the well control strategy, which currently relies heavily on manual experience, hindering the automation and intelligence process of well control. In response to this issue, an MPD knowledge graph is constructed in this paper that extracts knowledge from published papers and drilling reports to guide well control. In order to improve the performance of entity extraction in the knowledge graph, a few-shot Chinese entity recognition model CEntLM-KL is extended from the EntLM model, in which the KL entropy is built to improve the accuracy of entity recognition. Through experiments on benchmark datasets, it has been shown that the proposed model has a significant improvement compared to the state-of-the-art methods. On the few-shot drilling datasets, the F-1 score of entity recognition reaches 33%. Finally, the knowledge graph is stored in Neo4J and applied for knowledge inference.
Collapse
Affiliation(s)
- Siqing Wei
- Key Laboratory for Symbol Computation and Knowledge Engineering of National Education Ministry, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Yanchun Liang
- Key Laboratory for Symbol Computation and Knowledge Engineering of National Education Ministry, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Zhuhai Laboratory of Key Laboratory for Symbol Computation and Knowledge Engineering of Ministry of Education, Zhuhai College of Science and Technology, Zhuhai 519041, China
| | - Xiaoran Li
- Key Laboratory for Symbol Computation and Knowledge Engineering of National Education Ministry, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Xiaohui Weng
- School of Mechanical and Aerospace Engineering, Jilin University, Changchun 130012, China
| | - Jiasheng Fu
- CNPC Engineering Technology R&D Company Limited, National Engineering Research Center of Oil & Gas Drilling and Completion Technology, Beijing 102206, China
| | - Xiaosong Han
- Key Laboratory for Symbol Computation and Knowledge Engineering of National Education Ministry, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| |
Collapse
|
8
|
Zitu MM, Zhang S, Owen DH, Chiang C, Li L. Generalizability of machine learning methods in detecting adverse drug events from clinical narratives in electronic medical records. Front Pharmacol 2023; 14:1218679. [PMID: 37502211 PMCID: PMC10368879 DOI: 10.3389/fphar.2023.1218679] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2023] [Accepted: 06/26/2023] [Indexed: 07/29/2023] Open
Abstract
We assessed the generalizability of machine learning methods using natural language processing (NLP) techniques to detect adverse drug events (ADEs) from clinical narratives in electronic medical records (EMRs). We constructed a new corpus correlating drugs with adverse drug events using 1,394 clinical notes of 47 randomly selected patients who received immune checkpoint inhibitors (ICIs) from 2011 to 2018 at The Ohio State University James Cancer Hospital, annotating 189 drug-ADE relations in single sentences within the medical records. We also used data from Harvard's publicly available 2018 National Clinical Challenge (n2c2), which includes 505 discharge summaries with annotations of 1,355 single-sentence drug-ADE relations. We applied classical machine learning (support vector machine (SVM)), deep learning (convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM)), and state-of-the-art transformer-based (bidirectional encoder representations from transformers (BERT) and ClinicalBERT) methods trained and tested in the two different corpora and compared performance among them to detect drug-ADE relationships. ClinicalBERT detected drug-ADE relationships better than the other methods when trained using our dataset and tested in n2c2 (ClinicalBERT F-score, 0.78; other methods, F-scores, 0.61-0.73) and when trained using the n2c2 dataset and tested in ours (ClinicalBERT F-score, 0.74; other methods, F-scores, 0.55-0.72). Comparison among several machine learning methods demonstrated the superior performance and, therefore, the greatest generalizability of findings of ClinicalBERT for the detection of drug-ADE relations from clinical narratives in electronic medical records.
Collapse
Affiliation(s)
- Md Muntasir Zitu
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
| | - Shijun Zhang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
| | - Dwight H. Owen
- Department of Internal Medicine, College of Medicine, The Ohio State University, Columbus, OH, United States
| | - Chienwei Chiang
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
| | - Lang Li
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
| |
Collapse
|
9
|
Datta S, Roberts K. Weakly supervised spatial relation extraction from radiology reports. JAMIA Open 2023; 6:ooad027. [PMID: 37096148 PMCID: PMC10122604 DOI: 10.1093/jamiaopen/ooad027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/27/2022] [Revised: 03/16/2023] [Accepted: 04/04/2023] [Indexed: 04/26/2023] Open
Abstract
Objective Weak supervision holds significant promise to improve clinical natural language processing by leveraging domain resources and expertise instead of large manually annotated datasets alone. Here, our objective is to evaluate a weak supervision approach to extract spatial information from radiology reports. Materials and Methods Our weak supervision approach is based on data programming that uses rules (or labeling functions) relying on domain-specific dictionaries and radiology language characteristics to generate weak labels. The labels correspond to different spatial relations that are critical to understanding radiology reports. These weak labels are then used to fine-tune a pretrained Bidirectional Encoder Representations from Transformers (BERT) model. Results Our weakly supervised BERT model provided satisfactory results in extracting spatial relations without manual annotations for training (spatial trigger F1: 72.89, relation F1: 52.47). When this model is further fine-tuned on manual annotations (relation F1: 68.76), performance surpasses the fully supervised state-of-the-art. Discussion To our knowledge, this is the first work to automatically create detailed weak labels corresponding to radiological information of clinical significance. Our data programming approach is (1) adaptable as the labeling functions can be updated with relatively little manual effort to incorporate more variations in radiology language reporting formats and (2) generalizable as these functions can be applied across multiple radiology subdomains in most cases. Conclusions We demonstrate a weakly supervision model performs sufficiently well in identifying a variety of relations from radiology text without manual annotations, while exceeding state-of-the-art results when annotated data are available.
Collapse
Affiliation(s)
- Surabhi Datta
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Kirk Roberts
- Corresponding Author: Kirk Roberts, PhD, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, 7000 Fannin St, Suite 600, Houston, TX 77030, USA;
| |
Collapse
|
10
|
Molina M, Jiménez C, Montenegro C. Improving Drug-Drug Interaction Extraction with Gaussian Noise. Pharmaceutics 2023; 15:1823. [PMID: 37514010 PMCID: PMC10385013 DOI: 10.3390/pharmaceutics15071823] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2023] [Revised: 04/28/2023] [Accepted: 06/12/2023] [Indexed: 07/30/2023] Open
Abstract
Drug-Drug Interactions (DDIs) produce essential and valuable insights for healthcare professionals, since they provide data on the impact of concurrent administration of medications to patients during therapy. In that sense, some relevant works, related to the DDIExtraction2013 Challenge, are available in the current technical literature. This study aims to improve previous results, using two models, where a Gaussian noise layer is added to achieve better DDI relationship extraction. (1) A Piecewise Convolutional Neural Network (PW-CNN) model is used to capture relationships among pharmacological entities described in biomedical databases. Additionally, the model incorporates multichannel words to enrich a person's vocabulary and reduce unfamiliar words. (2) The model uses the pre-trained BERT language model to classify relationships, while also integrating data from the target entities. After identifying the target entities, the model transfers the relevant information through the pre-trained architecture and integrates the encoded data for both entities. The results of the experiment show an improved performance, with respect to previous models.
Collapse
Affiliation(s)
- Marco Molina
- Department of Informatics and Computer Science, Faculty of Systems Engineering, Escuela Politécnica Nacional, Av. Ladron de Guevara E11-25, Quito 170525, Ecuador
| | - Cristina Jiménez
- Department of Informatics and Computer Science, Faculty of Systems Engineering, Escuela Politécnica Nacional, Av. Ladron de Guevara E11-25, Quito 170525, Ecuador
| | - Carlos Montenegro
- Department of Informatics and Computer Science, Faculty of Systems Engineering, Escuela Politécnica Nacional, Av. Ladron de Guevara E11-25, Quito 170525, Ecuador
| |
Collapse
|
11
|
Kim S, Yoon J, Kwon O. Biomedical Relation Extraction Using Dependency Graph and Decoder-Enhanced Transformer Model. Bioengineering (Basel) 2023; 10:bioengineering10050586. [PMID: 37237656 DOI: 10.3390/bioengineering10050586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2023] [Revised: 05/06/2023] [Accepted: 05/09/2023] [Indexed: 05/28/2023] Open
Abstract
The identification of drug-drug and chemical-protein interactions is essential for understanding unpredictable changes in the pharmacological effects of drugs and mechanisms of diseases and developing therapeutic drugs. In this study, we extract drug-related interactions from the DDI (Drug-Drug Interaction) Extraction-2013 Shared Task dataset and the BioCreative ChemProt (Chemical-Protein) dataset using various transfer transformers. We propose BERTGAT that uses a graph attention network (GAT) to take into account the local structure of sentences and embedding features of nodes under the self-attention scheme and investigate whether incorporating syntactic structure can help relation extraction. In addition, we suggest T5slim_dec, which adapts the autoregressive generation task of the T5 (text-to-text transfer transformer) to the relation classification problem by removing the self-attention layer in the decoder block. Furthermore, we evaluated the potential of biomedical relation extraction of GPT-3 (Generative Pre-trained Transformer) using GPT-3 variant models. As a result, T5slim_dec, which is a model with a tailored decoder designed for classification problems within the T5 architecture, demonstrated very promising performances for both tasks. We achieved an accuracy of 91.15% in the DDI dataset and an accuracy of 94.29% for the CPR (Chemical-Protein Relation) class group in ChemProt dataset. However, BERTGAT did not show a significant performance improvement in the aspect of relation extraction. We demonstrated that transformer-based approaches focused only on relationships between words are implicitly eligible to understand language well without additional knowledge such as structural information.
Collapse
Affiliation(s)
- Seonho Kim
- Department of Computer Science and Engineering, Sogang University, Seoul 04107, Republic of Korea
| | - Juntae Yoon
- VAIV Company, Seoul 04107, Republic of Korea
| | - Ohyoung Kwon
- Department of Future Technology, Korea University of Technology and Education, Cheonan-si 31253, Republic of Korea
| |
Collapse
|
12
|
Yang Y, Lu Y, Yan W. A comprehensive review on knowledge graphs for complex diseases. Brief Bioinform 2023; 24:6931722. [PMID: 36528805 DOI: 10.1093/bib/bbac543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/22/2022] [Revised: 11/02/2022] [Accepted: 11/10/2022] [Indexed: 12/23/2022] Open
Abstract
In recent years, knowledge graphs (KGs) have gained a great deal of popularity as a tool for storing relationships between entities and for performing higher level reasoning. KGs in biomedicine and clinical practice aim to provide an elegant solution for diagnosing and treating complex diseases more efficiently and flexibly. Here, we provide a systematic review to characterize the state-of-the-art of KGs in the area of complex disease research. We cover the following topics: (1) knowledge sources, (2) entity extraction methods, (3) relation extraction methods and (4) the application of KGs in complex diseases. As a result, we offer a complete picture of the domain. Finally, we discuss the challenges in the field by identifying gaps and opportunities for further research and propose potential research directions of KGs for complex disease diagnosis and treatment.
Collapse
Affiliation(s)
- Yang Yang
- School of Computer Science & Technology, Soochow University, Suzhou 215000, China.,Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
| | - Yuwei Lu
- School of Computer Science & Technology, Soochow University, Suzhou 215000, China.,Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000, China
| | - Wenying Yan
- Department of Bioinformatics, School of Biology and Basic Medical Sciences, Medical College of Soochow University, and Center for Systems Biology, Soochow University, Suzhou 215123, China
| |
Collapse
|
13
|
Yang J, Ding Y, Long S, Poon J, Han SC. DDI-MuG: Multi-aspect graphs for drug-drug interaction extraction. Front Digit Health 2023; 5:1154133. [PMID: 37168529 PMCID: PMC10164961 DOI: 10.3389/fdgth.2023.1154133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2023] [Accepted: 04/03/2023] [Indexed: 05/13/2023] Open
Abstract
Introduction Drug-drug interaction (DDI) may lead to adverse reactions in patients, thus it is important to extract such knowledge from biomedical texts. However, previously proposed approaches typically focus on capturing sentence-aspect information while ignoring valuable knowledge concerning the whole corpus. In this paper, we propose a Multi-aspect Graph-based DDI extraction model, named DDI-MuG. Methods We first employ a bio-specific pre-trained language model to obtain the token contextualized representations. Then we use two graphs to get syntactic information from input instance and word co-occurrence information within the entire corpus, respectively. Finally, we combine the representations of drug entities and verb tokens for the final classification. Results To validate the effectiveness of the proposed model, we perform extensive experiments on two widely used DDI extraction dataset, DDIExtraction-2013 and TAC 2018. It is encouraging to see that our model outperforms all twelve state-of-the-art models. Discussion In contrast to the majority of earlier models that rely on the black-box approach, our model enables visualization of crucial words and their interrelationships by utilizing edge information from two graphs. To the best of our knowledge, this is the first model that explores multi-aspect graphs to the DDI extraction task, and we hope it can establish a foundation for more robust multi-aspect works in the future.
Collapse
Affiliation(s)
- Jie Yang
- School of Computer Science, The University of Sydney, Sydney, NSW, Australia
| | - Yihao Ding
- School of Computer Science, The University of Sydney, Sydney, NSW, Australia
| | - Siqu Long
- School of Computer Science, The University of Sydney, Sydney, NSW, Australia
| | - Josiah Poon
- School of Computer Science, The University of Sydney, Sydney, NSW, Australia
| | - Soyeon Caren Han
- School of Computer Science, The University of Sydney, Sydney, NSW, Australia
- Department of Computer Science, University of Western Australia, Perth, WA, Australia
- Correspondence: Soyeon Caren Han ;
| |
Collapse
|
14
|
Luo L, Lai PT, Wei CH, Arighi CN, Lu Z. BioRED: a rich biomedical relation extraction dataset. Brief Bioinform 2022; 23:6645993. [PMID: 35849818 PMCID: PMC9487702 DOI: 10.1093/bib/bbac282] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2022] [Revised: 06/02/2022] [Accepted: 06/19/2022] [Indexed: 11/13/2022] Open
Abstract
Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.
Collapse
Affiliation(s)
- Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | | | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| |
Collapse
|
15
|
McInnes BT, Downie JS, Hao Y, Jett J, Keating K, Nakum G, Ranjan S, Rodriguez NE, Tang J, Xiang D, Young EM, Nguyen MH. Discovering Content through Text Mining for a Synthetic Biology Knowledge System. ACS Synth Biol 2022; 11:2043-2054. [PMID: 35671034 DOI: 10.1021/acssynbio.1c00611] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
Scientific articles contain a wealth of information about experimental methods and results describing biological designs. Due to its unstructured nature and multiple sources of ambiguity and variability, extracting this information from text is a difficult task. In this paper, we describe the development of the synthetic biology knowledge system (SBKS) text processing pipeline. The pipeline uses natural language processing techniques to extract and correlate information from the literature for synthetic biology researchers. Specifically, we apply named entity recognition, relation extraction, concept grounding, and topic modeling to extract information from published literature to link articles to elements within our knowledge system. Our results show the efficacy of each of the components on synthetic biology literature and provide future directions for further advancement of the pipeline.
Collapse
Affiliation(s)
- Bridget T McInnes
- Virginia Commonwealth University, Richmond, Virginia 23284, United States
| | - J Stephen Downie
- University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| | - Yikai Hao
- University of California San Diego, La Jolla, California 92093, United States
| | - Jacob Jett
- University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| | - Kevin Keating
- Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
| | - Gaurav Nakum
- University of California San Diego, La Jolla, California 92093, United States
| | - Sudhanshu Ranjan
- University of California San Diego, La Jolla, California 92093, United States
| | | | - Jiawei Tang
- University of California San Diego, La Jolla, California 92093, United States
| | - Du Xiang
- University of California San Diego, La Jolla, California 92093, United States
| | - Eric M Young
- Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
| | - Mai H Nguyen
- University of California San Diego, La Jolla, California 92093, United States
| |
Collapse
|
16
|
Shi Y, Quan P, Zhang T, Niu L. DREAM: Drug-Drug Interaction Extraction with Enhanced Dependency Graph and Attention Mechanism. Methods 2022; 203:152-159. [PMID: 35181524 DOI: 10.1016/j.ymeth.2022.02.002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2021] [Revised: 12/10/2021] [Accepted: 02/06/2022] [Indexed: 11/28/2022] Open
Abstract
Drug-drug interactions (DDIs) aim at describing the effect relations produced by a combination of two or more drugs. It is an important semantic processing task in the field of bioinformatics such as pharmacovigilance and clinical research. Recently, graph neural networks are applied on dependency graph to promote the performance of DDI extraction with better semantic representations. However, current method concentrates more on first-order dependency relations and cannot discriminate the connected nodes properly. To better incorporate the dependency relations and improve the representations, we propose a novel DDI extraction method named Drug-drug Interactions extRaction with Enhanced Dependency Graph and Attention Mechanism in this work. Specifically, the dependency graph is enhanced with some potential long-range words to complete the semantic information and fit the aggregation process of graph neural networks. And graph attention mechanism is adopted to further improve word representation by discriminating the connected nodes according to the specific task. Numerical experiments on DDIExtraction 2013 corpus, the benchmark corpus for this domain, demonstrate the superiority of our proposed method.
Collapse
Affiliation(s)
- Yong Shi
- School of Economics and Management, University of Chinese Academy of Sciences, Beijing, 100049, China; Key Lab of Big Data Mining and Knowledge Management Chinese Academy of Sciences, Beijing, 100190 China; Research Center on Fictitious Economy and Data Science, Chinese Academy of Sciences, Beijing, 100190 China; College of Information Science and Technology, University of Nebraska at Omaha, NE, 68182 USA.
| | - Pei Quan
- School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, 100049, China.
| | - Tianlin Zhang
- School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, 100049, China; National Centre for Text Mining, University of Manchester, Manchester, M1 7DN, United Kingdom.
| | - Lingfeng Niu
- School of Economics and Management, University of Chinese Academy of Sciences, Beijing, 100049, China.
| |
Collapse
|
17
|
Wang J, Ren Y, Zhang Z, Xu H, Zhang Y. From Tokenization to Self-Supervision: Building a High-Performance Information Extraction System for Chemical Reactions in Patents. Front Res Metr Anal 2022; 6:691105. [PMID: 35005421 PMCID: PMC8727901 DOI: 10.3389/frma.2021.691105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2021] [Accepted: 11/02/2021] [Indexed: 11/28/2022] Open
Abstract
Chemical reactions and experimental conditions are fundamental information for chemical research and pharmaceutical applications. However, the latest information of chemical reactions is usually embedded in the free text of patents. The rapidly accumulating chemical patents urge automatic tools based on natural language processing (NLP) techniques for efficient and accurate information extraction. This work describes the participation of the Melax Tech team in the CLEF 2020—ChEMU Task of Chemical Reaction Extraction from Patent. The task consisted of two subtasks: (1) named entity recognition to identify compounds and different semantic roles in the chemical reaction and (2) event extraction to identify event triggers of chemical reaction and their relations with the semantic roles recognized in subtask 1. To build an end-to-end system with high performance, multiple strategies tailored to chemical patents were applied and evaluated, ranging from optimizing the tokenization, pre-training patent language models based on self-supervision, to domain knowledge-based rules. Our hybrid approaches combining different strategies achieved state-of-the-art results in both subtasks, with the top-ranked F1 of 0.957 for entity recognition and the top-ranked F1 of 0.9536 for event extraction, indicating that the proposed approaches are promising.
Collapse
Affiliation(s)
- Jingqi Wang
- Melax Technologies, Inc., Houston, TX, United States
| | - Yuankai Ren
- School of Medicine, Nantong University, Nantong, China
| | - Zhi Zhang
- School of Medicine, Nantong University, Nantong, China
| | - Hua Xu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Yaoyun Zhang
- Melax Technologies, Inc., Houston, TX, United States
| |
Collapse
|
18
|
Chen T, Hu Y. Entity relation extraction from electronic medical records based on improved annotation rules and BiLSTM-CRF. Ann Transl Med 2021; 9:1415. [PMID: 34733967 PMCID: PMC8506757 DOI: 10.21037/atm-21-3828] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 06/09/2021] [Accepted: 08/26/2021] [Indexed: 11/17/2022]
Abstract
Background Extracting entities and their relationships from electronic medical records (EMRs) is an important research direction in the development of medical informatization. Recently, a method was proposed to transform entity relation extraction into entity recognition by using annotation rules, and then solve the problem of relation extraction by an entity recognition model. However, this method cannot deal with one-to-many entity relationship problems. Methods This paper combined the bidirectional long- and short-term memory-conditional random field (BiLSTM-CRF) deep learning model with an improvement of sequence annotation rules, hided relationships between entities in entity labels, then the problem of one-to-many named entity relation extraction in EMRs was transformed into entity recognition based on relation sets, and entity extraction was carried out through the entity recognition model. Results Entity extraction was achieved through the entity recognition model. The result of entity recognition was transformed into the corresponding entity relationship, thus completing the task of one-to-many entity relation extraction by the improved annotation rules, the accuracy rate of proposed method reaches 83.46%, the recall rate is 81.12%, and the value of comprehensive index F1 is 0.8227. Conclusions Through the annotation analysis of EMRs, our experimental results show that the improved annotation rules can effectively complete the task of one-to-many medical entity relation extraction from EMRs.
Collapse
Affiliation(s)
- Tingyin Chen
- Department of Network and Information, Xiangya Hospital, Central South University, Changsha, China.,Mobile Health Ministry of Education-China Mobile Joint Laboratory, Changsha, China
| | - Yongmei Hu
- Department of Oncology, Xiangya Hospital, Central South University, Changsha, China
| |
Collapse
|
19
|
Barros M, Ruas P, Sousa D, Bangash AH, Couto FM. COVID-19 recommender system based on an annotated multilingual corpus. Genomics Inform 2021; 19:e24. [PMID: 34638171 PMCID: PMC8510867 DOI: 10.5808/gi.21008] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2021] [Accepted: 08/12/2021] [Indexed: 12/11/2022] Open
Abstract
Tracking the most recent advances in Coronavirus disease 2019 (COVID-19)‒related research is essential, given the disease's novelty and its impact on society. However, with the publication pace speeding up, researchers and clinicians require automatic approaches to keep up with the incoming information regarding this disease. A solution to this problem requires the development of text mining pipelines; the efficiency of which strongly depends on the availability of curated corpora. However, there is a lack of COVID-19‒related corpora, even more, if considering other languages besides English. This project's main contribution was the annotation of a multilingual parallel corpus and the generation of a recommendation dataset (EN-PT and EN-ES) regarding relevant entities, their relations, and recommendation, providing this resource to the community to improve the text mining research on COVID-19‒related literature. This work was developed during the 7th Biomedical Linked Annotation Hackathon (BLAH7).
Collapse
Affiliation(s)
- Márcia Barros
- Large-Scale Informatics Systems Laboratory, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal.,Center for Astrophysics and Gravitation, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
| | - Pedro Ruas
- Large-Scale Informatics Systems Laboratory, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
| | - Diana Sousa
- Large-Scale Informatics Systems Laboratory, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
| | - Ali Haider Bangash
- Shifa College of Medicine, Shifa Tameer-e-Millat University, Islamabad 46000, Pakistan.,Working Group 3, COST Action EVidence-Based RESearch (EVBRES), Western Norway University of Applied Sciences, Inndalsveien 28, 5063 Bergen, Norway
| | - Francisco M Couto
- Large-Scale Informatics Systems Laboratory, Faculdade de Ciências, Universidade de Lisboa, 1749-016 Lisboa, Portugal
| |
Collapse
|
20
|
Mahendran D, Gurdin G, Lewinski N, Tang C, McInnes BT. Identifying Chemical Reactions and Their Associated Attributes in Patents. Front Res Metr Anal 2021; 6:688353. [PMID: 34322654 PMCID: PMC8312343 DOI: 10.3389/frma.2021.688353] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2021] [Accepted: 05/31/2021] [Indexed: 11/13/2022] Open
Abstract
Chemical patents are an essential source of information about novel chemicals and chemical reactions. However, with the increasing volume of such patents, mining information about these chemicals and chemical reactions has become a time-intensive and laborious endeavor. In this study, we present a system to extract chemical reaction events from patents automatically. Our approach consists of two steps: 1) named entity recognition (NER)-the automatic identification of chemical reaction parameters from the corresponding text, and 2) event extraction (EE)-the automatic classifying and linking of entities based on their relationships to each other. For our NER system, we evaluate bidirectional long short-term memory (BiLSTM)-based and bidirectional encoder representations from transformer (BERT)-based methods. For our EE system, we evaluate BERT-based, convolutional neural network (CNN)-based, and rule-based methods. We evaluate our NER and EE components independently and as an end-to-end system, reporting the precision, recall, and F 1 score. Our results show that the BiLSTM-based method performed best at identifying the entities, and the CNN-based method performed best at extracting events.
Collapse
Affiliation(s)
- Darshini Mahendran
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| | - Gabrielle Gurdin
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| | - Nastassja Lewinski
- Department of Life Science and Chemical Engineering, Virginia Commonwealth University, Richmond, VA, United States
| | - Christina Tang
- Department of Life Science and Chemical Engineering, Virginia Commonwealth University, Richmond, VA, United States
| | - Bridget T McInnes
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| |
Collapse
|
21
|
Kim Y, Heider PM, Lally IR, Meystre SM. A Hybrid Model for Family History Information Identification and Relation Extraction: Development and Evaluation of an End-to-End Information Extraction System. JMIR Med Inform 2021; 9:e22797. [PMID: 33885370 PMCID: PMC8103307 DOI: 10.2196/22797] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2020] [Revised: 12/15/2020] [Accepted: 02/19/2021] [Indexed: 01/02/2023] Open
Abstract
BACKGROUND Family history information is important to assess the risk of inherited medical conditions. Natural language processing has the potential to extract this information from unstructured free-text notes to improve patient care and decision making. We describe the end-to-end information extraction system the Medical University of South Carolina team developed when participating in the 2019 National Natural Language Processing Clinical Challenge (n2c2)/Open Health Natural Language Processing (OHNLP) shared task. OBJECTIVE This task involves identifying mentions of family members and observations in electronic health record text notes and recognizing the 2 types of relations (family member-living status relations and family member-observation relations). Our system aims to achieve a high level of performance by integrating heuristics and advanced information extraction methods. Our efforts also include improving the performance of 2 subtasks by exploiting additional labeled data and clinical text-based embedding models. METHODS We present a hybrid method that combines machine learning and rule-based approaches. We implemented an end-to-end system with multiple information extraction and attribute classification components. For entity identification, we trained bidirectional long short-term memory deep learning models. These models incorporated static word embeddings and context-dependent embeddings. We created a voting ensemble that combined the predictions of all individual models. For relation extraction, we trained 2 relation extraction models. The first model determined the living status of each family member. The second model identified observations associated with each family member. We implemented online gradient descent models to extract related entity pairs. As part of postchallenge efforts, we used the BioCreative/OHNLP 2018 corpus and trained new models with the union of these 2 datasets. We also pretrained language models using clinical notes from the Medical Information Mart for Intensive Care (MIMIC-III) clinical database. RESULTS The voting ensemble achieved better performance than individual classifiers. In the entity identification task, our top-performing system reached a precision of 78.90% and a recall of 83.84%. Our natural language processing system for entity identification took 3rd place out of 17 teams in the challenge. We ranked 4th out of 9 teams in the relation extraction task. Our system substantially benefited from the combination of the 2 datasets. Compared to our official submission with F1 scores of 81.30% and 64.94% for entity identification and relation extraction, respectively, the revised system yielded significantly better performance (P<.05) with F1 scores of 86.02% and 72.48%, respectively. CONCLUSIONS We demonstrated that a hybrid model could be used to successfully extract family history information recorded in unstructured free-text notes. In this study, our approach to entity identification as a sequence labeling problem produced satisfactory results. Our postchallenge efforts significantly improved performance by leveraging additional labeled data and using word vector representations learned from large collections of clinical notes.
Collapse
Affiliation(s)
- Youngjun Kim
- Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, United States
| | - Paul M Heider
- Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, United States
| | - Isabel Rh Lally
- Department of Computer Science, College of Charleston, Charleston, SC, United States
| | - Stéphane M Meystre
- Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, United States
| |
Collapse
|
22
|
Zhan K, Peng W, Xiong Y, Fu H, Chen Q, Wang X, Tang B. Novel Graph-Based Model With Biaffine Attention for Family History Extraction From Clinical Text: Modeling Study. JMIR Med Inform 2021; 9:e23587. [PMID: 33881405 PMCID: PMC8100876 DOI: 10.2196/23587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2020] [Revised: 01/18/2021] [Accepted: 02/09/2021] [Indexed: 11/17/2022] Open
Abstract
Background Family history information, including information on family members, side of the family of family members, living status of family members, and observations of family members, plays an important role in disease diagnosis and treatment. Family member information extraction aims to extract family history information from semistructured/unstructured text in electronic health records (EHRs), which is a challenging task regarding named entity recognition (NER) and relation extraction (RE), where named entities refer to family members, living status, and observations, and relations refer to relations between family members and living status, and relations between family members and observations. Objective This study aimed to introduce the system we developed for the 2019 n2c2/OHNLP track on family history extraction, which can jointly extract entities and relations about family history information from clinical text. Methods We proposed a novel graph-based model with biaffine attention for family history extraction from clinical text. In this model, we first designed a graph to represent family history information, that is, representing NER and RE regarding family history in a unified way, and then introduced a biaffine attention mechanism to extract family history information in clinical text. Convolution neural network (CNN)-Bidirectional Long Short Term Memory network (BiLSTM) and Bidirectional Encoder Representation from Transformers (BERT) were used to encode the input sentence, and a biaffine classifier was used to extract family history information. In addition, we developed a postprocessing module to adjust the results. A system based on the proposed method was developed for the 2019 n2c2/OHNLP shared task track on family history information extraction. Results Our system ranked first in the challenge, and the F1 scores of the best system on the NER subtask and RE subtask were 0.8745 and 0.6810, respectively. After the challenge, we further fine tuned the parameters and improved the F1 scores of the two subtasks to 0.8823 and 0.7048, respectively. Conclusions The experimental results showed that the system based on the proposed method can extract family history information from clinical text effectively.
Collapse
Affiliation(s)
- Kecheng Zhan
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, China
| | - Weihua Peng
- Baidu International Technology (Shenzhen) Co, Ltd, Shenzhen, China
| | - Ying Xiong
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, China
| | - Huhao Fu
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, China
| | - Qingcai Chen
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, China.,Peng Cheng Laboratory, Shenzhen, China
| | - Xiaolong Wang
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, China
| | - Buzhou Tang
- Key Laboratory of Network Oriented Intelligent Computation, Harbin Institute of Technology, Shenzhen, China.,Peng Cheng Laboratory, Shenzhen, China
| |
Collapse
|
23
|
Wei Q, Ji Z, Li Z, Du J, Wang J, Xu J, Xiang Y, Tiryaki F, Wu S, Zhang Y, Tao C, Xu H. A study of deep learning approaches for medication and adverse drug event extraction from clinical text. J Am Med Inform Assoc 2021; 27:13-21. [PMID: 31135882 DOI: 10.1093/jamia/ocz063] [Citation(s) in RCA: 49] [Impact Index Per Article: 16.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2019] [Revised: 03/23/2019] [Accepted: 04/17/2019] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE This article presents our approaches to extraction of medications and associated adverse drug events (ADEs) from clinical documents, which is the second track of the 2018 National NLP Clinical Challenges (n2c2) shared task. MATERIALS AND METHODS The clinical corpus used in this study was from the MIMIC-III database and the organizers annotated 303 documents for training and 202 for testing. Our system consists of 2 components: a named entity recognition (NER) and a relation classification (RC) component. For each component, we implemented deep learning-based approaches (eg, BI-LSTM-CRF) and compared them with traditional machine learning approaches, namely, conditional random fields for NER and support vector machines for RC, respectively. In addition, we developed a deep learning-based joint model that recognizes ADEs and their relations to medications in 1 step using a sequence labeling approach. To further improve the performance, we also investigated different ensemble approaches to generating optimal performance by combining outputs from multiple approaches. RESULTS Our best-performing systems achieved F1 scores of 93.45% for NER, 96.30% for RC, and 89.05% for end-to-end evaluation, which ranked #2, #1, and #1 among all participants, respectively. Additional evaluations show that the deep learning-based approaches did outperform traditional machine learning algorithms in both NER and RC. The joint model that simultaneously recognizes ADEs and their relations to medications also achieved the best performance on RC, indicating its promise for relation extraction. CONCLUSION In this study, we developed deep learning approaches for extracting medications and their attributes such as ADEs, and demonstrated its superior performance compared with traditional machine learning algorithms, indicating its uses in broader NER and RC tasks in the medical domain.
Collapse
Affiliation(s)
- Qiang Wei
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Zongcheng Ji
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Zhiheng Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian, China
| | - Jingcheng Du
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Jingqi Wang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Jun Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Yang Xiang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Firat Tiryaki
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Stephen Wu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Yaoyun Zhang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Cui Tao
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| |
Collapse
|
24
|
Christopoulou F, Tran TT, Sahu SK, Miwa M, Ananiadou S. Adverse drug events and medication relation extraction in electronic health records with ensemble deep learning methods. J Am Med Inform Assoc 2021; 27:39-46. [PMID: 31390003 PMCID: PMC6913215 DOI: 10.1093/jamia/ocz101] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2019] [Revised: 03/21/2019] [Accepted: 05/24/2019] [Indexed: 01/21/2023] Open
Abstract
Objective Identification of drugs, associated medication entities, and interactions among them are crucial to prevent unwanted effects of drug therapy, known as adverse drug events. This article describes our participation to the n2c2 shared-task in extracting relations between medication-related entities in electronic health records. Materials and Methods We proposed an ensemble approach for relation extraction and classification between drugs and medication-related entities. We incorporated state-of-the-art named-entity recognition (NER) models based on bidirectional long short-term memory (BiLSTM) networks and conditional random fields (CRF) for end-to-end extraction. We additionally developed separate models for intra- and inter-sentence relation extraction and combined them using an ensemble method. The intra-sentence models rely on bidirectional long short-term memory networks and attention mechanisms and are able to capture dependencies between multiple related pairs in the same sentence. For the inter-sentence relations, we adopted a neural architecture that utilizes the Transformer network to improve performance in longer sequences. Results Our team ranked third with a micro-averaged F1 score of 94.72% and 87.65% for relation and end-to-end relation extraction, respectively (Tracks 2 and 3). Our ensemble effectively takes advantages from our proposed models. Analysis of the reported results indicated that our proposed approach is more generalizable than the top-performing system, which employs additional training data- and corpus-driven processing techniques. Conclusions We proposed a relation extraction system to identify relations between drugs and medication-related entities. The proposed approach is independent of external syntactic tools. Analysis showed that by using latent Drug-Drug interactions we were able to significantly improve the performance of non–Drug-Drug pairs in EHRs.
Collapse
Affiliation(s)
- Fenia Christopoulou
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, United Kingdom.,Artificial Intelligence Research Centre, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
| | - Thy Thy Tran
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, United Kingdom.,Artificial Intelligence Research Centre, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
| | - Sunil Kumar Sahu
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, United Kingdom
| | - Makoto Miwa
- Artificial Intelligence Research Centre, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan.,Toyota Technological Institute, Nagoya, Japan
| | - Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, The University of Manchester, Manchester, United Kingdom.,Artificial Intelligence Research Centre, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
| |
Collapse
|
25
|
Yang X, Bian J, Fang R, Bjarnadottir RI, Hogan WR, Wu Y. Identifying relations of medications with adverse drug events using recurrent convolutional neural networks and gradient boosting. J Am Med Inform Assoc 2021; 27:65-72. [PMID: 31504605 DOI: 10.1093/jamia/ocz144] [Citation(s) in RCA: 30] [Impact Index Per Article: 10.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2019] [Revised: 05/30/2019] [Accepted: 07/22/2019] [Indexed: 01/19/2023] Open
Abstract
OBJECTIVE To develop a natural language processing system that identifies relations of medications with adverse drug events from clinical narratives. This project is part of the 2018 n2c2 challenge. MATERIALS AND METHODS We developed a novel clinical named entity recognition method based on an recurrent convolutional neural network and compared it to a recurrent neural network implemented using the long-short term memory architecture, explored methods to integrate medical knowledge as embedding layers in neural networks, and investigated 3 machine learning models, including support vector machines, random forests and gradient boosting for relation classification. The performance of our system was evaluated using annotated data and scripts provided by the 2018 n2c2 organizers. RESULTS Our system was among the top ranked. Our best model submitted during this challenge (based on recurrent neural networks and support vector machines) achieved lenient F1 scores of 0.9287 for concept extraction (ranked third), 0.9459 for relation classification (ranked fourth), and 0.8778 for the end-to-end relation extraction (ranked second). We developed a novel named entity recognition model based on a recurrent convolutional neural network and further investigated gradient boosting for relation classification. The new methods improved the lenient F1 scores of the 3 subtasks to 0.9292, 0.9633, and 0.8880, respectively, which are comparable to the best performance reported in this challenge. CONCLUSION This study demonstrated the feasibility of using machine learning methods to extract the relations of medications with adverse drug events from clinical narratives.
Collapse
Affiliation(s)
- Xi Yang
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Ruogu Fang
- J. Crayton Pruitt Family Department of Biomedical Engineering, University of Florida, Gainesville, Florida, USA
| | - Ragnhildur I Bjarnadottir
- Department of Family, Community and Health Systems Science, College of Nursing, University of Florida, Gainesville, Florida, USA
| | - William R Hogan
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
| | - Yonghui Wu
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, Florida, USA
| |
Collapse
|
26
|
Fan Y, Zhou S, Li Y, Zhang R. Deep learning approaches for extracting adverse events and indications of dietary supplements from clinical text. J Am Med Inform Assoc 2021; 28:569-577. [PMID: 33150942 DOI: 10.1093/jamia/ocaa218] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2020] [Accepted: 08/19/2020] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE We sought to demonstrate the feasibility of utilizing deep learning models to extract safety signals related to the use of dietary supplements (DSs) in clinical text. MATERIALS AND METHODS Two tasks were performed in this study. For the named entity recognition (NER) task, Bi-LSTM-CRF (bidirectional long short-term memory conditional random field) and BERT (bidirectional encoder representations from transformers) models were trained and compared with CRF model as a baseline to recognize the named entities of DSs and events from clinical notes. In the relation extraction (RE) task, 2 deep learning models, including attention-based Bi-LSTM and convolutional neural network as well as a random forest model were trained to extract the relations between DSs and events, which were categorized into 3 classes: positive (ie, indication), negative (ie, adverse events), and not related. The best performed NER and RE models were further applied on clinical notes mentioning 88 DSs for discovering DSs adverse events and indications, which were compared with a DS knowledge base. RESULTS For the NER task, deep learning models achieved a better performance than CRF, with F1 scores above 0.860. The attention-based Bi-LSTM model performed the best in the RE task, with an F1 score of 0.893. When comparing DS event pairs generated by the deep learning models with the knowledge base for DSs and event, we found both known and unknown pairs. CONCLUSIONS Deep learning models can detect adverse events and indication of DSs in clinical notes, which hold great potential for monitoring the safety of DS use.
Collapse
Affiliation(s)
- Yadan Fan
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
| | - Sicheng Zhou
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA
| | - Yifan Li
- College of Pharmacy, University of Minnesota, Minneapolis, Minnesota, USA
| | - Rui Zhang
- Institute for Health Informatics, University of Minnesota, Minneapolis, Minnesota, USA.,College of Pharmacy, University of Minnesota, Minneapolis, Minnesota, USA
| |
Collapse
|
27
|
Tran T, Kavuluru R, Kilicoglu H. Attention-Gated Graph Convolutions for Extracting Drug Interaction Information from Drug Labels. ACM Trans Comput Healthc 2021; 2:10. [PMID: 34541578 PMCID: PMC8445229 DOI: 10.1145/3423209] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/01/2019] [Accepted: 09/01/2020] [Indexed: 01/02/2023]
Abstract
Preventable adverse events as a result of medical errors present a growing concern in the healthcare system. As drug-drug interactions (DDIs) may lead to preventable adverse events, being able to extract DDIs from drug labels into a machine-processable form is an important step toward effective dissemination of drug safety information. Herein, we tackle the problem of jointly extracting mentions of drugs and their interactions, including interaction outcome, from drug labels. Our deep learning approach entails composing various intermediate representations, including graph-based context derived using graph convolutions (GCs) with a novel attention-based gating mechanism (holistically called GCA), which are combined in meaningful ways to predict on all subtasks jointly. Our model is trained and evaluated on the 2018 TAC DDI corpus. Our GCA model in conjunction with transfer learning performs at 39.20% F1 and 26.09% F1 on entity recognition (ER) and relation extraction (RE), respectively, on the first official test set and at 45.30% F1 and 27.87% F1 on ER and RE, respectively, on the second official test set. These updated results lead to improvements over our prior best by up to 6 absolute F1 points. After controlling for available training data, the proposed model exhibits state-of-the-art performance for this task.
Collapse
Affiliation(s)
- Tung Tran
- University of Kentucky, United States
| | | | | |
Collapse
|
28
|
Zeng D, Zhao C, Quan Z. CID-GCN: An Effective Graph Convolutional Networks for Chemical-Induced Disease Relation Extraction. Front Genet 2021; 12:624307. [PMID: 33643385 PMCID: PMC7902761 DOI: 10.3389/fgene.2021.624307] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2020] [Accepted: 01/18/2021] [Indexed: 11/26/2022] Open
Abstract
Automatic extraction of chemical-induced disease (CID) relation from unstructured text is of essential importance for disease treatment and drug development. In this task, some relational facts can only be inferred from the document rather than single sentence. Recently, researchers investigate graph-based approaches to extract relations across sentences. It iteratively combines the information from neighbor nodes to model the interactions in entity mentions that exist in different sentences. Despite their success, one severe limitation of the graph-based approaches is the over-smoothing problem, which decreases the model distinguishing ability. In this paper, we propose CID-GCN, an effective Graph Convolutional Networks (GCNs) with gating mechanism, for CID relation extraction. Specifically, we construct a heterogeneous graph which contains mention, sentence and entity nodes. Then, the graph convolution operation is employed to aggregate interactive information on the constructed graph. Particularly, we combine gating mechanism with the graph convolution operation to address the over-smoothing problem. The experimental results demonstrate that our approach significantly outperforms the baselines.
Collapse
Affiliation(s)
- Daojian Zeng
- Hunan Provincial Key Laboratory of Intelligent Computing and Language Information Processing, Hunan Normal University, Changsha, China
| | - Chao Zhao
- School of Computer and Communication Engineering, Changsha University of Science and Technology, Changsha, China
| | - Zhe Quan
- College of Information Science and Engineering, Hunan University, Changsha, China
| |
Collapse
|
29
|
Shen F, Liu S, Fu S, Wang Y, Henry S, Uzuner O, Liu H. Family History Extraction From Synthetic Clinical Narratives Using Natural Language Processing: Overview and Evaluation of a Challenge Data Set and Solutions for the 2019 National NLP Clinical Challenges (n2c2)/Open Health Natural Language Processing (OHNLP) Competition. JMIR Med Inform 2021; 9:e24008. [PMID: 33502329 PMCID: PMC7875692 DOI: 10.2196/24008] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/31/2020] [Revised: 11/25/2020] [Accepted: 12/05/2020] [Indexed: 12/18/2022] Open
Abstract
Background As a risk factor for many diseases, family history (FH) captures both shared genetic variations and living environments among family members. Though there are several systems focusing on FH extraction using natural language processing (NLP) techniques, the evaluation protocol of such systems has not been standardized. Objective The n2c2/OHNLP (National NLP Clinical Challenges/Open Health Natural Language Processing) 2019 FH extraction task aims to encourage the community efforts on a standard evaluation and system development on FH extraction from synthetic clinical narratives. Methods We organized the first BioCreative/OHNLP FH extraction shared task in 2018. We continued the shared task in 2019 in collaboration with the n2c2 and OHNLP consortium, and organized the 2019 n2c2/OHNLP FH extraction track. The shared task comprises 2 subtasks. Subtask 1 focuses on identifying family member entities and clinical observations (diseases), and subtask 2 expects the association of the living status, side of the family, and clinical observations with family members to be extracted. Subtask 2 is an end-to-end task which is based on the result of subtask 1. We manually curated the first deidentified clinical narrative from FH sections of clinical notes at Mayo Clinic Rochester, the content of which is highly relevant to patients’ FH. Results A total of 17 teams from all over the world participated in the n2c2/OHNLP FH extraction shared task, where 38 runs were submitted for subtask 1 and 21 runs were submitted for subtask 2. For subtask 1, the top 3 runs were generated by Harbin Institute of Technology, ezDI, Inc., and The Medical University of South Carolina with F1 scores of 0.8745, 0.8225, and 0.8130, respectively. For subtask 2, the top 3 runs were from Harbin Institute of Technology, ezDI, Inc., and University of Florida with F1 scores of 0.681, 0.6586, and 0.6544, respectively. The workshop was held in conjunction with the AMIA 2019 Fall Symposium. Conclusions A wide variety of methods were used by different teams in both tasks, such as Bidirectional Encoder Representations from Transformers, convolutional neural network, bidirectional long short-term memory, conditional random field, support vector machine, and rule-based strategies. System performances show that relation extraction from FH is a more challenging task when compared to entity identification task.
Collapse
Affiliation(s)
- Feichen Shen
- Division of Digital Health Sciences, Mayo Clinic, Rochester, MN, United States
| | - Sijia Liu
- Division of Digital Health Sciences, Mayo Clinic, Rochester, MN, United States
| | - Sunyang Fu
- Division of Digital Health Sciences, Mayo Clinic, Rochester, MN, United States
| | - Yanshan Wang
- Division of Digital Health Sciences, Mayo Clinic, Rochester, MN, United States
| | - Sam Henry
- Department of Information Sciences and Technology, George Mason University, Fairfax, VA, United States
| | - Ozlem Uzuner
- Department of Information Sciences and Technology, George Mason University, Fairfax, VA, United States.,Department of Biomedical Informatics, Massachusetts Institute of Technology, Cambridge, MA, United States.,Department of Biomedical Informatics, Harvard Medical School, Boston, MA, United States
| | - Hongfang Liu
- Division of Digital Health Sciences, Mayo Clinic, Rochester, MN, United States
| |
Collapse
|
30
|
Wang Z, Yan B, Wu C, Wu B, Wang X, Zheng K. Graph Adaptation Network with Domain-Specific Word Alignment for Cross-Domain Relation Extraction. Sensors (Basel) 2020; 20:s20247180. [PMID: 33333844 PMCID: PMC7765263 DOI: 10.3390/s20247180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/06/2020] [Revised: 12/05/2020] [Accepted: 12/08/2020] [Indexed: 11/29/2022]
Abstract
Cross-domain relation extraction has become an essential approach when target domain lacking labeled data. Most existing works adapted relation extraction models from the source domain to target domain through aligning sequential features, but failed to transfer non-local and non-sequential features such as word co-occurrence which are also critical for cross-domain relation extraction. To address this issue, in this paper, we propose a novel tripartite graph architecture to adapt non-local features when there is no labeled data in the target domain. The graph uses domain words as nodes to model the co-occurrence relation between domain-specific words and domain-independent words. Through graph convolutions on the tripartite graph, the information of domain-specific words is propagated so that the word representation can be fine-tuned to align domain-specific features. In addition, unlike the traditional graph structure, the weights of edges innovatively combine fixed weight and dynamic weight, to capture the global non-local features and avoid introducing noise to word representation. Experiments on three domains of ACE2005 datasets show that our method outperforms the state-of-the-art models by a big margin.
Collapse
Affiliation(s)
- Zhe Wang
- School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China; (Z.W.); (C.W.); (B.W.); (K.Z.)
| | - Bo Yan
- School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
- Correspondence: ; Tel.: +86-176-1124-2518
| | - Chunhua Wu
- School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China; (Z.W.); (C.W.); (B.W.); (K.Z.)
| | - Bin Wu
- School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China; (Z.W.); (C.W.); (B.W.); (K.Z.)
| | - Xiujuan Wang
- School of Computer Science, Beijing University of Technology, Beijing 100124, China;
| | - Kangfeng Zheng
- School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China; (Z.W.); (C.W.); (B.W.); (K.Z.)
| |
Collapse
|
31
|
Abstract
OBJECTIVES We survey recent developments in medical Information Extraction (IE) as reported in the literature from the past three years. Our focus is on the fundamental methodological paradigm shift from standard Machine Learning (ML) techniques to Deep Neural Networks (DNNs). We describe applications of this new paradigm concentrating on two basic IE tasks, named entity recognition and relation extraction, for two selected semantic classes-diseases and drugs (or medications)-and relations between them. METHODS For the time period from 2017 to early 2020, we searched for relevant publications from three major scientific communities: medicine and medical informatics, natural language processing, as well as neural networks and artificial intelligence. RESULTS In the past decade, the field of Natural Language Processing (NLP) has undergone a profound methodological shift from symbolic to distributed representations based on the paradigm of Deep Learning (DL). Meanwhile, this trend is, although with some delay, also reflected in the medical NLP community. In the reporting period, overwhelming experimental evidence has been gathered, as illustrated in this survey for medical IE, that DL-based approaches outperform non-DL ones by often large margins. Still, small-sized and access-limited corpora create intrinsic problems for data-greedy DL as do special linguistic phenomena of medical sublanguages that have to be overcome by adaptive learning strategies. CONCLUSIONS The paradigm shift from (feature-engineered) ML to DNNs changes the fundamental methodological rules of the game for medical NLP. This change is by no means restricted to medical IE but should also deeply influence other areas of medical informatics, either NLP- or non-NLP-based.
Collapse
Affiliation(s)
- Udo Hahn
- Jena University Language & Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany
| | - Michel Oleynik
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria
| |
Collapse
|
32
|
Dandala B, Joopudi V, Tsou CH, Liang JJ, Suryanarayanan P. Extraction of Information Related to Drug Safety Surveillance From Electronic Health Record Notes: Joint Modeling of Entities and Relations Using Knowledge-Aware Neural Attentive Models. JMIR Med Inform 2020; 8:e18417. [PMID: 32459650 PMCID: PMC7382020 DOI: 10.2196/18417] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2020] [Revised: 05/12/2020] [Accepted: 05/13/2020] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND An adverse drug event (ADE) is commonly defined as "an injury resulting from medical intervention related to a drug." Providing information related to ADEs and alerting caregivers at the point of care can reduce the risk of prescription and diagnostic errors and improve health outcomes. ADEs captured in structured data in electronic health records (EHRs) as either coded problems or allergies are often incomplete, leading to underreporting. Therefore, it is important to develop capabilities to process unstructured EHR data in the form of clinical notes, which contain a richer documentation of a patient's ADE. Several natural language processing (NLP) systems have been proposed to automatically extract information related to ADEs. However, the results from these systems showed that significant improvement is still required for the automatic extraction of ADEs from clinical notes. OBJECTIVE This study aims to improve the automatic extraction of ADEs and related information such as drugs, their attributes, and reason for administration from the clinical notes of patients. METHODS This research was conducted using discharge summaries from the Medical Information Mart for Intensive Care III (MIMIC-III) database obtained through the 2018 National NLP Clinical Challenges (n2c2) annotated with drugs, drug attributes (ie, strength, form, frequency, route, dosage, duration), ADEs, reasons, and relations between drugs and other entities. We developed a deep learning-based system for extracting these drug-centric concepts and relations simultaneously using a joint method enhanced with contextualized embeddings, a position-attention mechanism, and knowledge representations. The joint method generated different sentence representations for each drug, which were then used to extract related concepts and relations simultaneously. Contextualized representations trained on the MIMIC-III database were used to capture context-sensitive meanings of words. The position-attention mechanism amplified the benefits of the joint method by generating sentence representations that capture long-distance relations. Knowledge representations were obtained from graph embeddings created using the US Food and Drug Administration Adverse Event Reporting System database to improve relation extraction, especially when contextual clues were insufficient. RESULTS Our system achieved new state-of-the-art results on the n2c2 data set, with significant improvements in recognizing crucial drug-reason (F1=0.650 versus F1=0.579) and drug-ADE (F1=0.490 versus F1=0.476) relations. CONCLUSIONS This study presents a system for extracting drug-centric concepts and relations that outperformed current state-of-the-art results and shows that contextualized embeddings, position-attention mechanisms, and knowledge graph embeddings effectively improve deep learning-based concepts and relation extraction. This study demonstrates the potential for deep learning-based methods to help extract real-world evidence from unstructured patient data for drug safety surveillance.
Collapse
|
33
|
Sousa D, Lamurias A, Couto FM. Improving accessibility and distinction between negative results in biomedical relation extraction. Genomics Inform 2020; 18:e20. [PMID: 32634874 PMCID: PMC7362944 DOI: 10.5808/gi.2020.18.2.e20] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2020] [Accepted: 05/26/2020] [Indexed: 12/03/2022] Open
Abstract
Accessible negative results are relevant for researchers and clinicians not only to limit their search space but also to prevent the costly re-exploration of research hypotheses. However, most biomedical relation extraction datasets do not seek to distinguish between a false and a negative relation among two biomedical entities. Furthermore, datasets created using distant supervision techniques also have some false negative relations that constitute undocumented/unknown relations (missing from a knowledge base). We propose to improve the distinction between these concepts, by revising a subset of the relations marked as false on the phenotype-gene relations corpus and give the first steps to automatically distinguish between the false (F), negative (N), and unknown (U) results. Our work resulted in a sample of 127 manually annotated FNU relations and a weighted-F1 of 0.5609 for their automatic distinction. This work was developed during the 6th Biomedical Linked Annotation Hackathon (BLAH6).
Collapse
Affiliation(s)
- Diana Sousa
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 1749-016, Portugal
| | - Andre Lamurias
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 1749-016, Portugal
| | - Francisco M Couto
- LASIGE, Faculdade de Ciências, Universidade de Lisboa, Campo Grande 1749-016, Portugal
| |
Collapse
|
34
|
Jiang K, Feng S, Huang L, Chen T, Bernard GR. Mining Potential Effects of HUMIRA in Twitter Posts Through Relational Similarity. Stud Health Technol Inform 2020; 270:874-878. [PMID: 32570507 DOI: 10.3233/shti200286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
HUMIRA, a biologic therapy, has been approved to treat autoimmune diseases and been marketed in many countries worldwide. Much like other medications, it demonstrates many effects on the human body. It is important to understand its effects from the information generated by its users, and social media is one of the venues its users share their experience with the medication. To understand what HUMIRA effects were reported on Twitter, we utilized a relational similarity-based approach to infer HUMIRA effects based upon known medication-effect relations of other medications. With a corpus of 3.6 million preprocessed, "clean" tweets, a total of 55 effects were identified, and among them, 46 were previously observed, and nine were potentially unreported after verification with six reliable sources. The results not only indicate that many HUMIRA effects shared by the Twitter users are consistent with those previously reported, but also demonstrate the power and utility of our method, making it applicable to studying effects of other medications shared by Twitter users.
Collapse
Affiliation(s)
- Keyuan Jiang
- Purdue University Northwest, Hammond, Indiana, U.S.A
| | - Shichao Feng
- University of North Texas, Denton, TX 76203, U.S.A
| | - Liyuan Huang
- Purdue University Northwest, Hammond, Indiana, U.S.A
| | - Tingyu Chen
- Purdue University Northwest, Hammond, Indiana, U.S.A
| | | |
Collapse
|
35
|
Liu X, Fan J, Dong S. Document-Level Biomedical Relation Extraction Leveraging Pretrained Self-Attention Structure and Entity Replacement: Algorithm and Pretreatment Method Validation Study. JMIR Med Inform 2020; 8:e17644. [PMID: 32469325 PMCID: PMC7314385 DOI: 10.2196/17644] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2019] [Revised: 03/02/2020] [Accepted: 03/19/2020] [Indexed: 01/26/2023] Open
Abstract
Background The most current methods applied for intrasentence relation extraction in the biomedical literature are inadequate for document-level relation extraction, in which the relationship may cross sentence boundaries. Hence, some approaches have been proposed to extract relations by splitting the document-level datasets through heuristic rules and learning methods. However, these approaches may introduce additional noise and do not really solve the problem of intersentence relation extraction. It is challenging to avoid noise and extract cross-sentence relations. Objective This study aimed to avoid errors by dividing the document-level dataset, verify that a self-attention structure can extract biomedical relations in a document with long-distance dependencies and complex semantics, and discuss the relative benefits of different entity pretreatment methods for biomedical relation extraction. Methods This paper proposes a new data preprocessing method and attempts to apply a pretrained self-attention structure for document biomedical relation extraction with an entity replacement method to capture very long-distance dependencies and complex semantics. Results Compared with state-of-the-art approaches, our method greatly improved the precision. The results show that our approach increases the F1 value, compared with state-of-the-art methods. Through experiments of biomedical entity pretreatments, we found that a model using an entity replacement method can improve performance. Conclusions When considering all target entity pairs as a whole in the document-level dataset, a pretrained self-attention structure is suitable to capture very long-distance dependencies and learn the textual context and complicated semantics. A replacement method for biomedical entities is conducive to biomedical relation extraction, especially to document-level relation extraction.
Collapse
Affiliation(s)
- Xiaofeng Liu
- Communication and Computer Network Key Laboratory of Guangdong, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
| | - Jianye Fan
- Communication and Computer Network Key Laboratory of Guangdong, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
| | - Shoubin Dong
- Communication and Computer Network Key Laboratory of Guangdong, School of Computer Science and Engineering, South China University of Technology, Guangzhou, China
| |
Collapse
|
36
|
Caufield JH, Ping P. New advances in extracting and learning from protein-protein interactions within unstructured biomedical text data. Emerg Top Life Sci 2019; 3:357-369. [PMID: 33523203 DOI: 10.1042/etls20190003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2019] [Revised: 07/11/2019] [Accepted: 07/16/2019] [Indexed: 12/14/2022]
Abstract
Protein-protein interactions, or PPIs, constitute a basic unit of our understanding of protein function. Though substantial effort has been made to organize PPI knowledge into structured databases, maintenance of these resources requires careful manual curation. Even then, many PPIs remain uncurated within unstructured text data. Extracting PPIs from experimental research supports assembly of PPI networks and highlights relationships crucial to elucidating protein functions. Isolating specific protein-protein relationships from numerous documents is technically demanding by both manual and automated means. Recent advances in the design of these methods have leveraged emerging computational developments and have demonstrated impressive results on test datasets. In this review, we discuss recent developments in PPI extraction from unstructured biomedical text. We explore the historical context of these developments, recent strategies for integrating and comparing PPI data, and their application to advancing the understanding of protein function. Finally, we describe the challenges facing the application of PPI mining to the text concerning protein families, using the multifunctional 14-3-3 protein family as an example.
Collapse
Affiliation(s)
- J Harry Caufield
- The NIH BD2K Center of Excellence in Biomedical Computing, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Physiology, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
| | - Peipei Ping
- The NIH BD2K Center of Excellence in Biomedical Computing, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Physiology, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Medicine/Cardiology, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Department of Bioinformatics, University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
- Scalable Analytics Institute (ScAi), University of California at Los Angeles, Los Angeles, CA 90095, U.S.A
| |
Collapse
|
37
|
Li F, Yu H. An investigation of single-domain and multidomain medication and adverse drug event relation extraction from electronic health record notes using advanced deep learning models. J Am Med Inform Assoc 2019; 26:646-654. [PMID: 30938761 PMCID: PMC6562161 DOI: 10.1093/jamia/ocz018] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2018] [Revised: 01/15/2019] [Accepted: 01/31/2019] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE We aim to evaluate the effectiveness of advanced deep learning models (eg, capsule network [CapNet], adversarial training [ADV]) for single-domain and multidomain relation extraction from electronic health record (EHR) notes. MATERIALS AND METHODS We built multiple deep learning models with increased complexity, namely a multilayer perceptron (MLP) model and a CapNet model for single-domain relation extraction and fully shared (FS), shared-private (SP), and adversarial training (ADV) modes for multidomain relation extraction. Our models were evaluated in 2 ways: first, we compared our models using our expert-annotated cancer (the MADE1.0 corpus) and cardio corpora; second, we compared our models with the systems in the MADE1.0 and i2b2 challenges. RESULTS Multidomain models outperform single-domain models by 0.7%-1.4% in F1 (t test P < .05), but the results of FS, SP, and ADV modes are mixed. Our results show that the MLP model generally outperforms the CapNet model by 0.1%-1.0% in F1. In the comparisons with other systems, the CapNet model achieves the state-of-the-art result (87.2% in F1) in the cancer corpus and the MLP model generally outperforms MedEx in the cancer, cardiovascular diseases, and i2b2 corpora. CONCLUSIONS Our MLP or CapNet model generally outperforms other state-of-the-art systems in medication and adverse drug event relation extraction. Multidomain models perform better than single-domain models. However, neither the SP nor the ADV mode can always outperform the FS mode significantly. Moreover, the CapNet model is not superior to the MLP model for our corpora.
Collapse
Affiliation(s)
- Fei Li
- Department of Computer Science, University of Massachusetts Lowell, Lowell, Massachusetts, USA
- Center for Healthcare Organization and Implementation Research, Bedford Veterans Affairs Medical Center, Bedford, Massachusetts, USA
| | - Hong Yu
- Department of Computer Science, University of Massachusetts Lowell, Lowell, Massachusetts, USA
- Center for Healthcare Organization and Implementation Research, Bedford Veterans Affairs Medical Center, Bedford, Massachusetts, USA
| |
Collapse
|
38
|
Abstract
In neuropsychological assessment, semantic fluency is a widely accepted measure of executive function and access to semantic memory. While fluency scores are typically reported as the number of unique words produced, several alternative manual scoring methods have been proposed that provide additional insights into performance, such as clusters of semantically related items. Many automatic scoring methods yield metrics that are difficult to relate to the theories behind manual scoring methods, and most require manually-curated linguistic ontologies or large corpus infrastructure. In this paper, we propose a novel automatic scoring method based on Wikipedia, Backlink-VSM, which is easily adaptable to any of the 61 languages with more than 100k Wikipedia entries, can account for cultural differences in semantic relatedness, and covers a wide range of item categories. Our Backlink-VSM method combines relational knowledge as represented by links between Wikipedia entries (Backlink model) with a semantic proximity metric derived from distributional representations (vector space model; VSM). Backlink-VSM yields measures that approximate manual clustering and switching analyses, providing a straightforward link to the substantial literature that uses these metrics. We illustrate our approach with examples from two languages (English and Korean), and two commonly used categories of items (animals and fruits). For both Korean and English, we show that the measures generated by our automatic scoring procedure correlate well with manual annotations. We also successfully replicate findings that older adults produce significantly fewer switches compared to younger adults. Furthermore, our automatic scoring procedure outperforms the manual scoring method and a WordNet-based model in separating younger and older participants measured by binary classification accuracy for both English and Korean datasets. Our method also generalizes to a different category (fruit), demonstrating its adaptability.
Collapse
Affiliation(s)
- Najoung Kim
- School of Computing, Korea Advanced Institute of Science and Technology, Daejeon, South Korea
| | - Jung-Ho Kim
- School of Computing, Korea Advanced Institute of Science and Technology, Daejeon, South Korea
| | - Maria K Wolters
- School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
| | - Sarah E MacPherson
- Human Cognitive Neuroscience, Department of Psychology, University of Edinburgh, Edinburgh, United Kingdom
| | - Jong C Park
- School of Computing, Korea Advanced Institute of Science and Technology, Daejeon, South Korea
| |
Collapse
|
39
|
Sun X, Dong K, Ma L, Sutcliffe R, He F, Chen S, Feng J. Drug-Drug Interaction Extraction via Recurrent Hybrid Convolutional Neural Networks with an Improved Focal Loss. Entropy (Basel) 2019; 21:e21010037. [PMID: 33266753 PMCID: PMC7514143 DOI: 10.3390/e21010037] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/08/2018] [Revised: 01/02/2019] [Accepted: 01/03/2019] [Indexed: 12/24/2022]
Abstract
Drug-drug interactions (DDIs) may bring huge health risks and dangerous effects to a patient’s body when taking two or more drugs at the same time or within a certain period of time. Therefore, the automatic extraction of unknown DDIs has great potential for the development of pharmaceutical agents and the safety of drug use. In this article, we propose a novel recurrent hybrid convolutional neural network (RHCNN) for DDI extraction from biomedical literature. In the embedding layer, the texts mentioning two entities are represented as a sequence of semantic embeddings and position embeddings. In particular, the complete semantic embedding is obtained by the information fusion between a word embedding and its contextual information which is learnt by recurrent structure. After that, the hybrid convolutional neural network is employed to learn the sentence-level features which consist of the local context features from consecutive words and the dependency features between separated words for DDI extraction. Lastly but most significantly, in order to make up for the defects of the traditional cross-entropy loss function when dealing with class imbalanced data, we apply an improved focal loss function to mitigate against this problem when using the DDIExtraction 2013 dataset. In our experiments, we achieve DDI automatic extraction with a micro F-score of 75.48% on the DDIExtraction 2013 dataset, outperforming the state-of-the-art approach by 2.49%.
Collapse
Affiliation(s)
- Xia Sun
- Department of Information Science and Technology, Northwest University, Xi’an 710127, China
- Correspondence: (X.S.); (J.F.)
| | - Ke Dong
- Department of Information Science and Technology, Northwest University, Xi’an 710127, China
| | - Long Ma
- Department of Information Science and Technology, Northwest University, Xi’an 710127, China
| | - Richard Sutcliffe
- Department of Information Science and Technology, Northwest University, Xi’an 710127, China
| | - Feijuan He
- Department of Computer Science, Xi’an Jiaotong University City College, Xi’an 710069, China
| | - Sushing Chen
- Department of Computer Information Science and Engineering, University of Florida, Gainesville, FL 32608, USA
| | - Jun Feng
- Department of Information Science and Technology, Northwest University, Xi’an 710127, China
- Correspondence: (X.S.); (J.F.)
| |
Collapse
|
40
|
Li F, Liu W, Yu H. Extraction of Information Related to Adverse Drug Events from Electronic Health Record Notes: Design of an End-to-End Model Based on Deep Learning. JMIR Med Inform 2018; 6:e12159. [PMID: 30478023 PMCID: PMC6288593 DOI: 10.2196/12159] [Citation(s) in RCA: 39] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2018] [Revised: 10/31/2018] [Accepted: 11/09/2018] [Indexed: 12/26/2022] Open
Abstract
Background Pharmacovigilance and drug-safety surveillance are crucial for monitoring adverse drug events (ADEs), but the main ADE-reporting systems such as Food and Drug Administration Adverse Event Reporting System face challenges such as underreporting. Therefore, as complementary surveillance, data on ADEs are extracted from electronic health record (EHR) notes via natural language processing (NLP). As NLP develops, many up-to-date machine-learning techniques are introduced in this field, such as deep learning and multi-task learning (MTL). However, only a few studies have focused on employing such techniques to extract ADEs. Objective We aimed to design a deep learning model for extracting ADEs and related information such as medications and indications. Since extraction of ADE-related information includes two steps—named entity recognition and relation extraction—our second objective was to improve the deep learning model using multi-task learning between the two steps. Methods We employed the dataset from the Medication, Indication and Adverse Drug Events (MADE) 1.0 challenge to train and test our models. This dataset consists of 1089 EHR notes of cancer patients and includes 9 entity types such as Medication, Indication, and ADE and 7 types of relations between these entities. To extract information from the dataset, we proposed a deep-learning model that uses a bidirectional long short-term memory (BiLSTM) conditional random field network to recognize entities and a BiLSTM-Attention network to extract relations. To further improve the deep-learning model, we employed three typical MTL methods, namely, hard parameter sharing, parameter regularization, and task relation learning, to build three MTL models, called HardMTL, RegMTL, and LearnMTL, respectively. Results Since extraction of ADE-related information is a two-step task, the result of the second step (ie, relation extraction) was used to compare all models. We used microaveraged precision, recall, and F1 as evaluation metrics. Our deep learning model achieved state-of-the-art results (F1=65.9%), which is significantly higher than that (F1=61.7%) of the best system in the MADE1.0 challenge. HardMTL further improved the F1 by 0.8%, boosting the F1 to 66.7%, whereas RegMTL and LearnMTL failed to boost the performance. Conclusions Deep learning models can significantly improve the performance of ADE-related information extraction. MTL may be effective for named entity recognition and relation extraction, but it depends on the methods, data, and other factors. Our results can facilitate research on ADE detection, NLP, and machine learning.
Collapse
Affiliation(s)
- Fei Li
- Department of Computer Science, University of Massachusetts Lowell, Lowell, MA, United States.,Center for Healthcare Organization and Implementation Research, Bedford Veterans Affairs Medical Center, Bedford, MA, United States.,Department of Medicine, University of Massachusetts Medical School, Worcester, MA, United States
| | - Weisong Liu
- Department of Computer Science, University of Massachusetts Lowell, Lowell, MA, United States.,Center for Healthcare Organization and Implementation Research, Bedford Veterans Affairs Medical Center, Bedford, MA, United States.,Department of Medicine, University of Massachusetts Medical School, Worcester, MA, United States
| | - Hong Yu
- Department of Computer Science, University of Massachusetts Lowell, Lowell, MA, United States.,Center for Healthcare Organization and Implementation Research, Bedford Veterans Affairs Medical Center, Bedford, MA, United States.,Department of Medicine, University of Massachusetts Medical School, Worcester, MA, United States.,School of Computer Science, University of Massachusetts, Amherst, MA, United States
| |
Collapse
|
41
|
Onishi T, Kadohira T, Watanabe I. Relation extraction with weakly supervised learning based on process-structure-property-performance reciprocity. Sci Technol Adv Mater 2018; 19:649-659. [PMID: 30245757 PMCID: PMC6147111 DOI: 10.1080/14686996.2018.1500852] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/14/2018] [Revised: 07/12/2018] [Accepted: 07/12/2018] [Indexed: 06/08/2023]
Abstract
In this study, we develop a computer-aided material design system to represent and extract knowledge related to material design from natural language texts. A machine learning model is trained on a text corpus weakly labeled by minimal annotated relationship data (~100 labeled relationships) to extract knowledge from scientific articles. The knowledge is represented by relationships between scientific concepts, such as {annealing, grain size, strength}. The extracted relationships are represented as a knowledge graph formatted according to design charts, inspired by the process-structure-property-performance (PSPP) reciprocity. The design chart provides an intuitive effect of processes on properties and prospective processes to achieve the certain desired properties. Our system semantically searches the scientific literature and provides knowledge in the form of a design chart, and we hope it contributes more efficient developments of new materials.
Collapse
Affiliation(s)
- Takeshi Onishi
- Toyota Technological Institute at Chicago, Chicago, IL, USA
| | - Takuya Kadohira
- Research and Services Division of Materials Data and Integrated System, National Institute for Materials Science, Tsukuba, Ibaraki, Japan
| | - Ikumu Watanabe
- Research Center for Structural Materials, National Institute for Materials Science, Ibaraki, Tsukuba, Japan
| |
Collapse
|
42
|
Wan H, Moens MF, Luyten W, Zhou X, Mei Q, Liu L, Tang J. Extracting relations from traditional Chinese medicine literature via heterogeneous entity networks. J Am Med Inform Assoc 2015. [PMID: 26224335 DOI: 10.1093/jamia/ocv092] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE Traditional Chinese medicine (TCM) is a unique and complex medical system that has developed over thousands of years. This article studies the problem of automatically extracting meaningful relations of entities from TCM literature, for the purposes of assisting clinical treatment or poly-pharmacology research and promoting the understanding of TCM in Western countries. METHODS Instead of separately extracting each relation from a single sentence or document, we propose to collectively and globally extract multiple types of relations (eg, herb-syndrome, herb-disease, formula-syndrome, formula-disease, and syndrome-disease relations) from the entire corpus of TCM literature, from the perspective of network mining. In our analysis, we first constructed heterogeneous entity networks from the TCM literature, in which each edge is a candidate relation, then used a heterogeneous factor graph model (HFGM) to simultaneously infer the existence of all the edges. We also employed a semi-supervised learning algorithm estimate the model's parameters. RESULTS We performed our method to extract relations from a large dataset consisting of more than 100,000 TCM article abstracts. Our results show that the performance of the HFGM at extracting all types of relations from TCM literature was significantly better than a traditional support vector machine (SVM) classifier (increasing the average precision by 11.09%, the recall by 13.83%, and the F1-measure by 12.47% for different types of relations, compared with a traditional SVM classifier). CONCLUSION This study exploits the power of collective inference and proposes an HFGM based on heterogeneous entity networks, which significantly improved our ability to extract relations from TCM literature.
Collapse
Affiliation(s)
- Huaiyu Wan
- Department of Computer Science and Technology, Tsinghua University, Beijing, China School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
| | | | | | - Xuezhong Zhou
- School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
| | - Qiaozhu Mei
- School of Information, University of Michigan, Ann Arbor, Michigan, USA
| | - Lu Liu
- TangoMe Inc, Mountain View, CA, USA
| | - Jie Tang
- Department of Computer Science and Technology, Tsinghua University, Beijing, China
| |
Collapse
|
43
|
Rastegar-Mojarad M, Elayavilli RK, Li D, Liu H. Assessing the Need of Discourse-Level Analysis in Identifying Evidence of Drug-Disease Relations in Scientific Literature. Stud Health Technol Inform 2015; 216:539-543. [PMID: 26262109 PMCID: PMC5859928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/04/2023]
Abstract
Relation extraction typically involves the extraction of relations between two or more entities occurring within a single or multiple sentences. In this study, we investigated the significance of extracting information from multiple sentences specifically in the context of drug-disease relation discovery. We used multiple resources such as Semantic Medline, a literature based resource, and Medline search (for filtering spurious results) and inferred 8,772 potential drug-disease pairs. Our analysis revealed that 6,450 (73.5%) of the 8,772 potential drug-disease relations did not occur in a single sentence. Moreover, only 537 of the drug-disease pairs matched the curated gold standard in Comparative Toxicogenomics Database (CTD), a trusted resource for drug-disease relations. Among the 537, nearly 75% (407) of the drug-disease pairs occur in multiple sentences. Our analysis revealed that the drug-disease pairs inferred from Semantic Medline or retrieved from CTD could be extracted from multiple sentences in the literature. This highlights the significance of the need of discourse-level analysis in extracting the relations from biomedical literature.
Collapse
Affiliation(s)
- Majid Rastegar-Mojarad
- Biomedical Statistics & Informatics, Mayo Clinic, Rochester, MN, USA
- University of Wisconsin-Milwaukee, Milwaukee, WI, USA
| | | | - Dingcheng Li
- Biomedical Statistics & Informatics, Mayo Clinic, Rochester, MN, USA
| | - Hongfang Liu
- Biomedical Statistics & Informatics, Mayo Clinic, Rochester, MN, USA
| |
Collapse
|
44
|
Abstract
Objective To research computational methods for discovering body site and severity modifiers in clinical texts. Methods We cast the task of discovering body site and severity modifiers as a relation extraction problem in the context of a supervised machine learning framework. We utilize rich linguistic features to represent the pairs of relation arguments and delegate the decision about the nature of the relationship between them to a support vector machine model. We evaluate our models using two corpora that annotate body site and severity modifiers. We also compare the model performance to a number of rule-based baselines. We conduct cross-domain portability experiments. In addition, we carry out feature ablation experiments to determine the contribution of various feature groups. Finally, we perform error analysis and report the sources of errors. Results The performance of our method for discovering body site modifiers achieves F1 of 0.740–0.908 and our method for discovering severity modifiers achieves F1 of 0.905–0.929. Discussion Results indicate that both methods perform well on both in-domain and out-domain data, approaching the performance of human annotators. The most salient features are token and named entity features, although syntactic dependency features also contribute to the overall performance. The dominant sources of errors are infrequent patterns in the data and inability of the system to discern deeper semantic structures. Conclusions We investigated computational methods for discovering body site and severity modifiers in clinical texts. Our best system is released open source as part of the clinical Text Analysis and Knowledge Extraction System (cTAKES).
Collapse
Affiliation(s)
- Dmitriy Dligach
- Department of Informatics, Boston Children's Hospital and Harvard Medical School, Boston, Massachusetts, USA
| | | | | | | | | |
Collapse
|
45
|
Cherry C, Zhu X, Martin J, de Bruijn B. A la Recherche du Temps Perdu: extracting temporal relations from medical text in the 2012 i2b2 NLP challenge. J Am Med Inform Assoc 2013; 20:843-8. [PMID: 23523875 PMCID: PMC3756270 DOI: 10.1136/amiajnl-2013-001624] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/01/2022] Open
Abstract
Objective An analysis of the timing of events is critical for a deeper understanding of the course of events within a patient record. The 2012 i2b2 NLP challenge focused on the extraction of temporal relationships between concepts within textual hospital discharge summaries. Materials and methods The team from the National Research Council Canada (NRC) submitted three system runs to the second track of the challenge: typifying the time-relationship between pre-annotated entities. The NRC system was designed around four specialist modules containing statistical machine learning classifiers. Each specialist targeted distinct sets of relationships: local relationships, ‘sectime’-type relationships, non-local overlap-type relationships, and non-local causal relationships. Results The best NRC submission achieved a precision of 0.7499, a recall of 0.6431, and an F1 score of 0.6924, resulting in a statistical tie for first place. Post hoc improvements led to a precision of 0.7537, a recall of 0.6455, and an F1 score of 0.6954, giving the highest scores reported on this task to date. Discussion and conclusions Methods for general relation extraction extended well to temporal relations, and gave top-ranked state-of-the-art results. Careful ordering of predictions within result sets proved critical to this success.
Collapse
Affiliation(s)
- Colin Cherry
- Information and Communication Technologies, National Research Council Canada, Ottawa, Ontario, Canada
| | | | | | | |
Collapse
|