1
|
Wang L, Hao H, Yan X, Zhou TH, Ryu KH. From biomedical knowledge graph construction to semantic querying: a comprehensive approach. Sci Rep 2025; 15:8523. [PMID: 40074859 PMCID: PMC11904217 DOI: 10.1038/s41598-025-93334-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2024] [Accepted: 03/06/2025] [Indexed: 03/14/2025] Open
Abstract
In the biomedical field, the construction and application of knowledge graphs are becoming increasingly important because they can effectively integrate and manage large amounts of complex medical information. This study provides a whole-process approach for the biomedical field, from constructing knowledge graphs to semantic query based on knowledge graphs. In the knowledge graph construction stage, we propose the BioPLBC model, which incorporates BioBERT context-embedded features, part of speech and lexical morphological features to achieve entity annotation of medical texts. Based on the constructed biomedical knowledge graph, we also propose the Adaptive Locating and Expanding Query (ALEQ) algorithm, which improves the query speed by locating and dynamically expanding the query subregion. The experimental results indicate that the BioPLBC model consistently achieves higher accuracy than the baseline model across all datasets, while the ALEQ algorithm achieves different degrees of improvement in query accuracy and speed.
Collapse
Affiliation(s)
- Ling Wang
- School of Computer Science, Northeast Electric Power University, 169 Changchun Street, Jilin, 132012, China
| | - Haoyu Hao
- School of Computer Science, Northeast Electric Power University, 169 Changchun Street, Jilin, 132012, China
| | - Xue Yan
- School of Computer Science, Northeast Electric Power University, 169 Changchun Street, Jilin, 132012, China
| | - Tie Hua Zhou
- School of Computer Science, Northeast Electric Power University, 169 Changchun Street, Jilin, 132012, China.
| | - Keun Ho Ryu
- Data Science Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, 700000, Vietnam
- Research Institute, Bigsun System Co., Ltd., Seoul, 06266, Republic of Korea
- Database and Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju, 28644, Republic of Korea
| |
Collapse
|
2
|
Wang P, Zhang J. Prediction of Composite Clinical Outcomes for Childhood Neuroblastoma Using Multi-Omics Data and Machine Learning. Int J Mol Sci 2024; 26:136. [PMID: 39795994 PMCID: PMC11720239 DOI: 10.3390/ijms26010136] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2024] [Revised: 05/20/2024] [Accepted: 05/22/2024] [Indexed: 01/13/2025] Open
Abstract
Neuroblastoma is a common malignant tumor in childhood that seriously endangers the health and lives of children, making it essential to find effective prognostic markers to accurately predict their clinical outcomes. The development of high-throughput technology in the biomedical field has made it possible to obtain multi-omics data, whose integration can compensate for missing or unreliable information in a single data source. In this study, we integrated clinical data and two omics data, i.e., gene expression and DNA methylation data, to study the prognosis of neuroblastoma. Since the features in omics data are redundant, it is crucial to conduct feature selection on them. We proposed a two-step feature selection (TSFS) method to quickly and accurately select the optimal features, where the first step aims at selecting candidate features and the second step is to remove redundant features among them using our proposed maximal association coefficient (MAC). Our goal is to predict composite clinical outcomes for neuroblastoma patients, i.e., their survival time and vital status at the last follow-up, which was validated to be two inter-correlated tasks. We conducted a series of experiments and evaluated the experimental results using accuracy and AUC (area under the ROC curve) evaluation metrics, which indicated that by the combination of the integration of the three types of data, our proposed TSFS method and a multi-task learning method can synergistically improve the reliability and accuracy of the prediction models.
Collapse
Affiliation(s)
| | - Junying Zhang
- School of Computer Science and Technology, Xidian University, Xi’an 710126, China;
| |
Collapse
|
3
|
Yang Y, Lu Y, Zheng Z, Wu H, Lin Y, Qian F, Yan W. MKG-GC: A multi-task learning-based knowledge graph construction framework with personalized application to gastric cancer. Comput Struct Biotechnol J 2024; 23:1339-1347. [PMID: 38585647 PMCID: PMC10995799 DOI: 10.1016/j.csbj.2024.03.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2024] [Revised: 03/24/2024] [Accepted: 03/24/2024] [Indexed: 04/09/2024] Open
Abstract
Over the past decade, information for precision disease medicine has accumulated in the form of textual data. To effectively utilize this expanding medical text, we proposed a multi-task learning-based framework based on hard parameter sharing for knowledge graph construction (MKG), and then used it to automatically extract gastric cancer (GC)-related biomedical knowledge from the literature and identify GC drug candidates. In MKG, we designed three separate modules, MT-BGIPN, MT-SGTF and MT-ScBERT, for entity recognition, entity normalization, and relation classification, respectively. To address the challenges posed by the long and irregular naming of medical entities, the MT-BGIPN utilized bidirectional gated recurrent unit and interactive pointer network techniques, significantly improving entity recognition accuracy to an average F1 value of 84.5% across datasets. In MT-SGTF, we employed the term frequency-inverse document frequency and the gated attention unit. These combine both semantic and characteristic features of entities, resulting in an average Hits@ 1 score of 94.5% across five datasets. The MT-ScBERT integrated cross-text, entity, and context features, yielding an average F1 value of 86.9% across 11 relation classification datasets. Based on the MKG, we then developed a specific knowledge graph for GC (MKG-GC), which encompasses a total of 9129 entities and 88,482 triplets. Lastly, the MKG-GC was used to predict potential GC drugs using a pre-trained language model called BioKGE-BERT and a drug-disease discriminant model based on CNN-BiLSTM. Remarkably, nine out of the top ten predicted drugs have been previously reported as effective for gastric cancer treatment. Finally, an online platform was created for exploration and visualization of MKG-GC at https://www.yanglab-mi.org.cn/MKG-GC/.
Collapse
Affiliation(s)
- Yang Yang
- Computing Science and Artificial Intelligence College, Suzhou City University, Suzhou 215004, China
- School of Computer Science & Technology, Soochow University, Suzhou 215000, China
| | - Yuwei Lu
- School of Computer Science & Technology, Soochow University, Suzhou 215000, China
| | - Zixuan Zheng
- School of Computer Science & Technology, Soochow University, Suzhou 215000, China
| | - Hao Wu
- Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China
| | - Yuxin Lin
- Center for Systems Biology, Soochow University, Suzhou 215123, China
- Department of Urology, the First Affiliated Hospital of Soochow University, Suzhou 215000, China
| | - Fuliang Qian
- Center for Systems Biology, Soochow University, Suzhou 215123, China
- Medical Center of Soochow University, Suzhou 215123, China
- Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
| | - Wenying Yan
- Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China
- Center for Systems Biology, Soochow University, Suzhou 215123, China
- Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
| |
Collapse
|
4
|
Chai X, An S, Chen S, Li W, Feng Z, Li X, Gong H, Luo Q, Li A. Knowledge mining of brain connectivity in massive literature based on transfer learning. Bioinformatics 2024; 40:btae648. [PMID: 39656949 PMCID: PMC11631446 DOI: 10.1093/bioinformatics/btae648] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2024] [Revised: 10/17/2024] [Accepted: 12/04/2024] [Indexed: 12/17/2024] Open
Abstract
MOTIVATION Neuroscientists have long endeavored to map brain connectivity, yet the intricate nature of brain networks often leads them to concentrate on specific regions, hindering efforts to unveil a comprehensive connectivity map. Recent advancements in imaging and text mining techniques have enabled the accumulation of a vast body of literature containing valuable insights into brain connectivity, facilitating the extraction of whole-brain connectivity relations from this corpus. However, the diverse representations of brain region names and connectivity relations pose a challenge for conventional machine learning methods and dictionary-based approaches in identifying all instances accurately. RESULTS We propose BioSEPBERT, a biomedical pre-trained model based on start-end position pointers and BERT. In addition, our model integrates specialized identifiers with enhanced self-attention capabilities for preceding and succeeding brain regions, thereby improving the performance of named entity recognition and relation extraction in neuroscience. Our approach achieves optimal F1 scores of 85.0%, 86.6%, and 86.5% for named entity recognition, connectivity relation extraction, and directional relation extraction, respectively, surpassing state-of-the-art models by 2.6%, 1.1%, and 1.1%. Furthermore, we leverage BioSEPBERT to extract 22.6 million standardized brain regions and 165 072 directional relations from a corpus comprising 1.3 million abstracts and 193 100 full-text articles. The results demonstrate that our model facilitates researchers to rapidly acquire knowledge regarding neural circuits across various brain regions, thereby enhancing comprehension of brain connectivity in specific regions. AVAILABILITY AND IMPLEMENTATION Data and source code are available at: http://atlas.brainsmatics.org/res/BioSEPBERT and https://github.com/Brainsmatics/BioSEPBERT.
Collapse
Affiliation(s)
- Xiaokang Chai
- Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Sile An
- Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Simeng Chen
- Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Wenwei Li
- Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
| | - Zhao Feng
- Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China
| | - Xiangning Li
- Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China
- HUST-Suzhou Institute for Brainsmatics, JITRI, Suzhou 215123, China
| | - Hui Gong
- Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
- HUST-Suzhou Institute for Brainsmatics, JITRI, Suzhou 215123, China
| | - Qingming Luo
- Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China
| | - Anan Li
- Britton Chance Center for Biomedical Photonics, Wuhan National Laboratory for Optoelectronics, Huazhong University of Science and Technology, Wuhan 430074, China
- Key Laboratory of Biomedical Engineering of Hainan Province, School of Biomedical Engineering, Hainan University, Haikou 570228, China
- HUST-Suzhou Institute for Brainsmatics, JITRI, Suzhou 215123, China
| |
Collapse
|
5
|
Yin Y, Kim H, Xiao X, Wei CH, Kang J, Lu Z, Xu H, Fang M, Chen Q. Augmenting biomedical named entity recognition with general-domain resources. J Biomed Inform 2024; 159:104731. [PMID: 39368529 DOI: 10.1016/j.jbi.2024.104731] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2024] [Revised: 09/05/2024] [Accepted: 09/27/2024] [Indexed: 10/07/2024]
Abstract
OBJECTIVE Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets. METHODS We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset. RESULTS We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models consistently outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset. CONCLUSION This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make data, codes, and models publicly available via https://github.com/qingyu-qc/bioner_gerbera.
Collapse
Affiliation(s)
- Yu Yin
- Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
| | - Hyunjae Kim
- Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
| | - Xiao Xiao
- Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
| | - Chih Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
| | - Jaewoo Kang
- Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
| | - Hua Xu
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America
| | - Meng Fang
- Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom.
| | - Qingyu Chen
- Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America.
| |
Collapse
|
6
|
Ahmad F, Muhmood T. Clinical translation of nanomedicine with integrated digital medicine and machine learning interventions. Colloids Surf B Biointerfaces 2024; 241:114041. [PMID: 38897022 DOI: 10.1016/j.colsurfb.2024.114041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/01/2024] [Revised: 06/11/2024] [Accepted: 06/13/2024] [Indexed: 06/21/2024]
Abstract
Nanomaterials based therapeutics transform the ways of disease prevention, diagnosis and treatment with increasing sophistications in nanotechnology at a breakneck pace, but very few could reach to the clinic due to inconsistencies in preclinical studies followed by regulatory hinderances. To tackle this, integrating the nanomedicine discovery with digital medicine provide technologies as tools of specific biological activity measurement. Hence, overcome the redundancies in nanomedicine discovery by the on-site data acquisition and analytics through integrating intelligent sensors and artificial intelligence (AI) or machine learning (ML). Integrated AI/ML wearable sensors directly gather clinically relevant biochemical information from the subject's body and process data for physicians to make right clinical decision(s) in a time and cost-effective way. This review summarizes insights and recommend the infusion of actionable big data computation enabled sensors in burgeoning field of nanomedicine at academia, research institutes, and pharmaceutical industries, with a potential of clinical translation. Furthermore, many blind spots are present in modern clinically relevant computation, one of which could prevent ML-guided low-cost new nanomedicine development from being successfully translated into the clinic was also discussed.
Collapse
Affiliation(s)
- Farooq Ahmad
- State Key Laboratory of Chemistry and Utilization of Carbon Based Energy Resources, College of Chemistry, Xinjiang University, Urumqi 830017, China.
| | - Tahir Muhmood
- International Iberian Nanotechnology Laboratory (INL), Avenida Mestre José Veiga, Braga 4715-330, Portugal.
| |
Collapse
|
7
|
Tho BD, Nguyen MT, Le DT, Ying LL, Inoue S, Nguyen TT. Improving biomedical Named Entity Recognition with additional external contexts. J Biomed Inform 2024; 156:104674. [PMID: 38871012 DOI: 10.1016/j.jbi.2024.104674] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2024] [Revised: 05/26/2024] [Accepted: 06/07/2024] [Indexed: 06/15/2024]
Abstract
OBJECTIVE Biomedical Named Entity Recognition (bio NER) is the task of recognizing named entities in biomedical texts. This paper introduces a new model that addresses bio NER by considering additional external contexts. Different from prior methods that mainly use original input sequences for sequence labeling, the model takes into account additional contexts to enhance the representation of entities in the original sequences, since additional contexts can provide enhanced information for the concept explanation of biomedical entities. METHODS To exploit an additional context, given an original input sequence, the model first retrieves the relevant sentences from PubMed and then ranks the retrieved sentences to form the contexts. It next combines the context with the original input sequence to form a new enhanced sequence. The original and new enhanced sequences are fed into PubMedBERT for learning feature representation. To obtain more fine-grained features, the model stacks a BiLSTM layer on top of PubMedBERT. The final named entity label prediction is done by using a CRF layer. The model is jointly trained in an end-to-end manner to take advantage of the additional context for NER of the original sequence. RESULTS Experimental results on six biomedical datasets show that the proposed model achieves promising performance compared to strong baselines and confirms the contribution of additional contexts for bio NER. CONCLUSION The promising results confirm three important points. First, the additional context from PubMed helps to improve the quality of the recognition of biomedical entities. Second, PubMed is more appropriate than the Google search engine for providing relevant information of bio NER. Finally, more relevant sentences from the context are more beneficial than irrelevant ones to provide enhanced information for the original input sequences. The model is flexible to integrate any additional context types for the NER task.
Collapse
Affiliation(s)
- Bui Duc Tho
- Hung Yen University of Technology and Education, Viet Nam; University of Engineering and Technology, Vietnam National University, Hanoi, Viet Nam.
| | | | - Dung Tien Le
- CINNAMON AI, 10th Floor, Geleximco Building, 36 Hoang Cau, Dong Da District, Hanoi, Viet Nam.
| | - Lin-Lung Ying
- CINNAMON AI, 10th Floor, Geleximco Building, 36 Hoang Cau, Dong Da District, Hanoi, Viet Nam.
| | - Shumpei Inoue
- CINNAMON AI, 10th Floor, Geleximco Building, 36 Hoang Cau, Dong Da District, Hanoi, Viet Nam.
| | - Tri-Thanh Nguyen
- University of Engineering and Technology, Vietnam National University, Hanoi, Viet Nam.
| |
Collapse
|
8
|
Huang DL, Zeng Q, Xiong Y, Liu S, Pang C, Xia M, Fang T, Ma Y, Qiang C, Zhang Y, Zhang Y, Li H, Yuan Y. A Combined Manual Annotation and Deep-Learning Natural Language Processing Study on Accurate Entity Extraction in Hereditary Disease Related Biomedical Literature. Interdiscip Sci 2024; 16:333-344. [PMID: 38340264 PMCID: PMC11289304 DOI: 10.1007/s12539-024-00605-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Revised: 01/02/2024] [Accepted: 01/03/2024] [Indexed: 02/12/2024]
Abstract
We report a combined manual annotation and deep-learning natural language processing study to make accurate entity extraction in hereditary disease related biomedical literature. A total of 400 full articles were manually annotated based on published guidelines by experienced genetic interpreters at Beijing Genomics Institute (BGI). The performance of our manual annotations was assessed by comparing our re-annotated results with those publicly available. The overall Jaccard index was calculated to be 0.866 for the four entity types-gene, variant, disease and species. Both a BERT-based large name entity recognition (NER) model and a DistilBERT-based simplified NER model were trained, validated and tested, respectively. Due to the limited manually annotated corpus, Such NER models were fine-tuned with two phases. The F1-scores of BERT-based NER for gene, variant, disease and species are 97.28%, 93.52%, 92.54% and 95.76%, respectively, while those of DistilBERT-based NER are 95.14%, 86.26%, 91.37% and 89.92%, respectively. Most importantly, the entity type of variant has been extracted by a large language model for the first time and a comparable F1-score with the state-of-the-art variant extraction model tmVar has been achieved.
Collapse
Affiliation(s)
- Dao-Ling Huang
- BGI Research, Shenzhen, 518083, China.
- Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China.
| | - Quanlei Zeng
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yun Xiong
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Shuixia Liu
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Chaoqun Pang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Menglei Xia
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Ting Fang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yanli Ma
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Cuicui Qiang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yi Zhang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yu Zhang
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Hong Li
- BGI-Wuhan Clinical Laboratories, BGI-Shenzhen, Wuhan, 430074, China
| | - Yuying Yuan
- Clinical Laboratory of BGI Health, BGI-Shenzhen, Shenzhen, 518083, China
| |
Collapse
|
9
|
Hou G, Jian Y, Zhao Q, Quan X, Zhang H. Language model based on deep learning network for biomedical named entity recognition. Methods 2024; 226:71-77. [PMID: 38641084 DOI: 10.1016/j.ymeth.2024.04.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/01/2023] [Revised: 12/22/2023] [Accepted: 04/16/2024] [Indexed: 04/21/2024] Open
Abstract
Biomedical Named Entity Recognition (BioNER) is one of the most basic tasks in biomedical text mining, which aims to automatically identify and classify biomedical entities in text. Recently, deep learning-based methods have been applied to Biomedical Named Entity Recognition and have shown encouraging results. However, many biological entities are polysemous and ambiguous, which is one of the main obstacles to the task of biomedical named entity recognition. Deep learning methods require large amounts of training data, so the lack of data also affect the performance of model recognition. To solve the problem of polysemous words and insufficient data, for the task of biomedical named entity recognition, we propose a multi-task learning framework fused with language model based on the BiLSTM-CRF architecture. Our model uses a language model to design a differential encoding of the context, which could obtain dynamic word vectors to distinguish words in different datasets. Moreover, we use a multi-task learning method to collectively share the dynamic word vector of different types of entities to improve the recognition performance of each type of entity. Experimental results show that our model reduces the false positives caused by polysemous words through differentiated coding, and improves the performance of each subtask by sharing information between different entity data. Compared with other state-of-the art methods, our model achieved superior results in four typical training sets, and achieved the best results in F1 values.
Collapse
Affiliation(s)
- Guan Hou
- College of Artificial Intelligence, Nankai University, Tianjin, China
| | - Yuhao Jian
- College of Artificial Intelligence, Nankai University, Tianjin, China
| | - Qingqing Zhao
- College of Artificial Intelligence, Nankai University, Tianjin, China
| | - Xiongwen Quan
- College of Artificial Intelligence, Nankai University, Tianjin, China
| | - Han Zhang
- College of Artificial Intelligence, Nankai University, Tianjin, China.
| |
Collapse
|
10
|
Lou Y, Zhu X, Tan K. Dictionary-based matching graph network for biomedical named entity recognition. Sci Rep 2023; 13:21667. [PMID: 38066007 PMCID: PMC10709457 DOI: 10.1038/s41598-023-48564-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2023] [Accepted: 11/28/2023] [Indexed: 12/18/2023] Open
Abstract
Biomedical named entity recognition (BioNER) is an essential task in biomedical information analysis. Recently, deep neural approaches have become widely utilized for BioNER. Biomedical dictionaries, implemented through a masked manner, are frequently employed in these methods to enhance entity recognition. However, their performance remains limited. In this work, we propose a dictionary-based matching graph network for BioNER. This approach utilizes the matching graph method to project all possible dictionary-based entity combinations in the text onto a directional graph. The network is implemented coherently with a bi-directional graph convolutional network (BiGCN) that incorporates the matching graph information. Our proposed approach fully leverages the dictionary-based matching graph instead of a simple masked manner. We have conducted numerous experiments on five typical Bio-NER datasets. The proposed model shows significant improvements in F1 score compared to the state-of-the-art (SOTA) models: 2.8% on BC2GM, 1.3% on BC4CHEMD, 1.1% on BC5CDR, 1.6% on NCBI-disease, and 0.5% on JNLPBA. The results show that our model, which is superior to other models, can effectively recognize natural biomedical named entities.
Collapse
Affiliation(s)
- Yinxia Lou
- School of Artificial Intelligence, Jianghan University, Wuhan, 430056, China
| | - Xun Zhu
- School of Artificial Intelligence, Jianghan University, Wuhan, 430056, China.
| | - Kai Tan
- State Key Laboratory of Estuarine and Coastal Research, East China Normal University, Shanghai, 200241, China
| |
Collapse
|
11
|
Yang C, Deng J, Chen X, An Y. SPBERE: Boosting span-based pipeline biomedical entity and relation extraction via entity information. J Biomed Inform 2023; 145:104456. [PMID: 37482171 DOI: 10.1016/j.jbi.2023.104456] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2022] [Revised: 05/03/2023] [Accepted: 07/18/2023] [Indexed: 07/25/2023]
Abstract
Triplet extraction is one of the fundamental tasks in biomedical text mining. Compared with traditional pipeline approaches, joint methods can alleviate the error propagation problem from entity recognition to relation classification. However, existing methods face challenges in detecting overlapping entities and overlapping relations, which are ubiquitous in biomedical texts. In this work, we propose a novel pipeline method of end-to-end biomedical triplet extraction. In particular, a span-based detection strategy is used to detect the overlapping triplets by enumerating possible candidate spans and entity pairs. The strategy is further used to capture different contextualized representations via an entity model and a relation model, respectively. Furthermore, to enhance interrelation between spans, entity information from the output of the entity model is used to construct the input for the relation model without utilizing any external knowledge. Our approach is evaluated on the drug-drug interaction (DDI) and chemical-protein interaction (CHEMPROT) datasets, exhibiting improvement of the absolute F1-score in relation extraction by 3.5%-3.7% compared prior work. The experimental results highlight the importance of overlapping triplet detection using the span-based approach, acquisition of various contextualized representations via different in-domain pre-trained language models, and early fusion of entity information in the relation model.
Collapse
Affiliation(s)
- Chenglin Yang
- Big Data Institute, Central South University, Changsha, 410083, China; School of Life Sciences, Central South University, Changsha, 410083, China
| | - Jiamei Deng
- Big Data Institute, Central South University, Changsha, 410083, China
| | - Xianlai Chen
- Big Data Institute, Central South University, Changsha, 410083, China; Key Laboratory of Medical Information Research, Central South University, Changsha, 410083, China.
| | - Ying An
- Big Data Institute, Central South University, Changsha, 410083, China.
| |
Collapse
|
12
|
Cai L, Li J, Lv H, Liu W, Niu H, Wang Z. Integrating domain knowledge for biomedical text analysis into deep learning: A survey. J Biomed Inform 2023; 143:104418. [PMID: 37290540 DOI: 10.1016/j.jbi.2023.104418] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2022] [Revised: 04/24/2023] [Accepted: 05/31/2023] [Indexed: 06/10/2023]
Abstract
The past decade has witnessed an explosion of textual information in the biomedical field. Biomedical texts provide a basis for healthcare delivery, knowledge discovery, and decision-making. Over the same period, deep learning has achieved remarkable performance in biomedical natural language processing, however, its development has been limited by well-annotated datasets and interpretability. To solve this, researchers have considered combining domain knowledge (such as biomedical knowledge graph) with biomedical data, which has become a promising means of introducing more information into biomedical datasets and following evidence-based medicine. This paper comprehensively reviews more than 150 recent literature studies on incorporating domain knowledge into deep learning models to facilitate typical biomedical text analysis tasks, including information extraction, text classification, and text generation. We eventually discuss various challenges and future directions.
Collapse
Affiliation(s)
- Linkun Cai
- School of Biological Science and Medical Engineering, Beihang University, 100191 Beijing, China
| | - Jia Li
- Department of Radiology, Beijing Friendship Hospital, Capital Medical University, 100050 Beijing, China
| | - Han Lv
- Department of Radiology, Beijing Friendship Hospital, Capital Medical University, 100050 Beijing, China
| | - Wenjuan Liu
- Aerospace Center Hospital, 100049 Beijing, China
| | - Haijun Niu
- School of Biological Science and Medical Engineering, Beihang University, 100191 Beijing, China
| | - Zhenchang Wang
- School of Biological Science and Medical Engineering, Beihang University, 100191 Beijing, China; Department of Radiology, Beijing Friendship Hospital, Capital Medical University, 100050 Beijing, China.
| |
Collapse
|
13
|
Gendrin A, Souliotis L, Loudon-Griffiths J, Aggarwal R, Amoako D, Desouza G, Dimitrievska S, Metcalfe P, Louvet E, Sahni H. Identifying Patient Populations in Texts Describing Drug Approvals Through Deep Learning-Based Information Extraction: Development of a Natural Language Processing Algorithm. JMIR Form Res 2023; 7:e44876. [PMID: 37347514 DOI: 10.2196/44876] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/07/2022] [Revised: 03/30/2023] [Accepted: 04/17/2023] [Indexed: 06/23/2023] Open
Abstract
BACKGROUND New drug treatments are regularly approved, and it is challenging to remain up-to-date in this rapidly changing environment. Fast and accurate visualization is important to allow a global understanding of the drug market. Automation of this information extraction provides a helpful starting point for the subject matter expert, helps to mitigate human errors, and saves time. OBJECTIVE We aimed to semiautomate disease population extraction from the free text of oncology drug approval descriptions from the BioMedTracker database for 6 selected drug targets. More specifically, we intended to extract (1) line of therapy, (2) stage of cancer of the patient population described in the approval, and (3) the clinical trials that provide evidence for the approval. We aimed to use these results in downstream applications, aiding the searchability of relevant content against related drug project sources. METHODS We fine-tuned a state-of-the-art deep learning model, Bidirectional Encoder Representations from Transformers, for each of the 3 desired outputs. We independently applied rule-based text mining approaches. We compared the performances of deep learning and rule-based approaches and selected the best method, which was then applied to new entries. The results were manually curated by a subject matter expert and then used to train new models. RESULTS The training data set is currently small (433 entries) and will enlarge over time when new approval descriptions become available or if a choice is made to take another drug target into account. The deep learning models achieved 61% and 56% 5-fold cross-validated accuracies for line of therapy and stage of cancer, respectively, which were treated as classification tasks. Trial identification is treated as a named entity recognition task, and the 5-fold cross-validated F1-score is currently 87%. Although the scores of the classification tasks could seem low, the models comprise 5 classes each, and such scores are a marked improvement when compared to random classification. Moreover, we expect improved performance as the input data set grows, since deep learning models need to be trained on a large enough amount of data to be able to learn the task they are taught. The rule-based approach achieved 60% and 74% 5-fold cross-validated accuracies for line of therapy and stage of cancer, respectively. No attempt was made to define a rule-based approach for trial identification. CONCLUSIONS We developed a natural language processing algorithm that is currently assisting subject matter experts in disease population extraction, which supports health authority approvals. This algorithm achieves semiautomation, enabling subject matter experts to leverage the results for deeper analysis and to accelerate information retrieval in a crowded clinical environment such as oncology.
Collapse
|
14
|
Srivathsa AV, Sadashivappa NM, Hegde AK, Radha S, Mahesh AR, Ammunje DN, Sen D, Theivendren P, Govindaraj S, Kunjiappan S, Pavadai P. A Review on Artificial Intelligence Approaches and Rational Approaches in Drug Discovery. Curr Pharm Des 2023; 29:1180-1192. [PMID: 37132148 DOI: 10.2174/1381612829666230428110542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2022] [Revised: 02/06/2023] [Accepted: 02/27/2023] [Indexed: 05/04/2023]
Abstract
Artificial intelligence (AI) speeds up the drug development process and reduces its time, as well as the cost which is of enormous importance in outbreaks such as COVID-19. It uses a set of machine learning algorithms that collects the available data from resources, categorises, processes and develops novel learning methodologies. Virtual screening is a successful application of AI, which is used in screening huge drug-like databases and filtering to a small number of compounds. The brain's thinking of AI is its neural networking which uses techniques such as Convoluted Neural Network (CNN), Recursive Neural Network (RNN) or Generative Adversial Neural Network (GANN). The application ranges from small molecule drug discovery to the development of vaccines. In the present review article, we discussed various techniques of drug design, structure and ligand-based, pharmacokinetics and toxicity prediction using AI. The rapid phase of discovery is the need of the hour and AI is a targeted approach to achieve this.
Collapse
Affiliation(s)
- Anjana Vidya Srivathsa
- Department of Pharmaceutical Chemistry, Faculty of Pharmacy, M.S. Ramaiah University of Applied Sciences, M.S.R. Nagar, Bengaluru, 560054, India
| | - Nandini Markuli Sadashivappa
- Department of Pharmaceutical Chemistry, Faculty of Pharmacy, M.S. Ramaiah University of Applied Sciences, M.S.R. Nagar, Bengaluru, 560054, India
| | - Apeksha Krishnamurthy Hegde
- Department of Pharmaceutical Chemistry, Faculty of Pharmacy, M.S. Ramaiah University of Applied Sciences, M.S.R. Nagar, Bengaluru, 560054, India
| | - Srimathi Radha
- Department of Pharmaceutical Chemistry, SRM College of Pharmacy, Faculty of Medicine and Health Sciences, SRM Institute of Science and Technology, Chengalpattu District, Kattankulathur, Tamil Nadu, 603203, India
| | - Agasa Ramu Mahesh
- Department of Pharmaceutical Chemistry, Faculty of Pharmacy, M.S. Ramaiah University of Applied Sciences, M.S.R. Nagar, Bengaluru, 560054, India
| | - Damodar Nayak Ammunje
- Department of Pharmacology, Faculty of Pharmacy, M.S. Ramaiah University of Applied Sciences, M.S.R. Nagar, Bengaluru, 560054, India
| | - Debanjan Sen
- Department of Pharmaceutical Chemistry, BCDA College of Pharmacy & Technology, Hridaypur, Kolkata, 700127, West Bengal, India
| | - Panneerselvam Theivendren
- Department of Pharmaceutical Chemistry, Swamy Vivekanandha College of Pharmacy, Elayampalayam, Tiruchengode, 637205, India
| | - Saravanan Govindaraj
- Department of Pharmaceutical Chemistry, MNR College of Pharmacy, Fasalwadi, Sangareddy, 502 001, India
| | - Selvaraj Kunjiappan
- Department of Biotechnology, Kalasalingam Academy of Research and Education, Krishnankoil, 626126, India
| | - Parasuraman Pavadai
- Department of Pharmaceutical Chemistry, Faculty of Pharmacy, M.S. Ramaiah University of Applied Sciences, M.S.R. Nagar, Bengaluru, 560054, India
| |
Collapse
|
15
|
Luo L, Wei CH, Lai PT, Leaman R, Chen Q, Lu Z. AIONER: all-in-one scheme-based biomedical named entity recognition using deep learning. Bioinformatics 2023; 39:btad310. [PMID: 37171899 PMCID: PMC10212279 DOI: 10.1093/bioinformatics/btad310] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2022] [Revised: 04/12/2023] [Accepted: 05/11/2023] [Indexed: 05/14/2023] Open
Abstract
MOTIVATION Biomedical named entity recognition (BioNER) seeks to automatically recognize biomedical entities in natural language text, serving as a necessary foundation for downstream text mining tasks and applications such as information extraction and question answering. Manually labeling training data for the BioNER task is costly, however, due to the significant domain expertise required for accurate annotation. The resulting data scarcity causes current BioNER approaches to be prone to overfitting, to suffer from limited generalizability, and to address a single entity type at a time (e.g. gene or disease). RESULTS We therefore propose a novel all-in-one (AIO) scheme that uses external data from existing annotated resources to enhance the accuracy and stability of BioNER models. We further present AIONER, a general-purpose BioNER tool based on cutting-edge deep learning and our AIO schema. We evaluate AIONER on 14 BioNER benchmark tasks and show that AIONER is effective, robust, and compares favorably to other state-of-the-art approaches such as multi-task learning. We further demonstrate the practical utility of AIONER in three independent tasks to recognize entity types not previously seen in training data, as well as the advantages of AIONER over existing methods for processing biomedical text at a large scale (e.g. the entire PubMed data). AVAILABILITY AND IMPLEMENTATION The source code, trained models and data for AIONER are freely available at https://github.com/ncbi/AIONER.
Collapse
Affiliation(s)
- Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Robert Leaman
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Qingyu Chen
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, United States
| |
Collapse
|
16
|
Li M, Yang H, Liu Y. Biomedical named entity recognition based on fusion multi-features embedding. Technol Health Care 2023; 31:111-121. [PMID: 37038786 DOI: 10.3233/thc-] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/12/2023]
Abstract
BACKGROUND With the exponential increase in the volume of biomedical literature, text mining tasks are becoming increasingly important in the medical domain. Named entities are the primary identification tasks in text mining, prerequisites and critical parts for building medical domain knowledge graphs, medical question and answer systems, medical text classification. OBJECTIVE The study goal is to recognize biomedical entities effectively by fusing multi-feature embedding. Multiple features provide more comprehensive information so that better predictions can be obtained. METHODS Firstly, three different kinds of features are generated, including deep contextual word-level features, local char-level features, and part-of-speech features at the word representation layer. The word representation vectors are inputs into BiLSTM as features to obtain the dependency information. Finally, the CRF algorithm is used to learn the features of the state sequences to obtain the global optimal tagging sequences. RESULTS The experimental results showed that the model outperformed other state-of-the-art methods for all-around performance in six datasets among eight of four biomedical entity types. CONCLUSION The proposed method has a positive effect on the prediction results. It comprehensively considers the relevant factors of named entity recognition because the semantic information is enhanced by fusing multi-features embedding.
Collapse
|
17
|
Ramesh Kashyap A, Yang Y, Kan MY. Scientific document processing: challenges for modern learning methods. INTERNATIONAL JOURNAL ON DIGITAL LIBRARIES 2023:1-27. [PMID: 37361127 PMCID: PMC10036973 DOI: 10.1007/s00799-023-00352-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2022] [Revised: 02/20/2023] [Accepted: 02/21/2023] [Indexed: 03/26/2023]
Abstract
Neural network models enjoy success on language tasks related to Web documents, including news and Wikipedia articles. However, the characteristics of scientific publications pose specific challenges that have yet to be satisfactorily addressed: the discourse structure of scientific documents crucial in scholarly document processing (SDP) tasks, the interconnected nature of scientific documents, and their multimodal nature. We survey modern neural network learning methods that tackle these challenges: those that can model discourse structure and their interconnectivity and use their multimodal nature. We also highlight efforts to collect large-scale datasets and tools developed to enable effective deep learning deployment for SDP. We conclude with a discussion on upcoming trends and recommend future directions for pursuing neural natural language processing approaches for SDP.
Collapse
Affiliation(s)
- Abhinav Ramesh Kashyap
- ASUS Intelligent Cloud Services (AICS), Singapore, Singapore
- School of Computing, National University of Singapore, Computing 1, 13 Computing Drive, Singapore, 11741 Singapore
| | - Yajing Yang
- School of Computing, National University of Singapore, Computing 1, 13 Computing Drive, Singapore, 11741 Singapore
| | - Min-Yen Kan
- School of Computing, National University of Singapore, Computing 1, 13 Computing Drive, Singapore, 11741 Singapore
| |
Collapse
|
18
|
On the Use of Knowledge Transfer Techniques for Biomedical Named Entity Recognition. FUTURE INTERNET 2023. [DOI: 10.3390/fi15020079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/19/2023] Open
Abstract
Biomedical named entity recognition (BioNER) is a preliminary task for many other tasks, e.g., relation extraction and semantic search. Extracting the text of interest from biomedical documents becomes more demanding as the availability of online data is increasing. Deep learning models have been adopted for biomedical named entity recognition (BioNER) as deep learning has been found very successful in many other tasks. Nevertheless, the complex structure of biomedical text data is still a challenging aspect for deep learning models. Limited annotated biomedical text data make it more difficult to train deep learning models with millions of trainable parameters. The single-task model, which focuses on learning a specific task, has issues in learning complex feature representations from a limited quantity of annotated data. Moreover, manually constructing annotated data is a time-consuming job. It is, therefore, vital to exploit other efficient ways to train deep learning models on the available annotated data. This work enhances the performance of the BioNER task by taking advantage of various knowledge transfer techniques: multitask learning and transfer learning. This work presents two multitask models (MTMs), which learn shared features and task-specific features by implementing the shared and task-specific layers. In addition, the presented trained MTM is also fine-tuned for each specific dataset to tailor it from a general features representation to a specialized features representation. The presented empirical results and statistical analysis from this work illustrate that the proposed techniques enhance significantly the performance of the corresponding single-task model (STM).
Collapse
|
19
|
Guan Z, Zhou X. A prefix and attention map discrimination fusion guided attention for biomedical named entity recognition. BMC Bioinformatics 2023; 24:42. [PMID: 36755230 PMCID: PMC9907889 DOI: 10.1186/s12859-023-05172-9] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2022] [Accepted: 02/03/2023] [Indexed: 02/10/2023] Open
Abstract
BACKGROUND The biomedical literature is growing rapidly, and it is increasingly important to extract meaningful information from the vast amount of literature. Biomedical named entity recognition (BioNER) is one of the key and fundamental tasks in biomedical text mining. It also acts as a primitive step for many downstream applications such as relation extraction and knowledge base completion. Therefore, the accurate identification of entities in biomedical literature has certain research value. However, this task is challenging due to the insufficiency of sequence labeling and the lack of large-scale labeled training data and domain knowledge. RESULTS In this paper, we use a novel word-pair classification method, design a simple attention mechanism and propose a novel architecture to solve the research difficulties of BioNER more efficiently without leveraging any external knowledge. Specifically, we break down the limitations of sequence labeling-based approaches by predicting the relationship between word pairs. Based on this, we enhance the pre-trained model BioBERT, through the proposed prefix and attention map dscrimination fusion guided attention and propose the E-BioBERT. Our proposed attention differentiates the distribution of different heads in different layers in the BioBERT, which enriches the diversity of self-attention. Our model is superior to state-of-the-art compared models on five available datasets: BC4CHEMD, BC2GM, BC5CDR-Disease, BC5CDR-Chem, and NCBI-Disease, achieving F1-score of 92.55%, 85.45%, 87.53%, 94.16% and 90.55%, respectively. CONCLUSION Compared with many previous various models, our method does not require additional training datasets, external knowledge, and complex training process. The experimental results on five BioNER benchmark datasets demonstrate that our model is better at mining semantic information, alleviating the problem of label inconsistency, and has higher entity recognition ability. More importantly, we analyze and demonstrate the effectiveness of our proposed attention.
Collapse
Affiliation(s)
- Zhengyi Guan
- grid.440773.30000 0000 9342 2456School of Information Science and Engineering, Yunnan University, Kunming, China
| | - Xiaobing Zhou
- School of Information Science and Engineering, Yunnan University, Kunming, China.
| |
Collapse
|
20
|
Raza S, Schwartz B. Entity and relation extraction from clinical case reports of COVID-19: a natural language processing approach. BMC Med Inform Decis Mak 2023; 23:20. [PMID: 36703154 PMCID: PMC9879259 DOI: 10.1186/s12911-023-02117-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/25/2022] [Accepted: 01/20/2023] [Indexed: 01/28/2023] Open
Abstract
BACKGROUND Extracting relevant information about infectious diseases is an essential task. However, a significant obstacle in supporting public health research is the lack of methods for effectively mining large amounts of health data. OBJECTIVE This study aims to use natural language processing (NLP) to extract the key information (clinical factors, social determinants of health) from published cases in the literature. METHODS The proposed framework integrates a data layer for preparing a data cohort from clinical case reports; an NLP layer to find the clinical and demographic-named entities and relations in the texts; and an evaluation layer for benchmarking performance and analysis. The focus of this study is to extract valuable information from COVID-19 case reports. RESULTS The named entity recognition implementation in the NLP layer achieves a performance gain of about 1-3% compared to benchmark methods. Furthermore, even without extensive data labeling, the relation extraction method outperforms benchmark methods in terms of accuracy (by 1-8% better). A thorough examination reveals the disease's presence and symptoms prevalence in patients. CONCLUSIONS A similar approach can be generalized to other infectious diseases. It is worthwhile to use prior knowledge acquired through transfer learning when researching other infectious diseases.
Collapse
Affiliation(s)
- Shaina Raza
- grid.415400.40000 0001 1505 2354Public Health Ontario (PHO), Toronto, ON Canada ,grid.17063.330000 0001 2157 2938Dalla Lana School of Public Health, University of Toronto, Toronto, ON Canada
| | - Brian Schwartz
- grid.415400.40000 0001 1505 2354Public Health Ontario (PHO), Toronto, ON Canada ,grid.17063.330000 0001 2157 2938Dalla Lana School of Public Health, University of Toronto, Toronto, ON Canada
| |
Collapse
|
21
|
Li Y, Wehbe RM, Ahmad FS, Wang H, Luo Y. A comparative study of pretrained language models for long clinical text. J Am Med Inform Assoc 2023; 30:340-347. [PMID: 36451266 PMCID: PMC9846675 DOI: 10.1093/jamia/ocac225] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2022] [Revised: 11/06/2022] [Accepted: 11/14/2022] [Indexed: 12/03/2022] Open
Abstract
OBJECTIVE Clinical knowledge-enriched transformer models (eg, ClinicalBERT) have state-of-the-art results on clinical natural language processing (NLP) tasks. One of the core limitations of these transformer models is the substantial memory consumption due to their full self-attention mechanism, which leads to the performance degradation in long clinical texts. To overcome this, we propose to leverage long-sequence transformer models (eg, Longformer and BigBird), which extend the maximum input sequence length from 512 to 4096, to enhance the ability to model long-term dependencies in long clinical texts. MATERIALS AND METHODS Inspired by the success of long-sequence transformer models and the fact that clinical notes are mostly long, we introduce 2 domain-enriched language models, Clinical-Longformer and Clinical-BigBird, which are pretrained on a large-scale clinical corpus. We evaluate both language models using 10 baseline tasks including named entity recognition, question answering, natural language inference, and document classification tasks. RESULTS The results demonstrate that Clinical-Longformer and Clinical-BigBird consistently and significantly outperform ClinicalBERT and other short-sequence transformers in all 10 downstream tasks and achieve new state-of-the-art results. DISCUSSION Our pretrained language models provide the bedrock for clinical NLP using long texts. We have made our source code available at https://github.com/luoyuanlab/Clinical-Longformer, and the pretrained models available for public download at: https://huggingface.co/yikuan8/Clinical-Longformer. CONCLUSION This study demonstrates that clinical knowledge-enriched long-sequence transformers are able to learn long-term dependencies in long clinical text. Our methods can also inspire the development of other domain-enriched long-sequence transformers.
Collapse
Affiliation(s)
- Yikuan Li
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
| | - Ramsey M Wehbe
- Division of Cardiology, Department of Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, Illinois, USA
| | - Faraz S Ahmad
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
- Division of Cardiology, Department of Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, Illinois, USA
| | - Hanyin Wang
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
| | - Yuan Luo
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
| |
Collapse
|
22
|
PharmKE: Knowledge Extraction Platform for Pharmaceutical Texts Using Transfer Learning. COMPUTERS 2023. [DOI: 10.3390/computers12010017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/11/2023]
Abstract
Even though named entity recognition (NER) has seen tremendous development in recent years, some domain-specific use-cases still require tagging of unique entities, which is not well handled by pre-trained models. Solutions based on enhancing pre-trained models or creating new ones are efficient, but creating reliable labeled training for them to learn on is still challenging. In this paper, we introduce PharmKE, a text analysis platform tailored to the pharmaceutical industry that uses deep learning at several stages to perform an in-depth semantic analysis of relevant publications. The proposed methodology is used to produce reliably labeled datasets leveraging cutting-edge transfer learning, which are later used to train models for specific entity labeling tasks. By building models for the well-known text-processing libraries spaCy and AllenNLP, this technique is used to find Pharmaceutical Organizations and Drugs in texts from the pharmaceutical domain. The PharmKE platform also incorporates the NER findings to resolve co-references of entities and examine the semantic linkages in each phrase, creating a foundation for further text analysis tasks, such as fact extraction and question answering. Additionally, the knowledge graph created by DBpedia Spotlight for a specific pharmaceutical text is expanded using the identified entities. The obtained results with the proposed methodology result in about a 96% F1-score on the NER tasks, which is up to 2% better than those of the fine-tuned BERT and BioBERT models developed using the same dataset. The ultimate benefits of the platform are that pharmaceutical domain specialists may more easily identify the knowledge extracted from the input texts thanks to the platform’s visualization of the model findings. Likewise, the proposed techniques can be integrated into mobile and pervasive systems to give patients more relevant and comprehensive information from scanned medication guides. Similarly, it can provide preliminary insights to patients and even medical personnel on whether a drug from a different vendor is compatible with the patient’s prescription medication.
Collapse
|
23
|
Chai Z, Jin H, Shi S, Zhan S, Zhuo L, Yang Y, Lian Q. Noise Reduction Learning Based on XLNet-CRF for Biomedical Named Entity Recognition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:595-605. [PMID: 35259113 DOI: 10.1109/tcbb.2022.3157630] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
In recent years, Biomedical Named Entity Recognition (BioNER) systems have mainly been based on deep neural networks, which are used to extract information from the rapidly expanding biomedical literature. Long-distance context autoencoding language models based on transformers have recently been employed for BioNER with great success. However, noise interference exists in the process of pre-training and fine-tuning, and there is no effective decoder for label dependency. Current models have many aspects in need of improvement for better performance. We propose two kinds of noise reduction models, Shared Labels and Dynamic Splicing, based on XLNet encoding which is a permutation language pre-training model and decoding by Conditional Random Field (CRF). By testing 15 biomedical named entity recognition datasets, the two models improved the average F1-score by 1.504 and 1.48, respectively, and state-of-the-art performance was achieved on 7 of them. Further analysis proves the effectiveness of the two models and the improvement of the recognition effect of CRF, and suggests the applicable scope of the models according to different data characteristics.
Collapse
|
24
|
Li M, Yang H, Liu Y. Biomedical named entity recognition based on fusion multi-features embedding. Technol Health Care 2023; 31:111-121. [PMID: 37038786 DOI: 10.3233/thc-236011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
BACKGROUND With the exponential increase in the volume of biomedical literature, text mining tasks are becoming increasingly important in the medical domain. Named entities are the primary identification tasks in text mining, prerequisites and critical parts for building medical domain knowledge graphs, medical question and answer systems, medical text classification. OBJECTIVE The study goal is to recognize biomedical entities effectively by fusing multi-feature embedding. Multiple features provide more comprehensive information so that better predictions can be obtained. METHODS Firstly, three different kinds of features are generated, including deep contextual word-level features, local char-level features, and part-of-speech features at the word representation layer. The word representation vectors are inputs into BiLSTM as features to obtain the dependency information. Finally, the CRF algorithm is used to learn the features of the state sequences to obtain the global optimal tagging sequences. RESULTS The experimental results showed that the model outperformed other state-of-the-art methods for all-around performance in six datasets among eight of four biomedical entity types. CONCLUSION The proposed method has a positive effect on the prediction results. It comprehensively considers the relevant factors of named entity recognition because the semantic information is enhanced by fusing multi-features embedding.
Collapse
|
25
|
Kumar A, Sharaff A. ABEE: automated bio entity extraction from biomedical text documents. DATA TECHNOLOGIES AND APPLICATIONS 2022. [DOI: 10.1108/dta-04-2022-0151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/26/2023]
Abstract
PurposeThe purpose of this study was to design a multitask learning model so that biomedical entities can be extracted without having any ambiguity from biomedical texts.Design/methodology/approachIn the proposed automated bio entity extraction (ABEE) model, a multitask learning model has been introduced with the combination of single-task learning models. Our model used Bidirectional Encoder Representations from Transformers to train the single-task learning model. Then combined model's outputs so that we can find the verity of entities from biomedical text.FindingsThe proposed ABEE model targeted unique gene/protein, chemical and disease entities from the biomedical text. The finding is more important in terms of biomedical research like drug finding and clinical trials. This research aids not only to reduce the effort of the researcher but also to reduce the cost of new drug discoveries and new treatments.Research limitations/implicationsAs such, there are no limitations with the model, but the research team plans to test the model with gigabyte of data and establish a knowledge graph so that researchers can easily estimate the entities of similar groups.Practical implicationsAs far as the practical implication concerned, the ABEE model will be helpful in various natural language processing task as in information extraction (IE), it plays an important role in the biomedical named entity recognition and biomedical relation extraction and also in the information retrieval task like literature-based knowledge discovery.Social implicationsDuring the COVID-19 pandemic, the demands for this type of our work increased because of the increase in the clinical trials at that time. If this type of research has been introduced previously, then it would have reduced the time and effort for new drug discoveries in this area.Originality/valueIn this work we proposed a novel multitask learning model that is capable to extract biomedical entities from the biomedical text without any ambiguity. The proposed model achieved state-of-the-art performance in terms of precision, recall and F1 score.
Collapse
|
26
|
Bashir SR, Raza S, Kocaman V, Qamar U. Clinical Application of Detecting COVID-19 Risks: A Natural Language Processing Approach. Viruses 2022; 14:2761. [PMID: 36560764 PMCID: PMC9781729 DOI: 10.3390/v14122761] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2022] [Accepted: 12/08/2022] [Indexed: 12/14/2022] Open
Abstract
The clinical application of detecting COVID-19 factors is a challenging task. The existing named entity recognition models are usually trained on a limited set of named entities. Besides clinical, the non-clinical factors, such as social determinant of health (SDoH), are also important to study the infectious disease. In this paper, we propose a generalizable machine learning approach that improves on previous efforts by recognizing a large number of clinical risk factors and SDoH. The novelty of the proposed method lies in the subtle combination of a number of deep neural networks, including the BiLSTM-CNN-CRF method and a transformer-based embedding layer. Experimental results on a cohort of COVID-19 data prepared from PubMed articles show the superiority of the proposed approach. When compared to other methods, the proposed approach achieves a performance gain of about 1-5% in terms of macro- and micro-average F1 scores. Clinical practitioners and researchers can use this approach to obtain accurate information regarding clinical risks and SDoH factors, and use this pipeline as a tool to end the pandemic or to prepare for future pandemics.
Collapse
Affiliation(s)
- Syed Raza Bashir
- Department of Computer Science, Toronto Metropolitan University, Toronto, ON M5B 2K3, Canada
| | - Shaina Raza
- Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
| | | | - Urooj Qamar
- Institute of Business & Information Technology, University of the Punjab, Lahore 54590, Pakistan
| |
Collapse
|
27
|
Zheng X, Du H, Luo X, Tong F, Song W, Zhao D. BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework. BMC Bioinformatics 2022; 23:501. [PMID: 36418937 PMCID: PMC9682683 DOI: 10.1186/s12859-022-05051-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2022] [Accepted: 11/10/2022] [Indexed: 11/24/2022] Open
Abstract
BACKGROUND Automatic and accurate recognition of various biomedical named entities from literature is an important task of biomedical text mining, which is the foundation of extracting biomedical knowledge from unstructured texts into structured formats. Using the sequence labeling framework and deep neural networks to implement biomedical named entity recognition (BioNER) is a common method at present. However, the above method often underutilizes syntactic features such as dependencies and topology of sentences. Therefore, it is an urgent problem to be solved to integrate semantic and syntactic features into the BioNER model. RESULTS In this paper, we propose a novel biomedical named entity recognition model, named BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax), which uses a graph to model the dependencies and topology of a sentence and formulate the BioNER task as a node classification problem. This formulation can introduce more topological features of language and no longer be only concerned about the distance between words in the sequence. First, we use periods to segment sentences and spaces and symbols to segment words. Second, contextual features are encoded by BioBERT, and syntactic features such as part of speeches, dependencies and topology are preprocessed by SpaCy respectively. A graph attention network is then used to generate a fusing representation considering both the contextual features and syntactic features. Last, a softmax function is used to calculate the probabilities and get the results. We conduct experiments on 8 benchmark datasets, and our proposed model outperforms existing BioNER state-of-the-art methods on the BC2GM, JNLPBA, BC4CHEMD, BC5CDR-chem, BC5CDR-disease, NCBI-disease, Species-800, and LINNAEUS datasets, and achieves F1-scores of 85.15%, 78.16%, 92.97%, 94.74%, 87.74%, 91.57%, 75.01%, 90.99%, respectively. CONCLUSION The experimental results on 8 biomedical benchmark datasets demonstrate the effectiveness of our model, and indicate that formulating the BioNER task into a node classification problem and combining syntactic features into the graph attention networks can significantly improve model performance.
Collapse
Affiliation(s)
- Xiangwen Zheng
- Academy of Military Medical Sciences, Beijing, 100039, China
| | - Haijian Du
- Academy of Military Medical Sciences, Beijing, 100039, China
| | - Xiaowei Luo
- Academy of Military Medical Sciences, Beijing, 100039, China
| | - Fan Tong
- Academy of Military Medical Sciences, Beijing, 100039, China
| | - Wei Song
- Beijing MedPeer Information Technology Co., Ltd, Beijing, 102300, China
| | - Dongsheng Zhao
- Academy of Military Medical Sciences, Beijing, 100039, China.
| |
Collapse
|
28
|
Zhang Z, Chen ALP. Biomedical named entity recognition with the combined feature attention and fully-shared multi-task learning. BMC Bioinformatics 2022; 23:458. [PMID: 36329384 PMCID: PMC9632084 DOI: 10.1186/s12859-022-04994-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2022] [Accepted: 10/19/2022] [Indexed: 11/06/2022] Open
Abstract
Background Biomedical named entity recognition (BioNER) is a basic and important task for biomedical text mining with the purpose of automatically recognizing and classifying biomedical entities. The performance of BioNER systems directly impacts downstream applications. Recently, deep neural networks, especially pre-trained language models, have made great progress for BioNER. However, because of the lack of high-quality and large-scale annotated data and relevant external knowledge, the capability of the BioNER system remains limited. Results In this paper, we propose a novel fully-shared multi-task learning model based on the pre-trained language model in biomedical domain, namely BioBERT, with a new attention module to integrate the auto-processed syntactic information for the BioNER task. We have conducted numerous experiments on seven benchmark BioNER datasets. The proposed best multi-task model obtains F1 score improvements of 1.03% on BC2GM, 0.91% on NCBI-disease, 0.81% on Linnaeus, 1.26% on JNLPBA, 0.82% on BC5CDR-Chemical, 0.87% on BC5CDR-Disease, and 1.10% on Species-800 compared to the single-task BioBERT model. Conclusion The results demonstrate our model outperforms previous studies on all datasets. Further analysis and case studies are also provided to prove the importance of the proposed attention module and fully-shared multi-task learning method used in our model.
Collapse
Affiliation(s)
- Zhiyu Zhang
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
| | - Arbee L P Chen
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan. .,Department of Computer Science and Information Engineering, Asia University, Taichung, Taiwan.
| |
Collapse
|
29
|
Sung M, Jeong M, Choi Y, Kim D, Lee J, Kang J. BERN2: an advanced neural biomedical named entity recognition and normalization tool. Bioinformatics 2022; 38:4837-4839. [PMID: 36053172 PMCID: PMC9563680 DOI: 10.1093/bioinformatics/btac598] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2022] [Revised: 07/09/2022] [Accepted: 08/31/2022] [Indexed: 11/14/2022] Open
Abstract
In biomedical natural language processing, named entity recognition (NER) and named entity normalization (NEN) are key tasks that enable the automatic extraction of biomedical entities (e.g. diseases and drugs) from the ever-growing biomedical literature. In this article, we present BERN2 (Advanced Biomedical Entity Recognition and Normalization), a tool that improves the previous neural network-based NER tool by employing a multi-task NER model and neural network-based NEN models to achieve much faster and more accurate inference. We hope that our tool can help annotate large-scale biomedical texts for various tasks such as biomedical knowledge graph construction. Availability and implementation Web service of BERN2 is publicly available at http://bern2.korea.ac.kr. We also provide local installation of BERN2 at https://github.com/dmis-lab/BERN2. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Mujeen Sung
- Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea
| | - Minbyul Jeong
- Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea
| | - Yonghwa Choi
- Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea
| | - Donghyeon Kim
- AIRS Company, Hyundai Motor Group, Seoul, 06620, Republic of Korea
| | - Jinhyuk Lee
- Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea
| | - Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul, 02841, Republic of Korea.,AIGEN Sciences, Seoul, 04778, Republic of Korea
| |
Collapse
|
30
|
Research on Named Entity Recognition Based on Multi-Task Learning and Biaffine Mechanism. COMPUTATIONAL INTELLIGENCE AND NEUROSCIENCE 2022; 2022:2687615. [PMID: 36059424 PMCID: PMC9436550 DOI: 10.1155/2022/2687615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 05/05/2022] [Revised: 07/17/2022] [Accepted: 08/08/2022] [Indexed: 12/04/2022]
Abstract
Commonly used nested entity recognition methods are span-based entity recognition methods, which focus on learning the head and tail representations of entities. This method lacks obvious boundary supervision, which leads to the failure of the correct candidate entities to be predicted, resulting in the problem of high precision and low recall. To solve the above problems, this paper proposes a named entity recognition method based on multi-task learning and biaffine mechanism, introduces the idea of multi-task learning, and divides the task into two subtasks, entity span classification and boundary detection. The entity span classification task uses biaffine mechanism to score the resulting spans and select the most likely entity class. The boundary detection task mainly solves the problem of low recall caused by the lack of boundary supervision in span classification. It captures the relationship between adjacent words in the input text according to the context, indicates the boundary range of entities, and enhances the span representation through additional boundary supervision. The experimental results show that the named entity recognition method based on multi-task learning and biaffine mechanism can improve the F1 value by up to 7.05%, 12.63%, and 14.68% on the GENIA, ACE2004, and ACE2005 nested datasets compared with other methods, which verifies that this method has better performance on the nested entity recognition task.
Collapse
|
31
|
Saeed N, Naveed H. Medical terminology-based computing system: a lightweight post-processing solution for out-of-vocabulary multi-word terms. Front Mol Biosci 2022; 9:928530. [PMID: 36032678 PMCID: PMC9411640 DOI: 10.3389/fmolb.2022.928530] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/25/2022] [Accepted: 07/04/2022] [Indexed: 11/24/2022] Open
Abstract
The linguistic rules of medical terminology assist in gaining acquaintance with rare/complex clinical and biomedical terms. The medical language follows a Greek and Latin-inspired nomenclature. This nomenclature aids the stakeholders in simplifying the medical terms and gaining semantic familiarity. However, natural language processing models misrepresent rare and complex biomedical words. In this study, we present MedTCS—a lightweight, post-processing module—to simplify hybridized or compound terms into regular words using medical nomenclature. MedTCS enabled the word-based embedding models to achieve 100% coverage and enabled the BiowordVec model to achieve high correlation scores (0.641 and 0.603 in UMNSRS similarity and relatedness datasets, respectively) that significantly surpass the n-gram and sub-word approaches of FastText and BERT. In the downstream task of named entity recognition (NER), MedTCS enabled the latest clinical embedding model of FastText-OA-All-300d to improve the F1-score from 0.45 to 0.80 on the BC5CDR corpus and from 0.59 to 0.81 on the NCBI-Disease corpus, respectively. Similarly, in the drug indication classification task, our model was able to increase the coverage by 9% and the F1-score by 1%. Our results indicate that incorporating a medical terminology-based module provides distinctive contextual clues to enhance vocabulary as a post-processing step on pre-trained embeddings. We demonstrate that the proposed module enables the word embedding models to generate vectors of out-of-vocabulary words effectively. We expect that our study can be a stepping stone for the use of biomedical knowledge-driven resources in NLP.
Collapse
|
32
|
Tong Y, Zhuang F, Zhang H, Fang C, Zhao Y, Wang D, Zhu H, Ni B. Improving biomedical named entity recognition by dynamic caching inter-sentence information. Bioinformatics 2022; 38:3976-3983. [PMID: 35758612 DOI: 10.1093/bioinformatics/btac422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2022] [Revised: 06/03/2022] [Accepted: 06/24/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION Biomedical Named Entity Recognition (BioNER) aims to identify biomedical domain-specific entities (e.g. gene, chemical and disease) from unstructured texts. Despite deep learning-based methods for BioNER achieving satisfactory results, there is still much room for improvement. Firstly, most existing methods use independent sentences as training units and ignore inter-sentence context, which usually leads to the labeling inconsistency problem. Secondly, previous document-level BioNER works have approved that the inter-sentence information is essential, but what information should be regarded as context remains ambiguous. Moreover, there are still few pre-training-based BioNER models that have introduced inter-sentence information. Hence, we propose a cache-based inter-sentence model called BioNER-Cache to alleviate the aforementioned problems. RESULTS We propose a simple but effective dynamic caching module to capture inter-sentence information for BioNER. Specifically, the cache stores recent hidden representations constrained by predefined caching rules. And the model uses a query-and-read mechanism to retrieve similar historical records from the cache as the local context. Then, an attention-based gated network is adopted to generate context-related features with BioBERT. To dynamically update the cache, we design a scoring function and implement a multi-task approach to jointly train our model. We build a comprehensive benchmark on four biomedical datasets to evaluate the model performance fairly. Finally, extensive experiments clearly validate the superiority of our proposed BioNER-Cache compared with various state-of-the-art intra-sentence and inter-sentence baselines. AVAILABILITYAND IMPLEMENTATION Code will be available at https://github.com/zgzjdx/BioNER-Cache. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yiqi Tong
- Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
| | - Fuzhen Zhuang
- Institute of Artificial Intelligence, Beihang University, Beijing 100191, China.,SKLSDE, School of Computer Science, Beihang University, Beijing 100191, China
| | - Huajie Zhang
- Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
| | - Chuyu Fang
- Institute of Artificial Intelligence, Beihang University, Beijing 100191, China
| | - Yu Zhao
- School of Computing and Artificial Intelligence, Southwestern University of Finance and Economics, Chengdu 611130, China
| | - Deqing Wang
- SKLSDE, School of Computer Science, Beihang University, Beijing 100191, China
| | | | - Bin Ni
- Xiamen Data Intelligence Academy of ICT, CAS, Xiamen 361021, China
| |
Collapse
|
33
|
Guo S, Yang W, Han L, Song X, Wang G. A multi-layer soft lattice based model for Chinese clinical named entity recognition. BMC Med Inform Decis Mak 2022; 22:201. [PMID: 35908055 PMCID: PMC9338545 DOI: 10.1186/s12911-022-01924-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2022] [Accepted: 07/04/2022] [Indexed: 11/13/2022] Open
Abstract
Objective Named entity recognition (NER) is a key and fundamental part of many medical and clinical tasks, including the establishment of a medical knowledge graph, decision-making support, and question answering systems. When extracting entities from electronic health records (EHRs), NER models mostly apply long short-term memory (LSTM) and have surprising performance in clinical NER. However, increasing the depth of the network is often required by these LSTM-based models to capture long-distance dependencies. Therefore, these LSTM-based models that have achieved high accuracy generally require long training times and extensive training data, which has obstructed the adoption of LSTM-based models in clinical scenarios with limited training time. Method Inspired by Transformer, we combine Transformer with Soft Term Position Lattice to form soft lattice structure Transformer, which models long-distance dependencies similarly to LSTM. Our model consists of four components: the WordPiece module, the BERT module, the soft lattice structure Transformer module, and the CRF module. Result Our experiments demonstrated that this approach increased the F1 by 1–5% in the CCKS NER task compared to other models based on LSTM with CRF and consumed less training time. Additional evaluations showed that lattice structure transformer shows good performance for recognizing long medical terms, abbreviations, and numbers. The proposed model achieve 91.6% f-measure in recognizing long medical terms and 90.36% f-measure in abbreviations, and numbers. Conclusions By using soft lattice structure Transformer, the method proposed in this paper captured Chinese words to lattice information, making our model suitable for Chinese clinical medical records. Transformers with Mutilayer soft lattice Chinese word construction can capture potential interactions between Chinese characters and words.
Collapse
Affiliation(s)
- Shuli Guo
- State Key Laboratory of Intelligent Control and Decision of Complex Systems, School of Automation, Beijing Institute of Technology, Beijing, China
| | - Wentao Yang
- State Key Laboratory of Intelligent Control and Decision of Complex Systems, School of Automation, Beijing Institute of Technology, Beijing, China
| | - Lina Han
- Department of Cardiology, The Second Medical Center, National Clinical Research Center for Geriatric Diseases, Chinese PLA General Hospital, Beijing, China.
| | - Xiaowei Song
- State Key Laboratory of Intelligent Control and Decision of Complex Systems, School of Automation, Beijing Institute of Technology, Beijing, China
| | - Guowei Wang
- State Key Laboratory of Intelligent Control and Decision of Complex Systems, School of Automation, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
34
|
Abstract
In today’s dynamic complex cyber environments, Cyber Threat Intelligence (CTI) and the risk of cyberattacks are both increasing. This means that organizations need to have a strong understanding of both their internal CTI and their external CTI. The potential for cybersecurity knowledge graphs is evident in their ability to aggregate and represent knowledge about cyber threats, as well as their ability to manage and reason with that knowledge. While most existing research has focused on how to create a full knowledge graph, how to utilize the knowledge graph to tackle real-world industrial difficulties in cyberattack and defense situations is still unclear. In this article, we give a quick overview of the cybersecurity knowledge graph’s core concepts, schema, and building methodologies. We also give a relevant dataset review and open-source frameworks on the information extraction and knowledge creation job to aid future studies on cybersecurity knowledge graphs. We perform a comparative assessment of the many works that expound on the recent advances in the application scenarios of cybersecurity knowledge graph in the majority of this paper. In addition, a new comprehensive classification system is developed to define the linked works from 9 core categories and 18 subcategories. Finally, based on the analyses of existing research issues, we have a detailed overview of various possible research directions.
Collapse
|
35
|
|
36
|
A semantic and syntactic enhanced neural model for financial sentiment analysis. Inf Process Manag 2022. [DOI: 10.1016/j.ipm.2022.102943] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]
|
37
|
Zhou B, Chen J, Cai Q, Xue Y, Yang C, He J. Cross‐domain sequence labelling using language modelling and parameter generating. CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY 2022. [DOI: 10.1049/cit2.12107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022] Open
Affiliation(s)
- Bo Zhou
- School of Electronics and Information Engineering South China Normal University Foshan China
| | - Jianying Chen
- School of Electronics and Information Engineering South China Normal University Foshan China
| | - Qianhua Cai
- School of Electronics and Information Engineering South China Normal University Foshan China
| | - Yun Xue
- School of Electronics and Information Engineering South China Normal University Foshan China
| | - Chi Yang
- School of Electronics and Information Engineering South China Normal University Foshan China
| | - Jing He
- Department of Neroscience University of Oxford Oxford Oxfordshire Britain
| |
Collapse
|
38
|
Iterative Annotation of Biomedical NER Corpora with Deep Neural Networks and Knowledge Bases. APPLIED SCIENCES-BASEL 2022. [DOI: 10.3390/app12125775] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
The large availability of clinical natural language documents, such as clinical narratives or diagnoses, requires the definition of smart automatic systems for their processing and analysis, but the lack of annotated corpora in the biomedical domain, especially in languages different from English, makes it difficult to exploit the state-of-art machine-learning systems to extract information from such kinds of documents. For these reasons, healthcare professionals lose big opportunities that can arise from the analysis of this data. In this paper, we propose a methodology to reduce the manual efforts needed to annotate a biomedical named entity recognition (B-NER) corpus, exploiting both active learning and distant supervision, respectively based on deep learning models (e.g., Bi-LSTM, word2vec FastText, ELMo and BERT) and biomedical knowledge bases, in order to speed up the annotation task and limit class imbalance issues. We assessed this approach by creating an Italian-language electronic health record corpus annotated with biomedical domain entities in a small fraction of the time required for a fully manual annotation. The obtained corpus was used to train a B-NER deep neural network whose performances are comparable with the state of the art, with an F1-Score equal to 0.9661 and 0.8875 on two test sets.
Collapse
|
39
|
Deng B, Zhu W, Sun X, Xie Y, Dan W, Zhan Y, Xia Y, Liang X, Li J, Shi Q, Jiang L. Development and Validation of an Automatic System for Intracerebral Hemorrhage Medical Text Recognition and Treatment Plan Output. Front Aging Neurosci 2022; 14:798132. [PMID: 35462698 PMCID: PMC9028758 DOI: 10.3389/fnagi.2022.798132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2021] [Accepted: 02/25/2022] [Indexed: 11/30/2022] Open
Abstract
The main purpose of the study was to explore a reliable way to automatically handle emergency cases, such as intracerebral hemorrhage (ICH). Therefore, an artificial intelligence (AI) system, named, H-system, was designed to automatically recognize medical text data of ICH patients and output the treatment plan. Furthermore, the efficiency and reliability of the H-system were tested and analyzed. The H-system, which is mainly based on a pretrained language model Bidirectional Encoder Representations from Transformers (BERT) and an expert module for logical judgment of extracted entities, was designed and founded by the neurosurgeon and AI experts together. All emergency medical text data were from the neurosurgery emergency electronic medical record database (N-eEMRD) of the First Affiliated Hospital of Chongqing Medical University, Chongqing Emergency Medical Center, and Chongqing First People’s Hospital, and the treatment plans of these ICH cases were divided into two types. A total of 1,000 simulated ICH cases were randomly selected as training and validation sets. After training and validating on simulated cases, real cases from three medical centers were provided to test the efficiency of the H-system. Doctors with 1 and 5 years of working experience in neurosurgery (Doctor-1Y and Doctor-5Y) were included to compare with H-system. Furthermore, the data of the H-system, for instance, sensitivity, specificity, accuracy, positive predictive value (PPV), negative predictive value (NPV), and the area under the receiver operating characteristics curve (AUC), were calculated and compared with Doctor-1Y and Doctor-5Y. In the testing set, the time H-system spent on ICH cases was significantly shorter than that of doctors with Doctor-1Y and Doctor-5Y. In the testing set, the accuracy of the H-system’s treatment plan was 88.55 (88.16–88.94)%, the specificity was 85.71 (84.99–86.43)%, and the sensitivity was 91.83 (91.01–92.65)%. The AUC value of the H-system in the testing set was 0.887 (0.884–0.891). Furthermore, the time H-system spent on ICH cases was significantly shorter than that of doctors with Doctor-1Y and Doctor-5Y. The accuracy and AUC of the H-system were significantly higher than that of Doctor-1Y. In addition, the accuracy of the H-system was more closed to that of Doctor-5Y. The H-system designed in the study can automatically recognize and analyze medical text data of patients with ICH and rapidly output accurate treatment plans with high efficiency. It may provide a reliable and novel way to automatically and rapidly handle emergency cases, such as ICH.
Collapse
Affiliation(s)
- Bo Deng
- Department of Neurosurgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
| | - Wenwen Zhu
- School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing, China
| | - Xiaochuan Sun
- Department of Neurosurgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
| | - Yanfeng Xie
- Department of Neurosurgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
| | - Wei Dan
- Department of Neurosurgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
| | - Yan Zhan
- Department of Neurosurgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
| | - Yulong Xia
- Department of Neurosurgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
| | - Xinyi Liang
- Department of Neurosurgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
| | - Jie Li
- School of Intelligent Technology and Engineering, Chongqing University of Science and Technology, Chongqing, China
- Jie Li,
| | - Quanhong Shi
- Department of Neurosurgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
- Quanhong Shi,
| | - Li Jiang
- Department of Neurosurgery, The First Affiliated Hospital of Chongqing Medical University, Chongqing, China
- *Correspondence: Li Jiang,
| |
Collapse
|
40
|
Rodriguez NE, Nguyen M, McInnes BT. Effects of Data and Entity Ablation on Multitask Learning Models for Biomedical Entity Recognition. J Biomed Inform 2022; 130:104062. [DOI: 10.1016/j.jbi.2022.104062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2021] [Revised: 02/11/2022] [Accepted: 03/27/2022] [Indexed: 11/24/2022]
|
41
|
Tan K, Huang W, Liu X, Hu J, Dong S. A multi-modal fusion framework based on multi-task correlation learning for cancer prognosis prediction. Artif Intell Med 2022; 126:102260. [DOI: 10.1016/j.artmed.2022.102260] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2021] [Revised: 01/07/2022] [Accepted: 02/16/2022] [Indexed: 12/30/2022]
|
42
|
Phenonizer: A Fine-Grained Phenotypic Named Entity Recognizer for Chinese Clinical Texts. BIOMED RESEARCH INTERNATIONAL 2022; 2022:3524090. [PMID: 35342762 PMCID: PMC8941495 DOI: 10.1155/2022/3524090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/16/2021] [Accepted: 02/09/2022] [Indexed: 11/25/2022]
Abstract
Biomedical named entity recognition (BioNER) from clinical texts is a fundamental task for clinical data analysis due to the availability of large volume of electronic medical record data, which are mostly in free text format, in real-world clinical settings. Clinical text data incorporates significant phenotypic medical entities (e.g., symptoms, diseases, and laboratory indexes), which could be used for profiling the clinical characteristics of patients in specific disease conditions (e.g., Coronavirus Disease 2019 (COVID-19)). However, general BioNER approaches mostly rely on coarse-grained annotations of phenotypic entities in benchmark text dataset. Owing to the numerous negation expressions of phenotypic entities (e.g., “no fever,” “no cough,” and “no hypertension”) in clinical texts, this could not feed the subsequent data analysis process with well-prepared structured clinical data. In this paper, we developed Human-machine Cooperative Phenotypic Spectrum Annotation System (http://www.tcmai.org/login, HCPSAS) and constructed a fine-grained Chinese clinical corpus. Thereafter, we proposed a phenotypic named entity recognizer: Phenonizer, which utilized BERT to capture character-level global contextual representation, extracted local contextual features combined with bidirectional long short-term memory, and finally obtained the optimal label sequences through conditional random field. The results on COVID-19 dataset show that Phenonizer outperforms those methods based on Word2Vec with an F1-score of 0.896. By comparing character embeddings from different data, it is found that character embeddings trained by clinical corpora can improve F-score by 0.0103. In addition, we evaluated Phenonizer on two kinds of granular datasets and proved that fine-grained dataset can boost methods' F1-score slightly by about 0.005. Furthermore, the fine-grained dataset enables methods to distinguish between negated symptoms and presented symptoms. Finally, we tested the generalization performance of Phenonizer, achieving a superior F1-score of 0.8389. In summary, together with fine-grained annotated benchmark dataset, Phenonizer proposes a feasible approach to effectively extract symptom information from Chinese clinical texts with acceptable performance.
Collapse
|
43
|
Comparison of Text Mining Models for Food and Dietary Constituent Named-Entity Recognition. MACHINE LEARNING AND KNOWLEDGE EXTRACTION 2022. [DOI: 10.3390/make4010012] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/16/2022]
Abstract
Biomedical Named-Entity Recognition (BioNER) has become an essential part of text mining due to the continuously increasing digital archives of biological and medical articles. While there are many well-performing BioNER tools for entities such as genes, proteins, diseases or species, there is very little research into food and dietary constituent named-entity recognition. For this reason, in this paper, we study seven BioNER models for food and dietary constituents recognition. Specifically, we study a dictionary-based model, a conditional random fields (CRF) model and a new hybrid model, called FooDCoNER (Food and Dietary Constituents Named-Entity Recognition), which we introduce combining the former two models. In addition, we study deep language models including BERT, BioBERT, RoBERTa and ELECTRA. As a result, we find that FooDCoNER does not only lead to the overall best results, comparable with the deep language models, but FooDCoNER is also much more efficient with respect to run time and sample size requirements of the training data. The latter has been identified via the study of learning curves. Overall, our results not only provide a new tool for food and dietary constituent NER but also shed light on the difference between classical machine learning models and recent deep language models.
Collapse
|
44
|
Kim H, Kang J. How Do Your Biomedical Named Entity Recognition Models Generalize to Novel Entities? IEEE ACCESS : PRACTICAL INNOVATIONS, OPEN SOLUTIONS 2022; 10:31513-31523. [PMID: 35582496 PMCID: PMC9014470 DOI: 10.1109/access.2022.3157854] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 02/11/2022] [Accepted: 03/03/2022] [Indexed: 06/15/2023]
Abstract
The number of biomedical literature on new biomedical concepts is rapidly increasing, which necessitates a reliable biomedical named entity recognition (BioNER) model for identifying new and unseen entity mentions. However, it is questionable whether existing models can effectively handle them. In this work, we systematically analyze the three types of recognition abilities of BioNER models: memorization, synonym generalization, and concept generalization. We find that although current best models achieve state-of-the-art performance on benchmarks based on overall performance, they have limitations in identifying synonyms and new biomedical concepts, indicating they are overestimated in terms of their generalization abilities. We also investigate failure cases of models and identify several difficulties in recognizing unseen mentions in biomedical literature as follows: (1) models tend to exploit dataset biases, which hinders the models' abilities to generalize, and (2) several biomedical names have novel morphological patterns with weak name regularity, and models fail to recognize them. We apply a statistics-based debiasing method to our problem as a simple remedy and show the improvement in generalization to unseen mentions. We hope that our analyses and findings would be able to facilitate further research into the generalization capabilities of NER models in a domain where their reliability is of utmost importance.
Collapse
Affiliation(s)
- Hyunjae Kim
- Department of Computer Science and EngineeringKorea UniversitySeoul02841South Korea
| | - Jaewoo Kang
- Department of Computer Science and EngineeringKorea UniversitySeoul02841South Korea
| |
Collapse
|
45
|
Zhu C, Yang Z, Xia X, Li N, Zhong F, Liu L. Multimodal reasoning based on knowledge graph embedding for specific diseases. Bioinformatics 2022; 38:2235-2245. [PMID: 35150235 PMCID: PMC9004655 DOI: 10.1093/bioinformatics/btac085] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2021] [Revised: 01/06/2022] [Accepted: 02/07/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Knowledge Graph (KG) is becoming increasingly important in the biomedical field. Deriving new and reliable knowledge from existing knowledge by KG embedding technology is a cutting-edge method. Some add a variety of additional information to aid reasoning, namely multimodal reasoning. However, few works based on the existing biomedical KGs are focused on specific diseases. RESULTS This work develops a construction and multimodal reasoning process of Specific Disease Knowledge Graphs (SDKGs). We construct SDKG-11, a SDKG set including five cancers, six non-cancer diseases, a combined Cancer5 and a combined Diseases11, aiming to discover new reliable knowledge and provide universal pre-trained knowledge for that specific disease field. SDKG-11 is obtained through original triplet extraction, standard entity set construction, entity linking and relation linking. We implement multimodal reasoning by reverse-hyperplane projection for SDKGs based on structure, category and description embeddings. Multimodal reasoning improves pre-existing models on all SDKGs using entity prediction task as the evaluation protocol. We verify the model's reliability in discovering new knowledge by manually proofreading predicted drug-gene, gene-disease and disease-drug pairs. Using embedding results as initialization parameters for the biomolecular interaction classification, we demonstrate the universality of embedding models. AVAILABILITY AND IMPLEMENTATION The constructed SDKG-11 and the implementation by TensorFlow are available from https://github.com/ZhuChaoY/SDKG-11. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Chaoyu Zhu
- Institute of Biomedical Sciences and School of Basic Medical Science, Shanghai Medical College, Fudan University, Shanghai 200032, China
| | - Zhihao Yang
- College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Xiaoqiong Xia
- Institute of Biomedical Sciences and School of Basic Medical Science, Shanghai Medical College, Fudan University, Shanghai 200032, China
| | - Nan Li
- College of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Fan Zhong
- To whom correspondence should be addressed. or
| | - Lei Liu
- To whom correspondence should be addressed. or
| |
Collapse
|
46
|
Chai Z, Jin H, Shi S, Zhan S, Zhuo L, Yang Y. Hierarchical shared transfer learning for biomedical named entity recognition. BMC Bioinformatics 2022; 23:8. [PMID: 34983362 PMCID: PMC8729142 DOI: 10.1186/s12859-021-04551-4] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2021] [Accepted: 12/22/2021] [Indexed: 02/01/2023] Open
Abstract
BACKGROUND Biomedical named entity recognition (BioNER) is a basic and important medical information extraction task to extract medical entities with special meaning from medical texts. In recent years, deep learning has become the main research direction of BioNER due to its excellent data-driven context coding ability. However, in BioNER task, deep learning has the problem of poor generalization and instability. RESULTS we propose the hierarchical shared transfer learning, which combines multi-task learning and fine-tuning, and realizes the multi-level information fusion between the underlying entity features and the upper data features. We select 14 datasets containing 4 types of entities for training and evaluate the model. The experimental results showed that the F1-scores of the five gold standard datasets BC5CDR-chemical, BC5CDR-disease, BC2GM, BC4CHEMD, NCBI-disease and LINNAEUS were increased by 0.57, 0.90, 0.42, 0.77, 0.98 and - 2.16 compared to the single-task XLNet-CRF model. BC5CDR-chemical, BC5CDR-disease and BC4CHEMD achieved state-of-the-art results.The reasons why LINNAEUS's multi-task results are lower than single-task results are discussed at the dataset level. CONCLUSION Compared with using multi-task learning and fine-tuning alone, the model has more accurate recognition ability of medical entities, and has higher generalization and stability.
Collapse
Affiliation(s)
- Zhaoying Chai
- College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
| | - Han Jin
- College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
| | - Shenghui Shi
- College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China.
| | - Siyan Zhan
- School of Public Health, Peking University, Beijing, China.
| | - Lin Zhuo
- Research Center of Clinical Epidemiology, Peking University Third Hospital, Beijing, China
| | - Yu Yang
- National Institute of Health Data Science, Peking University, Beijing, China
| |
Collapse
|
47
|
Cheng M, Xiong S, Li F, Liang P, Gao J. Multi-task learning for Chinese clinical named entity recognition with external knowledge. BMC Med Inform Decis Mak 2021; 21:372. [PMID: 34972505 PMCID: PMC8719412 DOI: 10.1186/s12911-021-01717-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2021] [Accepted: 12/07/2021] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND Named entity recognition (NER) on Chinese electronic medical/healthcare records has attracted significantly attentions as it can be applied to building applications to understand these records. Most previous methods have been purely data-driven, requiring high-quality and large-scale labeled medical data. However, labeled data is expensive to obtain, and these data-driven methods are difficult to handle rare and unseen entities. METHODS To tackle these problems, this study presents a novel multi-task deep neural network model for Chinese NER in the medical domain. We incorporate dictionary features into neural networks, and a general secondary named entity segmentation is used as auxiliary task to improve the performance of the primary task of named entity recognition. RESULTS In order to evaluate the proposed method, we compare it with other currently popular methods, on three benchmark datasets. Two of the datasets are publicly available, and the other one is constructed by us. Experimental results show that the proposed model achieves 91.07% average f-measure on the two public datasets and 87.05% f-measure on private dataset. CONCLUSIONS The comparison results of different models demonstrated the effectiveness of our model. The proposed model outperformed traditional statistical models.
Collapse
Affiliation(s)
- Ming Cheng
- Department of Medical Information, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
| | - Shufeng Xiong
- Colleges of Information and Management Science, Henan Agricultural University, Zhengzhou, China
| | - Fei Li
- School of Cyber Science and Engineering, Wuhan University, Wuhan, China
| | - Pan Liang
- Department of Radiology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
| | - Jianbo Gao
- Department of Radiology, The First Affiliated Hospital of Zhengzhou University, Zhengzhou, China
| |
Collapse
|
48
|
Xiong Y, Chen S, Tang B, Chen Q, Wang X, Yan J, Zhou Y. Improving deep learning method for biomedical named entity recognition by using entity definition information. BMC Bioinformatics 2021; 22:600. [PMID: 34920699 PMCID: PMC8680061 DOI: 10.1186/s12859-021-04236-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2021] [Accepted: 06/04/2021] [Indexed: 11/19/2022] Open
Abstract
BACKGROUND Biomedical named entity recognition (NER) is a fundamental task of biomedical text mining that finds the boundaries of entity mentions in biomedical text and determines their entity type. To accelerate the development of biomedical NER techniques in Spanish, the PharmaCoNER organizers launched a competition to recognize pharmacological substances, compounds, and proteins. Biomedical NER is usually recognized as a sequence labeling task, and almost all state-of-the-art sequence labeling methods ignore the meaning of different entity types. In this paper, we investigate some methods to introduce the meaning of entity types in deep learning methods for biomedical NER and apply them to the PharmaCoNER 2019 challenge. The meaning of each entity type is represented by its definition information. MATERIAL AND METHOD We investigate how to use entity definition information in the following two methods: (1) SQuad-style machine reading comprehension (MRC) methods that treat entity definition information as query and biomedical text as context and predict answer spans as entities. (2) Span-level one-pass (SOne) methods that predict entity spans of one type by one type and introduce entity type meaning, which is represented by entity definition information. All models are trained and tested on the PharmaCoNER 2019 corpus, and their performance is evaluated by strict micro-average precision, recall, and F1-score. RESULTS Entity definition information brings improvements to both SQuad-style MRC and SOne methods by about 0.003 in micro-averaged F1-score. The SQuad-style MRC model using entity definition information as query achieves the best performance with a micro-averaged precision of 0.9225, a recall of 0.9050, and an F1-score of 0.9137, respectively. It outperforms the best model of the PharmaCoNER 2019 challenge by 0.0032 in F1-score. Compared with the state-of-the-art model without using manually-crafted features, our model obtains a 1% improvement in F1-score, which is significant. These results indicate that entity definition information is useful for deep learning methods on biomedical NER. CONCLUSION Our entity definition information enhanced models achieve the state-of-the-art micro-average F1 score of 0.9137, which implies that entity definition information has a positive impact on biomedical NER detection. In the future, we will explore more entity definition information from knowledge graph.
Collapse
Affiliation(s)
- Ying Xiong
- Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
- Peng Cheng Laboratory, Shenzhen, China
| | - Shuai Chen
- Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
| | - Buzhou Tang
- Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China.
- Peng Cheng Laboratory, Shenzhen, China.
| | - Qingcai Chen
- Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
- Peng Cheng Laboratory, Shenzhen, China
| | - Xiaolong Wang
- Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
| | - Jun Yan
- Yidu Cloud (Beijing) Technology Co., Ltd, Beijing, China
| | - Yi Zhou
- Zhongshan School of Medicine, Sun Yat-Sen University, Guangzhou, 510080, China.
| |
Collapse
|
49
|
Sun C, Yang Z, Wang L, Zhang Y, Lin H, Wang J. MRC4BioER: Joint extraction of biomedical entities and relations in the machine reading comprehension framework. J Biomed Inform 2021; 125:103956. [PMID: 34848329 DOI: 10.1016/j.jbi.2021.103956] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2021] [Revised: 11/09/2021] [Accepted: 11/16/2021] [Indexed: 10/19/2022]
Abstract
Extracting entities and their relations from unstructured literature to form structured triplets is essential for biomedical knowledge extraction. Because sentences in biomedical datasets usually have many special overlapping triplets, it is difficult to use previous work to extract these triplets effectively. In this work, we propose a novel tagging strategy to achieve joint extraction in the machine reading comprehension framework. On the one hand, our method uses Query in the machine reading comprehension framework to introduce the information of the specific relation. On the other hand, our method introduces a tagging strategy for overlapping triplets in the biomedical domain. We use CHEMPROT and DDIExtraction2013 datasets to evaluate our method. The experimental results demonstrate that our proposed method can enhance the model's ability to deal with overlapping triplets, improving extraction performance.
Collapse
Affiliation(s)
- Cong Sun
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Zhihao Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China.
| | - Lei Wang
- Beijing Institute of Health Administration and Medical Information, Beijing 100850, China.
| | - Yin Zhang
- Beijing Institute of Health Administration and Medical Information, Beijing 100850, China
| | - Hongfei Lin
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| | - Jian Wang
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
| |
Collapse
|
50
|
Learning with joint cross-document information via multi-task learning for named entity recognition. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2021.08.015] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
|