1
Hou G, Jian Y, Zhao Q, Quan X, Zhang H. Language model based on deep learning network for biomedical named entity recognition. Methods 2024; 226:71-77. [PMID: 38641084 DOI: 10.1016/j.ymeth.2024.04.013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Revised: 12/22/2023] [Accepted: 04/16/2024] [Indexed: 04/21/2024]
Abstract
Biomedical Named Entity Recognition (BioNER) is one of the most basic tasks in biomedical text mining; it aims to automatically identify and classify biomedical entities in text. Recently, deep learning-based methods have been applied to BioNER and have shown encouraging results. However, many biological entities are polysemous and ambiguous, which is one of the main obstacles to the task. Deep learning methods also require large amounts of training data, so the lack of data affects recognition performance as well. To address polysemous words and insufficient data in biomedical named entity recognition, we propose a multi-task learning framework fused with a language model, based on the BiLSTM-CRF architecture. Our model uses a language model to design a differential encoding of the context, which yields dynamic word vectors that distinguish words in different datasets. Moreover, we use a multi-task learning method to share the dynamic word vectors of different entity types and improve the recognition performance for each type. Experimental results show that our model reduces the false positives caused by polysemous words through differentiated coding, and improves the performance of each subtask by sharing information between different entity data. Compared with other state-of-the-art methods, our model achieved superior results on four typical training sets, with the best F1 values.
Affiliation(s)
- Guan Hou
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Yuhao Jian
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Qingqing Zhao
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Xiongwen Quan
  - College of Artificial Intelligence, Nankai University, Tianjin, China
- Han Zhang
  - College of Artificial Intelligence, Nankai University, Tianjin, China
2
Alamro H, Gojobori T, Essack M, Gao X. BioBBC: a multi-feature model that enhances the detection of biomedical entities. Sci Rep 2024; 14:7697. [PMID: 38565624 PMCID: PMC10987643 DOI: 10.1038/s41598-024-58334-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 03/27/2024] [Indexed: 04/04/2024]
Abstract
The rapid increase in biomedical publications necessitates efficient systems to automatically handle Biomedical Named Entity Recognition (BioNER) tasks in unstructured text. However, accurately detecting biomedical entities is quite challenging due to the complexity of their names and the frequent use of abbreviations. In this paper, we propose BioBBC, a deep learning (DL) model that utilizes multi-feature embeddings and is built on the BERT-BiLSTM-CRF architecture to address the BioNER task. BioBBC consists of three main layers: an embedding layer, a bidirectional Long Short-Term Memory (BiLSTM) layer, and a Conditional Random Fields (CRF) layer. BioBBC takes sentences from the biomedical domain as input and identifies the biomedical entities mentioned within the text. The embedding layer generates enriched contextual representation vectors of the input by learning the text through four types of embeddings: part-of-speech (POS) tag embedding, char-level embedding, BERT embedding, and data-specific embedding. The BiLSTM layer produces additional syntactic and semantic feature representations. Finally, the CRF layer identifies the best possible tag sequence for the input sentence. Our model is well-constructed and well-optimized for detecting different types of biomedical entities. In experiments on six benchmark BioNER datasets, our model outperformed state-of-the-art (SOTA) models with significant improvements.
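The last step in the abstract, where the CRF layer picks the best tag sequence, is commonly implemented with Viterbi decoding. The following is a minimal pure-Python sketch of that idea, not the authors' implementation: the function name `viterbi_decode`, the three-tag set, and all emission and transition scores are illustrative assumptions.

```python
def viterbi_decode(emissions, transitions, tags):
    """Return the highest-scoring tag sequence and its score.

    emissions:   list of dicts, emissions[t][tag] = score of `tag` at position t
    transitions: dict, transitions[(prev_tag, tag)] = score of prev_tag -> tag
    """
    # best[t][tag] = best score of any path that ends in `tag` at position t
    best = [{tag: emissions[0][tag] for tag in tags}]
    backpointers = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for tag in tags:
            prev = max(tags, key=lambda p: best[-1][p] + transitions[(p, tag)])
            scores[tag] = best[-1][prev] + transitions[(prev, tag)] + emissions[t][tag]
            pointers[tag] = prev
        best.append(scores)
        backpointers.append(pointers)
    # Follow the back-pointers from the best final tag to recover the path.
    last = max(tags, key=lambda tag: best[-1][tag])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[-1][last]


# Hypothetical scores for a three-token sentence (not taken from the paper).
tags = ["O", "B", "I"]
transitions = {(p, t): 0.0 for p in tags for t in tags}
transitions[("O", "I")] = -100.0  # an entity cannot continue right after "O"
emissions = [
    {"O": 2.0, "B": 1.5, "I": 0.0},
    {"O": 0.0, "B": 0.5, "I": 3.0},
    {"O": 1.0, "B": 0.0, "I": 0.0},
]
path, score = viterbi_decode(emissions, transitions, tags)
print(path, score)
```

Picking the best tag per token independently would give O, I, O here, which violates the tagging scheme. The transition penalty is what lets sequence-level decoding prefer B, I, O instead.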
Affiliation(s)
- Hind Alamro
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia
- Takashi Gojobori
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Magbubah Essack
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Xin Gao
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
3
Cao L, Wu C, Luo G, Guo C, Zheng A. Online biomedical named entities recognition by data and knowledge-driven model. Artif Intell Med 2024; 150:102813. [PMID: 38553155 DOI: 10.1016/j.artmed.2024.102813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 12/15/2023] [Accepted: 02/12/2024] [Indexed: 04/02/2024]
Abstract
Named entity recognition (NER) is an important task for the natural language processing of biomedical text. Currently, most NER studies address standardized biomedical text, while NER for unstandardized biomedical text draws less attention from researchers. Named entities in online biomedical text contain errors and polymorphisms, which negatively impact NER models' performance and impede support from knowledge representation methods. In this paper, we propose a neural network method that can effectively recognize entities in unstandardized online medical/health text. We introduce a new pre-training scheme that uses large-scale online question-answering pairs to enhance transformers' model capacity on online biomedical text. Moreover, we supply models with knowledge representations from a knowledge base, called multi-channel knowledge labels; this method overcomes the restriction imposed by languages, like Chinese, that require word segmentation tools to represent knowledge. Our model significantly outperforms other baseline methods in experiments on a dataset for Chinese online medical entity recognition and achieves state-of-the-art results.
Affiliation(s)
- Lulu Cao
  - Department of Rheumatology and Immunology, Peking University People's Hospital, 100044, China
- Chaochen Wu
  - Renmin University of China, Beijing, 100872, China
- Guan Luo
  - State Key Laboratory of Multimodal Artificial Intelligence Systems Institute of Automation, Chinese Academy of Sciences, China
- Chao Guo
  - Department of Cardiology, Fuwai Hospital CAMS and PUMC, Beijing, 100037, China
- Anni Zheng
  - State Key Laboratory of Multimodal Artificial Intelligence Systems Institute of Automation, Chinese Academy of Sciences, China
4
Chen JQ, Zhu ZC, Zhang F, Zeng K, Jiang HZ, Cheng ZN. A BIGRU-Based Stacked Attention Network for Biomedical Named Entity Recognition with Chinese EMRs. Stud Health Technol Inform 2023; 308:757-767. [PMID: 38007808 DOI: 10.3233/shti230909] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2023]
Abstract
Biomedical named entity recognition (BNER) is an effective method for structuring medical text data. It is an important basic task for building medical application services such as medical knowledge graphs and intelligent auxiliary diagnosis systems. Existing medical named entity recognition methods generally leverage a word embedding model to construct text representations, and then integrate multiple semantic understanding models to enhance semantic understanding and achieve high-performance entity recognition. However, first, the medical field has many professional terms that rarely appear in the general domain and cannot be represented well by general-domain word embedding models. Second, existing approaches typically focus only on the extraction of global semantic features, which causes a loss of local semantic features between characters. Moreover, as the word embedding dimension becomes much higher, the standard single-layer structure fails to fully and deeply extract the global semantic features. We put forward the BIGRU-based Stacked Attention Network (BSAN) model for biomedical named entity recognition. First, we use large-scale real-world electronic medical record (EMR) data to fine-tune BERT and build proprietary embedding representations of medical terms. Second, we use a Convolutional Neural Network model to extract semantic features. Finally, a stacked BIGRU is constructed using a multi-layer structure and a novel stacking method. It not only enables comprehensive and in-depth extraction of global semantic features, but also requires less time. Validated experimentally on real-world Chinese EMR datasets, the proposed BSAN model achieves an F1 value of 90.9%, which is stronger than the BNER performance of other state-of-the-art models.
Affiliation(s)
- Jie-Qing Chen
  - Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, China
- Zhi-Chao Zhu
  - Beijing University of Technology, Beijing, China
- Feng Zhang
  - Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, China
- Ke Zeng
  - Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, China
- Hui-Zhen Jiang
  - Peking Union Medical College Hospital, Chinese Academy of Medical Sciences and Peking Union Medical College (CAMS & PUMC), Beijing, China
6
Guan Z, Zhou X. A prefix and attention map discrimination fusion guided attention for biomedical named entity recognition. BMC Bioinformatics 2023; 24:42. [PMID: 36755230 PMCID: PMC9907889 DOI: 10.1186/s12859-023-05172-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/05/2022] [Accepted: 02/03/2023] [Indexed: 02/10/2023]
Abstract
BACKGROUND The biomedical literature is growing rapidly, and it is increasingly important to extract meaningful information from the vast amount of literature. Biomedical named entity recognition (BioNER) is one of the key and fundamental tasks in biomedical text mining. It also acts as a primitive step for many downstream applications such as relation extraction and knowledge base completion. Therefore, the accurate identification of entities in biomedical literature has certain research value. However, this task is challenging due to the insufficiency of sequence labeling and the lack of large-scale labeled training data and domain knowledge. RESULTS In this paper, we use a novel word-pair classification method, design a simple attention mechanism, and propose a novel architecture to solve the research difficulties of BioNER more efficiently without leveraging any external knowledge. Specifically, we break down the limitations of sequence labeling-based approaches by predicting the relationship between word pairs. Based on this, we enhance the pre-trained model BioBERT through the proposed prefix and attention map discrimination fusion guided attention, and propose E-BioBERT. Our proposed attention differentiates the distribution of different heads in different layers of BioBERT, which enriches the diversity of self-attention. Our model is superior to the state-of-the-art models it is compared with on five available datasets: BC4CHEMD, BC2GM, BC5CDR-Disease, BC5CDR-Chem, and NCBI-Disease, achieving F1-scores of 92.55%, 85.45%, 87.53%, 94.16% and 90.55%, respectively. CONCLUSION Compared with many previous models, our method does not require additional training datasets, external knowledge, or a complex training process. The experimental results on five BioNER benchmark datasets demonstrate that our model is better at mining semantic information, alleviates the problem of label inconsistency, and has higher entity recognition ability. More importantly, we analyze and demonstrate the effectiveness of our proposed attention.
Affiliation(s)
- Zhengyi Guan
  - School of Information Science and Engineering, Yunnan University, Kunming, China
- Xiaobing Zhou
  - School of Information Science and Engineering, Yunnan University, Kunming, China
7
Li M, Yang H, Liu Y. Biomedical named entity recognition based on fusion multi-features embedding. Technol Health Care 2023; 31:111-121. [PMID: 37038786 DOI: 10.3233/thc-236011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
BACKGROUND With the exponential increase in the volume of biomedical literature, text mining tasks are becoming increasingly important in the medical domain. Named entity recognition is a primary task in text mining and a prerequisite and critical component for building medical domain knowledge graphs, medical question-answering systems, and medical text classification. OBJECTIVE The goal of the study is to recognize biomedical entities effectively by fusing multi-feature embeddings. Multiple features provide more comprehensive information so that better predictions can be obtained. METHODS First, three different kinds of features are generated at the word representation layer: deep contextual word-level features, local char-level features, and part-of-speech features. The word representation vectors are then fed into a BiLSTM as features to obtain dependency information. Finally, the CRF algorithm is used to learn the features of the state sequences and obtain the globally optimal tagging sequences. RESULTS The experimental results showed that the model outperformed other state-of-the-art methods in all-around performance on six of eight datasets covering four biomedical entity types. CONCLUSION The proposed method has a positive effect on the prediction results. It comprehensively considers the relevant factors of named entity recognition because the semantic information is enhanced by fusing multi-feature embeddings.
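The feature-fusion step in METHODS can be pictured as building one vector per token from the three feature streams. The sketch below assumes that fusion is plain concatenation, which the abstract does not state. The function name `fuse_features`, the vector sizes, and the toy values are all hypothetical.

```python
def fuse_features(word_vecs, char_vecs, pos_vecs):
    """Concatenate per-token feature vectors into one representation per token."""
    if not (len(word_vecs) == len(char_vecs) == len(pos_vecs)):
        raise ValueError("all feature streams must cover the same tokens")
    # List concatenation: [word-level] + [char-level] + [part-of-speech]
    return [w + c + p for w, c, p in zip(word_vecs, char_vecs, pos_vecs)]


# Toy two-token sentence; every number is made up for illustration.
word_vecs = [[0.1, 0.2], [0.3, 0.4]]   # contextual word-level features
char_vecs = [[1.0], [2.0]]             # char-level features
pos_vecs = [[1, 0, 0], [0, 1, 0]]      # one-hot part-of-speech tags
fused = fuse_features(word_vecs, char_vecs, pos_vecs)
print(fused[0])
```

Under this assumption each fused vector has 2 + 1 + 3 = 6 dimensions. In a real model these per-token vectors would be the input to the BiLSTM layer.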
8
Zheng X, Du H, Luo X, Tong F, Song W, Zhao D. BioByGANS: biomedical named entity recognition by fusing contextual and syntactic features through graph attention network in node classification framework. BMC Bioinformatics 2022; 23:501. [PMID: 36418937 PMCID: PMC9682683 DOI: 10.1186/s12859-022-05051-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/05/2022] [Accepted: 11/10/2022] [Indexed: 11/24/2022]
Abstract
BACKGROUND Automatic and accurate recognition of various biomedical named entities from literature is an important task of biomedical text mining, which is the foundation of extracting biomedical knowledge from unstructured texts into structured formats. Using the sequence labeling framework and deep neural networks to implement biomedical named entity recognition (BioNER) is a common method at present. However, this method often underutilizes syntactic features such as the dependencies and topology of sentences. Integrating semantic and syntactic features into the BioNER model is therefore an urgent problem. RESULTS In this paper, we propose a novel biomedical named entity recognition model, named BioByGANS (BioBERT/SpaCy-Graph Attention Network-Softmax), which uses a graph to model the dependencies and topology of a sentence and formulates the BioNER task as a node classification problem. This formulation can introduce more topological features of language and is no longer concerned only with the distance between words in the sequence. First, we use periods to segment sentences, and spaces and symbols to segment words. Second, contextual features are encoded by BioBERT, and syntactic features such as parts of speech, dependencies and topology are preprocessed by SpaCy. A graph attention network is then used to generate a fused representation that considers both the contextual and the syntactic features. Last, a softmax function is used to calculate the probabilities and obtain the results. We conduct experiments on 8 benchmark datasets, and our proposed model outperforms existing BioNER state-of-the-art methods on the BC2GM, JNLPBA, BC4CHEMD, BC5CDR-chem, BC5CDR-disease, NCBI-disease, Species-800, and LINNAEUS datasets, achieving F1-scores of 85.15%, 78.16%, 92.97%, 94.74%, 87.74%, 91.57%, 75.01%, and 90.99%, respectively. CONCLUSION The experimental results on 8 biomedical benchmark datasets demonstrate the effectiveness of our model, and indicate that formulating the BioNER task as a node classification problem and incorporating syntactic features into the graph attention network can significantly improve model performance.
Affiliation(s)
- Xiangwen Zheng
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Haijian Du
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Xiaowei Luo
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Fan Tong
  - Academy of Military Medical Sciences, Beijing, 100039, China
- Wei Song
  - Beijing MedPeer Information Technology Co., Ltd, Beijing, 102300, China
- Dongsheng Zhao
  - Academy of Military Medical Sciences, Beijing, 100039, China
9
Chai Z, Jin H, Shi S, Zhan S, Zhuo L, Yang Y. Hierarchical shared transfer learning for biomedical named entity recognition. BMC Bioinformatics 2022; 23:8. [PMID: 34983362 PMCID: PMC8729142 DOI: 10.1186/s12859-021-04551-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/14/2021] [Accepted: 12/22/2021] [Indexed: 02/01/2023]
Abstract
BACKGROUND Biomedical named entity recognition (BioNER) is a basic and important medical information extraction task that extracts medical entities with special meaning from medical texts. In recent years, deep learning has become the main research direction of BioNER due to its excellent data-driven context coding ability. However, in the BioNER task, deep learning suffers from poor generalization and instability. RESULTS We propose hierarchical shared transfer learning, which combines multi-task learning and fine-tuning, and realizes multi-level information fusion between the underlying entity features and the upper data features. We select 14 datasets containing 4 types of entities for training and evaluate the model. The experimental results showed that the F1-scores on the six gold standard datasets BC5CDR-chemical, BC5CDR-disease, BC2GM, BC4CHEMD, NCBI-disease and LINNAEUS changed by 0.57, 0.90, 0.42, 0.77, 0.98 and -2.16, respectively, compared to the single-task XLNet-CRF model. BC5CDR-chemical, BC5CDR-disease and BC4CHEMD achieved state-of-the-art results. The reasons why LINNAEUS's multi-task results are lower than its single-task results are discussed at the dataset level. CONCLUSION Compared with using multi-task learning or fine-tuning alone, the model recognizes medical entities more accurately and has higher generalization and stability.
Affiliation(s)
- Zhaoying Chai
  - College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
- Han Jin
  - College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
- Shenghui Shi
  - College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China
- Siyan Zhan
  - School of Public Health, Peking University, Beijing, China
- Lin Zhuo
  - Research Center of Clinical Epidemiology, Peking University Third Hospital, Beijing, China
- Yu Yang
  - National Institute of Health Data Science, Peking University, Beijing, China
10
Xiong Y, Chen S, Tang B, Chen Q, Wang X, Yan J, Zhou Y. Improving deep learning method for biomedical named entity recognition by using entity definition information. BMC Bioinformatics 2021; 22:600. [PMID: 34920699 PMCID: PMC8680061 DOI: 10.1186/s12859-021-04236-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/30/2021] [Accepted: 06/04/2021] [Indexed: 11/19/2022]
Abstract
BACKGROUND Biomedical named entity recognition (NER) is a fundamental task of biomedical text mining that finds the boundaries of entity mentions in biomedical text and determines their entity type. To accelerate the development of biomedical NER techniques in Spanish, the PharmaCoNER organizers launched a competition to recognize pharmacological substances, compounds, and proteins. Biomedical NER is usually treated as a sequence labeling task, and almost all state-of-the-art sequence labeling methods ignore the meaning of different entity types. In this paper, we investigate some methods to introduce the meaning of entity types in deep learning methods for biomedical NER and apply them to the PharmaCoNER 2019 challenge. The meaning of each entity type is represented by its definition information. MATERIAL AND METHOD We investigate how to use entity definition information in the following two methods: (1) SQuAD-style machine reading comprehension (MRC) methods that treat entity definition information as the query and biomedical text as the context, and predict answer spans as entities. (2) Span-level one-pass (SOne) methods that predict entity spans one type at a time and introduce entity type meaning, which is represented by entity definition information. All models are trained and tested on the PharmaCoNER 2019 corpus, and their performance is evaluated by strict micro-average precision, recall, and F1-score. RESULTS Entity definition information brings improvements to both SQuAD-style MRC and SOne methods of about 0.003 in micro-averaged F1-score. The SQuAD-style MRC model using entity definition information as the query achieves the best performance, with a micro-averaged precision of 0.9225, a recall of 0.9050, and an F1-score of 0.9137. It outperforms the best model of the PharmaCoNER 2019 challenge by 0.0032 in F1-score. Compared with the state-of-the-art model without using manually-crafted features, our model obtains a 1% improvement in F1-score, which is significant. These results indicate that entity definition information is useful for deep learning methods on biomedical NER. CONCLUSION Our entity definition information enhanced models achieve the state-of-the-art micro-average F1 score of 0.9137, which implies that entity definition information has a positive impact on biomedical NER detection. In the future, we will explore more entity definition information from knowledge graphs.
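The evaluation metric named in the abstract, strict micro-averaged precision, recall, and F1, can be sketched as exact matching of entity spans pooled over all documents. This is a minimal illustration, not the official PharmaCoNER scorer: the function name `strict_micro_prf`, the tuple layout, and the entity labels `CHEMICAL` and `PROTEIN` are assumptions made for the example.

```python
def strict_micro_prf(gold, predicted):
    """Strict micro-averaged precision, recall and F1.

    gold / predicted: sets of (doc_id, start, end, entity_type) tuples.
    A prediction counts as correct only if boundaries AND type match exactly.
    """
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical annotations; labels and offsets are made up for illustration.
gold = {
    ("doc1", 0, 9, "CHEMICAL"),
    ("doc1", 15, 22, "PROTEIN"),
    ("doc2", 3, 11, "CHEMICAL"),
    ("doc2", 30, 41, "PROTEIN"),
}
predicted = {
    ("doc1", 0, 9, "CHEMICAL"),
    ("doc1", 15, 22, "CHEMICAL"),  # right span, wrong type: not a strict match
    ("doc2", 3, 11, "CHEMICAL"),
}
precision, recall, f1 = strict_micro_prf(gold, predicted)

# Consistency check on the figures reported in the abstract.
reported_f1 = 2 * 0.9225 * 0.9050 / (0.9225 + 0.9050)
print(precision, recall, f1, round(reported_f1, 4))
```

The last lines recompute F1 as the harmonic mean of the reported precision (0.9225) and recall (0.9050). Rounded to four decimals this gives 0.9137, which agrees with the F1-score stated in the abstract.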
Affiliation(s)
- Ying Xiong
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
  - Peng Cheng Laboratory, Shenzhen, China
- Shuai Chen
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
- Buzhou Tang
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
  - Peng Cheng Laboratory, Shenzhen, China
- Qingcai Chen
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
  - Peng Cheng Laboratory, Shenzhen, China
- Xiaolong Wang
  - Department of Computer Science, Harbin Institute of Technology, Shenzhen, Shenzhen, 518055, China
- Jun Yan
  - Yidu Cloud (Beijing) Technology Co., Ltd, Beijing, China
- Yi Zhou
  - Zhongshan School of Medicine, Sun Yat-Sen University, Guangzhou, 510080, China
11
Zhou H, Liu Z, Lang C, Xu Y, Lin Y, Hou J. Improving the recall of biomedical named entity recognition with label re-correction and knowledge distillation. BMC Bioinformatics 2021; 22:295. [PMID: 34078270 PMCID: PMC8170952 DOI: 10.1186/s12859-021-04200-w] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/10/2020] [Accepted: 05/06/2021] [Indexed: 11/10/2022]
Abstract
Background Biomedical named entity recognition is one of the most essential tasks in biomedical information extraction. Previous studies suffer from inadequate annotated datasets, especially the limited knowledge contained in them. Methods To remedy the above issue, we propose a novel Biomedical Named Entity Recognition (BioNER) framework with label re-correction and knowledge distillation strategies, which could not only create large and high-quality datasets but also obtain a high-performance recognition model. Our framework is inspired by two points: (1) named entity recognition should be considered from the perspective of both coverage and accuracy; (2) trustable annotations should be yielded by iterative correction. Firstly, for coverage, we annotate chemical and disease entities in a large-scale unlabeled dataset by PubTator to generate a weakly labeled dataset. For accuracy, we then filter it by utilizing multiple knowledge bases to generate another weakly labeled dataset. Next, the two datasets are revised by a label re-correction strategy to construct two high-quality datasets, which are used to train two recognition models, respectively. Finally, we compress the knowledge in the two models into a single recognition model with knowledge distillation. Results Experiments on the BioCreative V chemical-disease relation corpus and NCBI Disease corpus show that knowledge from large-scale datasets significantly improves the performance of BioNER, especially the recall of it, leading to new state-of-the-art results. Conclusions We propose a framework with label re-correction and knowledge distillation strategies. Comparison results show that the two perspectives of knowledge in the two re-corrected datasets respectively are complementary and both effective for BioNER.
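The final step of the framework, compressing two recognition models into one, can be illustrated with the standard soft-target form of knowledge distillation. The abstract does not give the loss, so everything here is an assumption: the sketch averages the two teachers' temperature-softened tag distributions and measures the KL divergence to the student's distribution for a single token. The names `softmax` and `distillation_loss`, the temperature, and the logits are all hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits_list, temperature=2.0):
    """KL(mean teacher distribution || student distribution) for one token."""
    teacher_probs = [softmax(t, temperature) for t in teacher_logits_list]
    count = len(teacher_probs)
    # Assumed combination rule: average the teachers' soft targets.
    target = [sum(p[i] for p in teacher_probs) / count
              for i in range(len(student_logits))]
    student = softmax(student_logits, temperature)
    return sum(q * math.log(q / s) for q, s in zip(target, student) if q > 0)


# Hypothetical per-token logits over three tags (for example O, B, I).
teacher_a = [0.2, 2.5, 0.1]
teacher_b = [0.5, 1.8, 0.4]
student = [1.0, 1.0, 1.0]
loss_untrained = distillation_loss(student, [teacher_a, teacher_b])
loss_matched = distillation_loss(teacher_a, [teacher_a])
print(loss_untrained, loss_matched)
```

The loss is zero when the student reproduces its teacher exactly and positive otherwise, so minimizing it pulls the student's tag distribution toward the teachers' combined one.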
Affiliation(s)
- Huiwei Zhou
  - School of Computer Science and Technology, Ganjingzi District, Dalian University of Technology, Address Chuangxinyuan Building, No.2 Linggong Road, Dalian, 116024, Liaoning, China
- Zhe Liu
  - School of Computer Science and Technology, Ganjingzi District, Dalian University of Technology, Address Chuangxinyuan Building, No.2 Linggong Road, Dalian, 116024, Liaoning, China
- Chengkun Lang
  - School of Computer Science and Technology, Ganjingzi District, Dalian University of Technology, Address Chuangxinyuan Building, No.2 Linggong Road, Dalian, 116024, Liaoning, China
- Yibin Xu
  - School of Computer Science and Technology, Ganjingzi District, Dalian University of Technology, Address Chuangxinyuan Building, No.2 Linggong Road, Dalian, 116024, Liaoning, China
- Yingyu Lin
  - School of Foreign Languages, Ganjingzi District, Dalian University of Technology, Address Chuangxinyuan Building, No.2 Linggong Road, Dalian, 116024, Liaoning, China
- Junjie Hou
  - School of Business, Panjin Campus of Dalian University of Technology, No. 2 Dagong Road, Liaodongwan New District, PanJin, 124221, Liaoning, China
12
Gajendran S, D M, Sugumaran V. Character level and word level embedding with bidirectional LSTM - Dynamic recurrent neural network for biomedical named entity recognition from literature. J Biomed Inform 2020; 112:103609. [PMID: 33122119 DOI: 10.1016/j.jbi.2020.103609] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 03/18/2020] [Revised: 10/14/2020] [Accepted: 10/22/2020] [Indexed: 12/22/2022]
Abstract
Named Entity Recognition is the process of identifying different entities in a given context. Biomedical Named Entity Recognition (BNER) is the task of extracting chemical names from biomedical texts to support biomedical and translational research. The aim of the system is to extract useful chemical names from biomedical literature without relying heavily on handcrafted features. This approach introduces a novel neural network architecture that combines a bidirectional long short-term memory (BLSTM) network, a dynamic recurrent neural network (RNN), and a conditional random field (CRF), and that uses character-level and word-level embeddings as the only features to identify chemical entities. Using this approach, we achieved an F1 score of 89.98 on the BioCreAtIvE II GM corpus and 90.84 on the NCBI corpus, outperforming existing systems. Our system is based on a deep neural architecture that uses both character-level and word-level embeddings, which capture morphological and orthographic information and eliminate the need for handcrafted features. The embeddings, together with the bidirectional LSTM network, proved to be an effective method for identifying most of the chemical entities.
Affiliation(s)
- Sudhakaran Gajendran
  - Department of Computer Science and Engineering, College of Engineering Guindy, Anna University, Chennai, India.
- Manjula D
  - Department of Computer Science and Engineering, College of Engineering Guindy, Anna University, Chennai, India.
- Vijayan Sugumaran
  - Center for Data Science and Big Data Analytics, Oakland University, Rochester, MI, USA; Department of Decision and Information Sciences, School of Business Administration, Oakland University, Rochester, MI, USA.
|
13
|
El-Allaly ED, Sarrouti M, En-Nahnahi N, Ouatik El Alaoui S. An adverse drug effect mentions extraction method based on weighted online recurrent extreme learning machine. Comput Methods Programs Biomed 2019; 176:33-41. [PMID: 31200909 DOI: 10.1016/j.cmpb.2019.04.029] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 01/06/2019] [Revised: 04/05/2019] [Accepted: 04/26/2019] [Indexed: 06/09/2023]
Abstract
BACKGROUND AND OBJECTIVE Automatic extraction of adverse drug effect (ADE) mentions from biomedical texts is a challenging research problem that has attracted significant attention from the pharmacovigilance and biomedical text mining communities. Deep learning based methods have recently been employed to address this problem with great success. However, they fail to effectively identify the boundaries of mentions. In this paper, we propose a weighted online recurrent extreme learning machine (WOR-ELM) based method to overcome this drawback. METHODS The proposed method for extracting ADE mentions from biomedical texts is divided into two stages: span detection and ADE mention classification. In the first stage, we use a WOR-ELM to identify the boundaries of the mentions in a given sentence, irrespective of their types. In the second stage, another WOR-ELM classifies the identified mentions into the appropriate type. Both stages use the concatenation of character-level and word-level embeddings as features. The character-level embedding is obtained using a modified online recurrent extreme learning machine, whereas the word-level embedding is obtained from a pre-trained model. RESULTS Several experiments were carried out on a well-known ADE corpus to evaluate the effectiveness and demonstrate the usefulness of the proposed method. The results show that our method achieves an F-score of 87.5%, which outperforms the current state-of-the-art methods. CONCLUSIONS Our results indicate that the proposed method for extracting adverse drug effect mentions from text can significantly improve performance over existing methods. Our experiments show the effectiveness of incorporating word-level and character-level embeddings as features for WOR-ELM. They also illustrate the benefits of using IOU segment representation for ADE mentions.
Affiliation(s)
- Ed-Drissiya El-Allaly
  - Laboratory of Informatics and Modeling, FSDM, Sidi Mohammed Ben Abdellah University, Fez, Morocco.
- Mourad Sarrouti
  - Laboratory of Informatics and Modeling, FSDM, Sidi Mohammed Ben Abdellah University, Fez, Morocco.
- Noureddine En-Nahnahi
  - Laboratory of Informatics and Modeling, FSDM, Sidi Mohammed Ben Abdellah University, Fez, Morocco.
- Said Ouatik El Alaoui
  - Laboratory of Informatics and Modeling, FSDM, Sidi Mohammed Ben Abdellah University, Fez, Morocco.
|
14
|
Hemati W, Mehler A. CRFVoter: gene and protein related object recognition using a conglomerate of CRF-based tools. J Cheminform 2019; 11:21. [PMID: 30874918 PMCID: PMC6419804 DOI: 10.1186/s13321-019-0343-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/08/2018] [Accepted: 03/01/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Gene and protein related objects are an important class of entities in biomedical research, whose identification and extraction from scientific articles is attracting increasing interest. In this work, we describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of gene and protein related objects. For this purpose, we transform the task as posed by BioCreative V.5 into a sequence labeling problem. We present a series of sequence labeling systems that we used and adapted in our experiments for solving this task. Our experiments show how to optimize the hyperparameters of the classifiers involved. To this end, we utilize various algorithms for hyperparameter optimization. Finally, we present CRFVoter, a two-stage application of Conditional Random Field (CRF) that integrates the optimized sequence labelers from our study into one ensemble classifier. RESULTS We analyze the impact of hyperparameter optimization on named entity recognition in biomedical research and show that this optimization results in a performance increase of up to 60%. In our evaluation, our ensemble classifier based on multiple sequence labelers, called CRFVoter, outperforms each individual extractor. For the blinded test set provided by the BioCreative organizers, CRFVoter achieves an F-score of 75%, a recall of 71% and a precision of 80%. For the GPRO type 1 evaluation, CRFVoter achieves an F-score of 73%, a recall of 70% and the best precision (77%) among all task participants. CONCLUSION CRFVoter is effective when multiple sequence labeling systems are to be used and performs better than the individual systems it combines.
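As a quick consistency check on the figures reported above, the sketch below recomputes the balanced F-score from the stated precision and recall. It assumes the abstract's "F-score" is the usual F1 (the harmonic mean of precision and recall), which the abstract does not state explicitly; the function name `f_score` is illustrative only.

```python
def f_score(precision: float, recall: float) -> float:
    """Balanced F-score (F1): harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract, written as fractions.
blinded_test = f_score(precision=0.80, recall=0.71)
gpro_type1 = f_score(precision=0.77, recall=0.70)

print(f"blinded test set: {blinded_test:.2f}")
print(f"GPRO type 1:      {gpro_type1:.2f}")
```

Under that assumption, both values round to the F-scores quoted in the abstract (75% and 73%).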
Affiliation(s)
- Wahed Hemati
  - Text Technology Lab, Goethe-University Frankfurt, Robert-Mayer-Straße 10, 60325 Frankfurt am Main, Germany.
- Alexander Mehler
  - Text Technology Lab, Goethe-University Frankfurt, Robert-Mayer-Straße 10, 60325 Frankfurt am Main, Germany.
|
15
|
Abstract
Background Biomedical named entity recognition (BNER) is a crucial initial step of information extraction in the biomedical domain. The task is typically modeled as a sequence labeling problem. Various machine learning algorithms, such as Conditional Random Fields (CRFs), have been successfully used for this task. However, these state-of-the-art BNER systems largely depend on hand-crafted features. Results We present a recurrent neural network (RNN) framework based on word embeddings and character representation. On top of the neural network architecture, we use a CRF layer to jointly decode labels for the whole sentence. In our approach, contextual information from both directions and long-range dependencies in the sequence, both of which are useful for this task, can be well modeled by the bidirectional variant and the long short-term memory (LSTM) unit, respectively. Although our models use word embeddings and character embeddings as the only features, the bidirectional LSTM-RNN (BLSTM-RNN) model achieves state-of-the-art performance: 86.55% F1 on the BioCreative II gene mention (GM) corpus and 73.79% F1 on the JNLPBA 2004 corpus. Conclusions Our neural network architecture can be successfully used for BNER without any manual feature engineering. Experimental results show that domain-specific pre-trained word embeddings and character-level representation can improve the performance of the LSTM-RNN models. On the GM corpus, we achieve performance comparable to that of other systems using complex hand-crafted features. On the JNLPBA corpus, our model achieves the best results, outperforming the previously top-performing systems. The source code of our method is freely available under GPL at https://github.com/lvchen1989/BNER.
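To illustrate what "jointly decode labels for the whole sentence" can mean for a CRF layer, here is a minimal Viterbi decoding sketch in plain Python. It is not taken from the authors' repository: the tag set, the emission scores (standing in for BLSTM outputs), and the transition scores are invented for the example, and start/end transitions are omitted for brevity.

```python
from typing import List, Sequence


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring tag sequence for one sentence.

    emissions[t][j]   : score of tag j at token t (e.g. from a BLSTM)
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    n_tags = len(emissions[0])
    score = list(emissions[0])          # best score of any path ending in each tag
    backpointers = []                   # backpointers[t-1][j] = best previous tag
    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for j in range(n_tags):
            best_prev = max(range(n_tags),
                            key=lambda i: score[i] + transitions[i][j])
            pointers.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][j]
                             + emissions[t][j])
        score = new_score
        backpointers.append(pointers)
    # Follow the backpointers from the best final tag to recover the path.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical tag indices: 0 = O, 1 = B, 2 = I.
emissions = [[2.0, 0.5, 0.0],
             [0.5, 1.0, 1.5],
             [0.2, 0.1, 2.0]]
# The only non-zero transition penalizes the invalid move O -> I.
transitions = [[0.0, 0.0, -10.0],
               [0.0, 0.0, 0.0],
               [0.0, 0.0, 0.0]]

greedy = [max(range(3), key=lambda j: row[j]) for row in emissions]
joint = viterbi_decode(emissions, transitions)
print(greedy)  # per-token argmax: [0, 2, 2], i.e. O I I (invalid)
print(joint)   # joint decoding:   [0, 1, 2], i.e. O B I
```

In this toy case, picking the best tag per token yields an `I` directly after `O`, whereas decoding the whole sentence jointly lets the transition scores rule that sequence out.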
Affiliation(s)
- Chen Lyu
  - School of Computer Science, Wuhan University, Wuhan, 430072, Hubei, China.
- Bo Chen
  - Department of Chinese Language & Literature, Hubei University of Art & Science, Xiangyang, 24105, Hubei, China.
- Yafeng Ren
  - Guangdong Collaborative Innovation Center for Language Research & Services, Guangdong University of Foreign Studies, Guangzhou, 510420, Guangdong, China.
- Donghong Ji
  - School of Computer Science, Wuhan University, Wuhan, 430072, Hubei, China.
|
16
|
Bhasuran B, Murugesan G, Abdulkadhar S, Natarajan J. Stacked ensemble combined with fuzzy matching for biomedical named entity recognition of diseases. J Biomed Inform 2016; 64:1-9. [PMID: 27634494 DOI: 10.1016/j.jbi.2016.09.009] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.8] [Received: 04/15/2016] [Revised: 09/07/2016] [Accepted: 09/11/2016] [Indexed: 11/16/2022]
Abstract
Biomedical Named Entity Recognition (Bio-NER) is the crucial initial step in the information extraction process and a major research focus in biomedical text mining. In the past years, several models and methodologies have been proposed for the recognition of semantic types related to gene, protein, chemical, drug and other biologically relevant named entities. In this paper, we implemented a stacked ensemble approach combined with fuzzy matching for biomedical named entity recognition of disease names. The underlying concept of stacked generalization is to combine the outputs of base-level classifiers using a second-level meta-classifier in an ensemble. We used Conditional Random Field (CRF) as the underlying classification method, with a diverse set of features that are mostly domain-specific, orthographic, and morphological. In addition, we used fuzzy string matching to tag rare disease names from our in-house disease dictionary. For fuzzy matching, we incorporated two search algorithms, Rabin-Karp and Tuned Boyer-Moore. Our proposed approach shows promising results, with F-measures of 94.66%, 89.12%, 84.10%, and 76.71% when evaluated on the training and testing sets of both the NCBI disease and BioCreative V CDR corpora.
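For readers unfamiliar with Rabin-Karp, the sketch below shows the textbook exact-match form of the algorithm, which uses a rolling hash to locate a dictionary term in text. The abstract does not describe how the authors adapted it for fuzzy matching, so this is only the standard algorithm; the function name, hash parameters, and sample sentence are assumptions made for illustration.

```python
from typing import List


def rabin_karp_find(text: str, pattern: str,
                    base: int = 256, mod: int = 1_000_000_007) -> List[int]:
    """Return the start offsets of every exact occurrence of pattern in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    high = pow(base, m - 1, mod)        # weight of the window's leading character
    p_hash = t_hash = 0
    for i in range(m):
        p_hash = (p_hash * base + ord(pattern[i])) % mod
        t_hash = (t_hash * base + ord(text[i])) % mod
    hits = []
    for start in range(n - m + 1):
        # Compare the strings only when the hashes agree, to rule out collisions.
        if p_hash == t_hash and text[start:start + m] == pattern:
            hits.append(start)
        if start < n - m:
            # Roll the window: drop text[start], append text[start + m].
            t_hash = ((t_hash - ord(text[start]) * high) * base
                      + ord(text[start + m])) % mod
    return hits


# Hypothetical sentence and dictionary term.
sentence = "patients with cystic fibrosis or fibrosis of the lung"
offsets = rabin_karp_find(sentence, "fibrosis")
print(offsets)  # [21, 33]
```

The rolling update makes each window's hash an O(1) computation, which is what makes scanning a long text for many dictionary entries practical.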
Affiliation(s)
- Balu Bhasuran
  - DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore 641046, India.
- Gurusamy Murugesan
  - Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore 641046, India.
- Sabenabanu Abdulkadhar
  - Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore 641046, India.
- Jeyakumar Natarajan
  - DRDO-BU Center for Life Sciences, Bharathiar University Campus, Coimbatore 641046, India; Data Mining and Text Mining Laboratory, Department of Bioinformatics, Bharathiar University, Coimbatore 641046, India.
|