1
Dai HJ, Chen CC, Mir TH, Wang TY, Wang CK, Chang YC, Yu SJ, Shen YW, Huang CJ, Tsai CH, Wang CY, Chen HJ, Weng PS, Lin YX, Chen SW, Tsai MJ, Juang SF, Wu SY, Tsai WT, Huang MY, Huang CJ, Yang CJ, Liu PZ, Huang CW, Huang CY, Wang WYC, Chong IW, Yang YH. Integrating predictive coding and a user-centric interface for enhanced auditing and quality in cancer registry data. Comput Struct Biotechnol J 2024; 24:322-333. [PMID: 38690549 PMCID: PMC11059324 DOI: 10.1016/j.csbj.2024.04.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2023] [Revised: 04/02/2024] [Accepted: 04/03/2024] [Indexed: 05/02/2024] Open
Abstract
Data curation for a hospital-based cancer registry heavily relies on the labor-intensive manual abstraction process by cancer registrars to identify cancer-related information from free-text electronic health records. To streamline this process, a natural language processing system incorporating a hybrid of deep learning-based and rule-based approaches for identifying lung cancer registry-related concepts, along with a symbolic expert system that generates registry coding based on weighted rules, was developed. The system is integrated with the hospital information system at a medical center to provide cancer registrars with a patient journey visualization platform. The embedded system offers a comprehensive view of patient reports annotated with significant registry concepts to facilitate the manual coding process and elevate overall quality. Extensive evaluations, including comparisons with state-of-the-art methods, were conducted using a lung cancer dataset comprising 1428 patients from the medical center. The experimental results illustrate the effectiveness of the developed system, consistently achieving F1-scores of 0.85 to 1.00 across 30 coding items. Registrar feedback highlights the system's reliability as a tool for assisting and auditing the abstraction. By presenting key registry items along the timeline of a patient's reports with accurate code predictions, the system improves the quality of registrar outcomes and reduces the labor resources and time required for data abstraction. Our study highlights advancements in cancer registry coding practices, demonstrating that the proposed hybrid weighted neural-symbolic cancer registry system is reliable and efficient for assisting cancer registrars in the coding workflow and contributing to clinical outcomes.
Affiliation(s)
- Hong-Jie Dai
  - Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung 80778, Taiwan
  - National Institute of Cancer Research, National Health Research Institutes, Tainan 70456, Taiwan
  - School of Post-Baccalaureate Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
  - Center for Big Data Research, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
- Chien-Chang Chen
  - Electromagnetic Sensing Control and AI Computing System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung 80778, Taiwan
- Tatheer Hussain Mir
  - Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung 80778, Taiwan
  - National Institute of Cancer Research, National Health Research Institutes, Tainan 70456, Taiwan
- Ting-Yu Wang
  - Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung 80778, Taiwan
  - National Institute of Cancer Research, National Health Research Institutes, Tainan 70456, Taiwan
- Chen-Kai Wang
  - Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung 80778, Taiwan
  - Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan, ROC
  - Advanced Technology Laboratory, Chunghwa Telecom Laboratories, Taoyuan, Taiwan, ROC
- Ya-Chen Chang
  - National Institute of Cancer Research, National Health Research Institutes, Tainan 70456, Taiwan
- Shu-Jung Yu
  - Center for Big Data Research, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
- Yi-Wen Shen
  - Cancer Center, Kaohsiung Medical University Hospital, Kaohsiung 80708, Taiwan
- Cheng-Jiun Huang
  - Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung 80778, Taiwan
- Chia-Hsuan Tsai
  - School of Post-Baccalaureate Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
- Ching-Yun Wang
  - School of Post-Baccalaureate Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
- Hsiao-Jou Chen
  - School of Post-Baccalaureate Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
- Pei-Shan Weng
  - School of Post-Baccalaureate Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
- You-Xiang Lin
  - Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung 80778, Taiwan
- Sheng-Wei Chen
  - Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung 80778, Taiwan
- Ming-Ju Tsai
  - Division of Pulmonary and Critical Care Medicine, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
- Shian-Fei Juang
  - Department of Medical Information, Kaohsiung Medical University Hospital, Kaohsiung 80708, Taiwan
- Su-Ying Wu
  - Department of Medical Information, Kaohsiung Medical University Hospital, Kaohsiung 80708, Taiwan
- Wen-Tsung Tsai
  - Department of Medical Information, Kaohsiung Medical University Hospital, Kaohsiung 80708, Taiwan
- Ming-Yii Huang
  - Cancer Center, Kaohsiung Medical University Hospital, Kaohsiung 80708, Taiwan
  - Department of Radiation Oncology, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
- Chih-Jen Huang
  - Cancer Center, Kaohsiung Medical University Hospital, Kaohsiung 80708, Taiwan
- Chih-Jen Yang
  - School of Post-Baccalaureate Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
  - Division of Pulmonary and Critical Care Medicine, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
- Ping-Zun Liu
  - Health Promotion Administration, Ministry of Health and Welfare, Taipei 10341, Taiwan
- Chiao-Wen Huang
  - Health Promotion Administration, Ministry of Health and Welfare, Taipei 10341, Taiwan
- Chi-Yen Huang
  - Health Promotion Administration, Ministry of Health and Welfare, Taipei 10341, Taiwan
- Inn-Wen Chong
  - Division of Chest Medicine, Kaohsiung Medical University Hospital, Kaohsiung Medical University, Kaohsiung 80708, Taiwan
  - Department of Biological Science and Technology, National Yang Ming Chiao Tung University, Hsinchu 30010, Taiwan
- Yi-Hsin Yang
  - National Institute of Cancer Research, National Health Research Institutes, Tainan 70456, Taiwan
2
Han P, Li X, Zhang Z, Zhong Y, Gu L, Hua Y, Li X. CMCN: Chinese medical concept normalization using continual learning and knowledge-enhanced. Artif Intell Med 2024; 157:102965. [PMID: 39241561 DOI: 10.1016/j.artmed.2024.102965] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Revised: 05/10/2024] [Accepted: 08/19/2024] [Indexed: 09/09/2024]
Abstract
Medical Concept Normalization (MCN) is a crucial process for deep information extraction and natural language processing tasks, and it plays a vital role in biomedical research. Although English MCN has seen significant research progress, Chinese medical concept normalization (CMCN) remains insufficiently explored due to its complex syntactic structure and the paucity of Chinese medical semantic and ontology resources. In recent years, deep learning has been extensively applied across numerous natural language processing tasks, owing to its robust learning capabilities, adaptability, and transferability. It has proven to be well suited for intricate and specialized knowledge discovery research in the biomedical field. In this study, we conduct research on CMCN through the lens of deep learning. Specifically, our research introduces a model that leverages polymorphic semantic information and knowledge enhanced through multi-task learning and retains more important medical features through continual learning. As the cornerstone of CMCN, disease names are the main focus of this research. We evaluated various methodologies on a Chinese disease dataset that we built ourselves, achieving 76.12% on Accuracy@1, 87.20% on Accuracy@5 and 90.02% on Accuracy@10 with our best-performing model, GCBM-BSCL. This research not only advances the fields of knowledge mining and medical concept normalization but also enhances the integration and application of artificial intelligence in the medical and health field. We have published the source code and results on https://github.com/BearLiX/CMCN.
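The Accuracy@k figures quoted in the abstract above count a mention as correct when its gold concept appears among the top k ranked candidates. The following is a minimal sketch of that metric under this standard definition; the function name `accuracy_at_k` and the toy concept identifiers are illustrative assumptions and are not taken from the CMCN repository.

```python
from typing import Sequence


def accuracy_at_k(ranked: Sequence[Sequence[str]], gold: Sequence[str], k: int) -> float:
    """Fraction of mentions whose gold concept is among the top-k candidates.

    `ranked[i]` is the candidate list for mention i, best candidate first.
    """
    if len(ranked) != len(gold):
        raise ValueError("ranked and gold must have the same length")
    if not gold:
        return 0.0
    hits = sum(1 for candidates, target in zip(ranked, gold) if target in candidates[:k])
    return hits / len(gold)


# Toy, made-up rankings for three mentions (hypothetical concept IDs).
ranked = [["C1", "C2", "C3"], ["C5", "C4", "C6"], ["C7", "C8", "C9"]]
gold = ["C1", "C4", "C0"]

acc_at_1 = accuracy_at_k(ranked, gold, 1)  # only the first mention is a hit
acc_at_2 = accuracy_at_k(ranked, gold, 2)  # the second mention becomes a hit too
print(f"Accuracy@1={acc_at_1:.3f}  Accuracy@2={acc_at_2:.3f}")
```

Because a larger k can only add hits, Accuracy@k is non-decreasing in k, which matches the ordering of the three reported values.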
Affiliation(s)
- Pu Han
  - School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China; Jiangsu Provincial Key Laboratory of Data Engineering and Knowledge Service, Nanjing 210023, China
- Xiong Li
  - School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Zhanpeng Zhang
  - School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Yule Zhong
  - School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Liang Gu
  - School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Yingying Hua
  - School of Management, Nanjing University of Posts & Telecommunications, Nanjing 210003, China
- Xiaoyan Li
  - School of Basic Medical Sciences, Nanjing Medical University, Nanjing 210029, China
3
Li J, Li Y, Pan Y, Guo J, Sun Z, Li F, He Y, Tao C. Mapping vaccine names in clinical trials to vaccine ontology using cascaded fine-tuned domain-specific language models. J Biomed Semantics 2024; 15:14. [PMID: 39123237 PMCID: PMC11316402 DOI: 10.1186/s13326-024-00318-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2023] [Accepted: 07/31/2024] [Indexed: 08/12/2024] Open
Abstract
BACKGROUND Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects. ClinicalTrials.gov is a valuable repository of clinical trial information, but its vaccine data lack standardization, leading to challenges in automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance. RESULTS In this study, we developed a cascaded framework that capitalized on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to enhance the performance of domain-specific language models for automated mapping of VO from clinical trials. The Vaccine Ontology (VO) is a community-based ontology that was developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then performed pre-training on the PubMedBERT model, leading to the development of CTPubMedBERT. Subsequently, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further refinement was accomplished through fine-tuning using the Vaccine Ontology corpus and vaccine data from clinical trials, yielding the CTPubMedBERT + SAPBERT + VO model. Finally, we utilized a collection of pre-trained models, along with the weighted rule-based ensemble approach, to normalize the vaccine corpus and improve the accuracy of the process. The ranking process in concept normalization involves prioritizing and ordering potential concepts to identify the most suitable match for a given context. We conducted a ranking of the Top 10 concepts, and our experimental results demonstrate that our proposed cascaded framework consistently outperformed existing effective baselines on vaccine mapping, achieving 71.8% top-1 accuracy and 90.0% top-10 accuracy. CONCLUSION This study provides a detailed insight into a cascaded framework of fine-tuned domain-specific language models improving mapping of VO from clinical trials. By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance the mapping of VO from clinical trials.
Affiliation(s)
- Jianfu Li
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, 32224, USA
- Yiming Li
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Yuanyi Pan
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
- Jinjing Guo
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
- Zenan Sun
  - McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Fang Li
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, 32224, USA
- Yongqun He
  - Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, Center for Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, 48109, USA
- Cui Tao
  - Department of Artificial Intelligence and Informatics, Mayo Clinic, Jacksonville, FL, 32224, USA
4
Jonker RAA, Almeida T, Antunes R, Almeida JR, Matos S. Multi-head CRF classifier for biomedical multi-class named entity recognition on Spanish clinical notes. Database (Oxford) 2024; 2024:baae068. [PMID: 39083461 PMCID: PMC11290360 DOI: 10.1093/database/baae068] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2024] [Revised: 05/15/2024] [Accepted: 07/08/2024] [Indexed: 08/02/2024]
Abstract
The identification of medical concepts from clinical narratives is of great interest to the biomedical scientific community due to its importance for treatment improvement and drug development research. Biomedical named entity recognition (NER) in clinical texts is crucial for automated information extraction, facilitating patient record analysis, drug development, and medical research. Traditional approaches often focus on single-class NER tasks, yet recent advancements emphasize the necessity of addressing multi-class scenarios, particularly in complex biomedical domains. This paper proposes a strategy to integrate a multi-head conditional random field (CRF) classifier for multi-class NER in Spanish clinical documents. Our methodology overcomes overlapping entity instances of different types, a common challenge in traditional NER methodologies, by using a multi-head CRF model. This architecture enhances computational efficiency and ensures scalability for multi-class NER tasks, maintaining high performance. By combining four diverse datasets, SympTEMIST, MedProcNER, DisTEMIST, and PharmaCoNER, we expand the scope of NER to encompass five classes: symptoms, procedures, diseases, chemicals, and proteins. To the best of our knowledge, these datasets combined create the largest Spanish multi-class dataset focusing on biomedical entity recognition and linking for clinical notes, which is important to train a biomedical model in Spanish. We also provide entity linking to the multi-lingual Systematized Nomenclature of Medicine Clinical Terms (SNOMED CT) vocabulary, with the eventual goal of performing biomedical relation extraction. Through experimentation and evaluation on Spanish clinical documents, our strategy provides competitive results against single-class NER models. For NER, our system achieves a combined micro-averaged F1-score of 78.73, with clinical mentions normalized to SNOMED CT with an end-to-end F1-score of 54.51. The code to run our system is publicly available at https://github.com/ieeta-pt/Multi-Head-CRF. Database URL: https://github.com/ieeta-pt/Multi-Head-CRF.
Affiliation(s)
- Richard A A Jonker
  - IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Tiago Almeida
  - IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Rui Antunes
  - IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- João R Almeida
  - IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
- Sérgio Matos
  - IEETA/DETI, LASI, University of Aveiro, Campus Universitário de Santiago, Aveiro 3810-193, Portugal
5
Li J, Li Y, Pan Y, Guo J, Sun Z, Li F, He Y, Tao C. Mapping Vaccine Names in Clinical Trials to Vaccine Ontology using Cascaded Fine-Tuned Domain-Specific Language Models. RESEARCH SQUARE 2023:rs.3.rs-3362256. [PMID: 37841880 PMCID: PMC10571639 DOI: 10.21203/rs.3.rs-3362256/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2023]
Abstract
Background Vaccines have revolutionized public health by providing protection against infectious diseases. They stimulate the immune system and generate memory cells to defend against targeted diseases. Clinical trials evaluate vaccine performance, including dosage, administration routes, and potential side effects. ClinicalTrials.gov is a valuable repository of clinical trial information, but its vaccine data lack standardization, leading to challenges in automatic concept mapping, vaccine-related knowledge development, evidence-based decision-making, and vaccine surveillance. Results In this study, we developed a cascaded framework that capitalized on multiple domain knowledge sources, including clinical trials, the Unified Medical Language System (UMLS), and the Vaccine Ontology (VO), to enhance the performance of domain-specific language models for automated mapping of VO from clinical trials. The Vaccine Ontology (VO) is a community-based ontology that was developed to promote vaccine data standardization, integration, and computer-assisted reasoning. Our methodology involved extracting and annotating data from various sources. We then performed pre-training on the PubMedBERT model, leading to the development of CTPubMedBERT. Subsequently, we enhanced CTPubMedBERT by incorporating SAPBERT, which was pretrained using the UMLS, resulting in CTPubMedBERT + SAPBERT. Further refinement was accomplished through fine-tuning using the Vaccine Ontology corpus and vaccine data from clinical trials, yielding the CTPubMedBERT + SAPBERT + VO model. Finally, we utilized a collection of pre-trained models, along with the weighted rule-based ensemble approach, to normalize the vaccine corpus and improve the accuracy of the process. The ranking process in concept normalization involves prioritizing and ordering potential concepts to identify the most suitable match for a given context. We conducted a ranking of the Top 10 concepts, and our experimental results demonstrate that our proposed cascaded framework consistently outperformed existing effective baselines on vaccine mapping, achieving 71.8% top-1 accuracy and 90.0% top-10 accuracy. Conclusion This study provides a detailed insight into a cascaded framework of fine-tuned domain-specific language models improving mapping of VO from clinical trials. By effectively leveraging domain-specific information and applying weighted rule-based ensembles of different pre-trained BERT models, our framework can significantly enhance the mapping of VO from clinical trials.
Affiliation(s)
- Jianfu Li
  - The University of Texas Health Science Center at Houston
- Yiming Li
  - The University of Texas Health Science Center at Houston
- Zenan Sun
  - The University of Texas Health Science Center at Houston
- Fang Li
  - The University of Texas Health Science Center at Houston
- Cui Tao
  - The University of Texas Health Science Center at Houston
6
Xu D, Miller T. A simple neural vector space model for medical concept normalization using concept embeddings. J Biomed Inform 2022; 130:104080. [PMID: 35472514 PMCID: PMC9351985 DOI: 10.1016/j.jbi.2022.104080] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/19/2022] [Revised: 04/15/2022] [Accepted: 04/19/2022] [Indexed: 11/24/2022]
Abstract
OBJECTIVE Medical concept normalization (MCN), the task of linking textual mentions to concepts in an ontology, provides a solution to unify different ways of referring to the same concept. In this paper, we present a simple neural MCN model that takes mentions as input and directly predicts concepts. MATERIALS AND METHODS We evaluate our proposed model on clinical datasets from the ShARe/CLEF eHealth 2013 shared task and the 2019 n2c2/OHNLP shared task track 3. Our neural MCN model consists of an encoder and a normalized temperature-scaled softmax (NT-softmax) layer that maximizes the cosine similarity score of matching the mention to the correct concept. We adopt SAPBERT as the encoder and initialize the weights in the NT-softmax layer with pre-computed concept embeddings from SAPBERT. RESULTS Our proposed neural model achieves competitive performance on ShARe/CLEF 2013 and establishes a new state-of-the-art on 2019-n2c2-MCN. Yet this model is simpler than most prior work: it requires no complex pipelines, no hand-crafted rules, and no preprocessing, making it simpler to apply in new settings. DISCUSSION Analyses of our proposed model show that the NT-softmax is better than the conventional softmax on the MCN task, and both the CUI-less threshold parameter and the initialization of the weight vectors in the NT-softmax layer contribute to the improvements. CONCLUSION We propose a simple neural model for clinical MCN, a one-step approach with simpler inference and more effective performance than prior work. Our analyses demonstrate that future work on MCN may require more effort on unseen concepts.
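To make the NT-softmax idea in the abstract above concrete, here is a minimal sketch under stated assumptions: a mention vector and concept vectors are L2-normalized, their cosine similarities are divided by a temperature, and a softmax is taken over the concepts. The threshold rule for returning "CUI-less", the parameter values, and all names (`nt_softmax`, `predict`, `cuiless_threshold`) are hypothetical illustrations, not the authors' implementation, and the toy 2-dimensional vectors stand in for SAPBERT embeddings.

```python
import math
from typing import Dict, List, Sequence, Tuple


def _unit(vec: Sequence[float]) -> List[float]:
    """L2-normalize a non-zero vector."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def nt_softmax(mention: Sequence[float],
               concepts: Sequence[Sequence[float]],
               temperature: float = 0.1) -> Tuple[List[float], List[float]]:
    """Return (cosine similarities, temperature-scaled softmax probabilities)."""
    m = _unit(mention)
    sims = [sum(a * b for a, b in zip(m, _unit(c))) for c in concepts]
    logits = [s / temperature for s in sims]
    top = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - top) for l in logits]
    total = sum(exps)
    return sims, [e / total for e in exps]


def predict(mention: Sequence[float],
            concept_table: Dict[str, Sequence[float]],
            temperature: float = 0.1,
            cuiless_threshold: float = 0.5) -> str:
    """Pick the most similar concept, or 'CUI-less' below the assumed threshold."""
    ids = list(concept_table)
    sims, _ = nt_softmax(mention, [concept_table[i] for i in ids], temperature)
    best = max(range(len(ids)), key=lambda i: sims[i])
    return ids[best] if sims[best] >= cuiless_threshold else "CUI-less"


# Toy concept embeddings (hypothetical identifiers and values).
concept_table = {"C_A": [1.0, 0.0], "C_B": [0.0, 1.0]}
close_mention = predict([0.9, 0.1], concept_table)
far_mention = predict([-1.0, -1.0], concept_table)
print(close_mention, far_mention)
```

Lowering the temperature sharpens the distribution, so small differences in cosine similarity translate into larger differences in probability.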
Affiliation(s)
- Dongfang Xu
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA; Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Timothy Miller
  - Computational Health Informatics Program, Boston Children's Hospital, Boston, MA, USA; Department of Pediatrics, Harvard Medical School, Boston, MA, USA
7
Delayed Combination of Feature Embedding in Bidirectional LSTM CRF for NER. APPLIED SCIENCES-BASEL 2020. [DOI: 10.3390/app10217557] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Indexed: 11/16/2022]
Abstract
Named Entity Recognition (NER) plays a vital role in natural language processing (NLP). Currently, deep neural network models have achieved significant success in NER. Recent advances in NER systems have introduced various feature selections to identify appropriate representations and handle out-of-vocabulary (OOV) words. After the features are selected, they are all concatenated at the embedding layer before being fed into a model to label the input sequences. However, when the features are concatenated, information collisions may occur, which can limit or degrade performance. To overcome the information collisions, some works tried to directly connect some features to later layers; we call this the delayed combination and show its effectiveness by comparing it to the early combination. As feature encodings for input, we selected the character-level Convolutional Neural Network (CNN) or Long Short-Term Memory (LSTM) word encoding, the pre-trained word embedding, and the contextual word embedding, and additionally designed a CNN-based sentence encoding using a dictionary. These feature encodings are combined at an early or delayed position of the bidirectional LSTM Conditional Random Field (CRF) model according to each feature’s characteristics. We evaluated the performance of this model on the CoNLL 2003 and OntoNotes 5.0 datasets using the F1 score and compared the delayed combination model with our own implementation of the early combination as well as with previous works. This comparison convinces us that our delayed combination is more effective than the early one and also highly competitive.
8
Xu D, Gopale M, Zhang J, Brown K, Begoli E, Bethard S. Unified Medical Language System resources improve sieve-based generation and Bidirectional Encoder Representations from Transformers (BERT)-based ranking for concept normalization. J Am Med Inform Assoc 2020; 27:1510-1519. [PMID: 32719838 PMCID: PMC7566510 DOI: 10.1093/jamia/ocaa080] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 02/09/2020] [Revised: 03/25/2020] [Accepted: 04/27/2020] [Indexed: 12/02/2022] Open
Abstract
OBJECTIVE Concept normalization, the task of linking phrases in text to concepts in an ontology, is useful for many downstream tasks including relation extraction, information retrieval, etc. We present a generate-and-rank concept normalization system based on our participation in the 2019 National NLP Clinical Challenges Shared Task Track 3 Concept Normalization. MATERIALS AND METHODS The shared task provided 13 609 concept mentions drawn from 100 discharge summaries. We first design a sieve-based system that uses Lucene indices over the training data, Unified Medical Language System (UMLS) preferred terms, and UMLS synonyms to generate a list of possible concepts for each mention. We then design a listwise classifier based on the BERT (Bidirectional Encoder Representations from Transformers) neural network to rank the candidate concepts, integrating UMLS semantic types through a regularizer. RESULTS Our generate-and-rank system was third of 33 in the competition, outperforming the candidate generator alone (81.66% vs 79.44%) and the previous state of the art (76.35%). During postevaluation, the model's accuracy was increased to 83.56% via improvements to how training data are generated from UMLS and incorporation of our UMLS semantic type regularizer. DISCUSSION Analysis of the model shows that prioritizing UMLS preferred terms yields better performance, that the UMLS semantic type regularizer results in qualitatively better concept predictions, and that the model performs well even on concepts not seen during training. CONCLUSIONS Our generate-and-rank framework for UMLS concept normalization integrates key UMLS features like preferred terms and semantic types with a neural network-based ranking model to accurately link phrases in text to UMLS concepts.
Affiliation(s)
- Dongfang Xu
  - School of Information, University of Arizona, Tucson, Arizona, USA
- Manoj Gopale
  - Department of Electrical and Computer Engineering, University of Arizona, Tucson, Arizona, USA
- Jiacheng Zhang
  - Department of Computer Science, University of Arizona, Tucson, Arizona, USA
- Kris Brown
  - National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
- Edmon Begoli
  - National Center for Computational Sciences, Oak Ridge National Laboratory, Oak Ridge, Tennessee, USA
- Steven Bethard
  - School of Information, University of Arizona, Tucson, Arizona, USA
9
Dai HJ, Wang CK, Chang NW, Huang MS, Jonnagaddala J, Wang FD, Hsu WL. Statistical principle-based approach for recognizing and normalizing microRNAs described in scientific literature. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2019; 2019:5365313. [PMID: 30809637 PMCID: PMC6391575 DOI: 10.1093/database/baz030] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 02/28/2018] [Revised: 02/01/2019] [Accepted: 02/06/2019] [Indexed: 01/08/2023]
Abstract
The detection of microRNA (miRNA) mentions in scientific literature helps researchers find relevant and appropriate literature based on queries formulated using miRNA information. Considering that most published biological studies elaborate on signal transduction pathways or genetic regulatory information in the form of figure captions, the extraction of miRNA from both the main content and figure captions of a manuscript is useful in aggregate analysis and comparative analysis of the studies published. In this study, we present a statistical principle-based miRNA recognition and normalization method to identify miRNAs and link them to the identifiers in the Rfam database. As one of the core components in the text mining pipeline of the database miRTarBase, the proposed method combined the advantages of previous works relying on pattern, dictionary and supervised learning and provided an integrated solution for the problem of miRNA identification. Furthermore, the knowledge learned from the training data was organized in a human-interpretable manner to understand the reason why the system considers a span of text as a miRNA mention, and the represented knowledge can be further complemented by domain experts. We studied the ambiguity level of miRNA nomenclature to connect the miRNA mentions to the Rfam database and evaluated the performance of our approach on two datasets: the BioCreative VI Bio-ID corpus and the miRNA interaction corpus, extending the latter corpus with additional Rfam normalization information. Our study highlights and also proposes a better understanding of the challenges associated with miRNA identification and normalization in scientific literature and the research gap that needs to be further explored in prospective studies.
Affiliation(s)
- Hong-Jie Dai
  - Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan, ROC
- Chen-Kai Wang
  - Big Data Laboratories, Chunghwa Telecom Co., Taoyuan, Taiwan, ROC
- Nai-Wen Chang
  - Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Ming-Siang Huang
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
- Jitendra Jonnagaddala
  - School of Public Health and Community Medicine, University of New South Wales, Sydney, Australia
- Feng-Duo Wang
  - Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan
- Wen-Lian Hsu
  - Institute of Information Science, Academia Sinica, Taipei, Taiwan
|
10
|
Couto FM, Lamurias A. MER: a shell script and annotation server for minimal named entity recognition and linking. J Cheminform 2018; 10:58. [PMID: 30519990 PMCID: PMC6755715 DOI: 10.1186/s13321-018-0312-9] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Received: 07/24/2018] [Accepted: 11/30/2018] [Indexed: 01/17/2023] Open
Abstract
Named-entity recognition aims at identifying the fragments of text that mention entities of interest, which can afterwards be linked to a knowledge base where those entities are described. This manuscript presents our minimal named-entity recognition and linking tool (MER), designed with flexibility, autonomy and efficiency in mind. To annotate a given text, MER only requires: (1) a lexicon (text file) with the list of terms representing the entities of interest; (2) optionally, a tab-separated values file with a link for each term; and (3) a Unix shell. Alternatively, the user can provide an ontology from which MER will automatically generate the lexicon and links files. The efficiency of MER derives from exploiting the high performance and reliability of the text processing command-line tools grep and awk, and from a novel inverted recognition technique. MER was deployed in a cloud infrastructure using multiple virtual machines to work as an annotation server and participate in the Technical Interoperability and Performance of annotation Servers task of BioCreative V.5. The results show that our solution processed each document (text retrieval and annotation) in less than 3 s on average without using any type of cache. MER was also compared to a state-of-the-art dictionary lookup solution, obtaining competitive results not only in computational performance but also in precision and recall. MER is publicly available in a GitHub repository (https://github.com/lasigeBioTM/MER) and through a RESTful Web service (http://labs.fc.ul.pt/mer/).
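The lexicon-plus-links idea described in this abstract can be illustrated with a minimal Python sketch. This is not MER itself, which is a shell script built on grep and awk with its own inverted recognition technique; the function name `annotate`, the sample sentence, and the identifier `ID:0001` are hypothetical and chosen only for illustration.

```python
import re


def annotate(text, lexicon, links=None):
    """Return (start, end, matched_text, link) for each lexicon term found in text.

    lexicon: iterable of terms (the "text file with the list of terms").
    links:   optional dict mapping a lower-cased term to a link or identifier
             (the optional "tab-separated values file").
    """
    links = links or {}
    annotations = []
    for term in lexicon:
        # Case-insensitive match that must not be embedded in a longer word.
        pattern = re.compile(r"(?<!\w)" + re.escape(term) + r"(?!\w)", re.IGNORECASE)
        for match in pattern.finditer(text):
            annotations.append(
                (match.start(), match.end(), match.group(0), links.get(term.lower()))
            )
    # Order annotations by their position in the text.
    return sorted(annotations, key=lambda a: (a[0], a[1]))


if __name__ == "__main__":
    sentence = "Caffeine and nicotine are alkaloids."
    # Hypothetical lexicon and links, for illustration only.
    for annotation in annotate(sentence, ["caffeine", "nicotine"], {"caffeine": "ID:0001"}):
        print(annotation)
```

Scanning the text once per term is simple but slow for large lexicons; the sketch shows only the input and output shape of dictionary-based recognition and linking, not the performance characteristics reported for MER.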
Affiliation(s)
- Francisco M. Couto
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, 1749 016 Lisbon, Portugal
- Andre Lamurias
  - LASIGE, Faculdade de Ciências, Universidade de Lisboa, 1749 016 Lisbon, Portugal
  - Faculty of Sciences, BioISI - Biosystems and Integrative Sciences Institute, University of Lisboa, Campo Grande, C8 bdg, 1749 016 Lisbon, Portugal
|
11
|
Abstract
Background: Clinical notes such as discharge summaries have a semi-structured or unstructured format. These documents contain information about diseases, treatments, drugs, etc. Extracting meaningful information from them is challenging because of their narrative format. In this context, we aimed to compare the automatic extraction capacity of medical entities using two tools: MetaMap and cTAKES. Methods: We worked with i2b2 (Informatics for Integrating Biology to the Bedside) Obesity Challenge data. Two experiments were constructed. In the first one, only one UMLS concept related to the annotated diseases was extracted. In the second, some UMLS concepts were aggregated. Results: Results were evaluated against manually annotated medical entities. Results improved with the aggregation process. MetaMap had an average of 0.88 in recall, 0.89 in precision, and 0.88 in F-score. With cTAKES, the average recall, precision and F-score were 0.91, 0.89, and 0.89, respectively. Conclusions: The aggregation of concepts (with similar and different semantic types) was shown to be a good strategy for improving the extraction of medical entities, and automatic aggregation could be considered in future works.
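For readers unfamiliar with the metrics quoted above, the following Python sketch shows how precision, recall and F-score are conventionally computed from a set of gold annotations and a set of extracted entities. The function name and the toy disease sets are assumptions made for illustration; they are not data from the study, and the study's per-disease averaging is not reproduced here.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall and F1 from two collections of entity labels."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Toy sets, invented for illustration only.
    gold = {"obesity", "diabetes", "asthma", "gout"}
    predicted = {"obesity", "diabetes", "asthma", "hypertension"}
    print(precision_recall_f1(gold, predicted))
```

In the toy case, three of the four extracted entities are correct and three of the four gold entities are found, so all three metrics equal 0.75.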
Affiliation(s)
- Ruth Reátegui
  - École de technologie supérieure, Montreal, Canada
  - Universidad Técnica Particular de Loja, Loja, Ecuador
- Sylvie Ratté
  - École de technologie supérieure, Montreal, Canada
|
12
|
Dai HJ, Su ECY, Uddin M, Jonnagaddala J, Wu CS, Syed-Abdul S. Exploring associations of clinical and social parameters with violent behaviors among psychiatric patients. J Biomed Inform 2017; 75S:S149-S159. [PMID: 28822857 DOI: 10.1016/j.jbi.2017.08.009] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 02/02/2017] [Revised: 07/20/2017] [Accepted: 08/14/2017] [Indexed: 02/07/2023]
Abstract
Evidence has revealed interesting associations of clinical and social parameters with violent behaviors of patients with psychiatric disorders. Men are more violent preceding and during hospitalization, whereas women are more violent than men throughout the 3 days following a hospital admission. It has also been proven that mental disorders may be a consistent risk factor for the occurrence of violence. In order to better understand violent behaviors of patients with psychiatric disorders, it is important to investigate both the clinical symptoms and psychosocial factors that accompany violence in these patients. In this study, we utilized a dataset released by the Partners Healthcare and Neuropsychiatric Genome-scale and RDoC Individualized Domains project of Harvard Medical School to develop a unique text mining pipeline that processes unstructured clinical data in order to recognize clinical and social parameters such as age, gender, history of alcohol use, and violent behaviors, and explored the associations between these parameters and violent behaviors of patients with psychiatric disorders. The aim of our work was to demonstrate the feasibility of mining factors that are strongly associated with violent behaviors among psychiatric patients from unstructured psychiatric evaluation records using clinical text mining. Experimental results showed that stimulants, followed by a family history of violent behavior, suicidal behaviors, and financial stress, were strongly associated with violent behaviors. Key aspects explicated in this paper include employing our text mining pipeline to extract clinical and social factors linked with violent behaviors, generating association rules to uncover possible associations between these factors and violent behaviors, and lastly the ranking of top rules associated with violent behaviors using statistical analysis and interpretation.
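Association rules of the kind mentioned in this abstract are usually scored with support, confidence and lift. The Python sketch below computes these three standard measures for a single rule; it is an assumption that measures like these are relevant here, since the abstract does not state which statistics were used for ranking. The function name, the item names (`factor_a`, `factor_b`, `outcome`) and the four toy records are hypothetical and do not come from the study.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent.

    records: list of item collections, one per record (e.g. factors extracted
             from one evaluation note).
    """
    if not records:
        raise ValueError("records must not be empty")
    antecedent, consequent = set(antecedent), set(consequent)
    total = len(records)
    with_antecedent = sum(1 for r in records if antecedent <= set(r))
    with_consequent = sum(1 for r in records if consequent <= set(r))
    with_both = sum(1 for r in records if (antecedent | consequent) <= set(r))

    support = with_both / total
    confidence = with_both / with_antecedent if with_antecedent else 0.0
    # Lift compares the rule's confidence with the consequent's base rate.
    lift = confidence / (with_consequent / total) if with_consequent else 0.0
    return support, confidence, lift


if __name__ == "__main__":
    # Toy records, invented for illustration only.
    toy_records = [
        {"factor_a", "outcome"},
        {"factor_a", "outcome"},
        {"factor_a"},
        {"factor_b"},
    ]
    print(rule_metrics(toy_records, {"factor_a"}, {"outcome"}))
```

A lift above 1 indicates that the consequent occurs more often with the antecedent than its overall frequency would suggest, which is one common basis for ranking candidate rules.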
Affiliation(s)
- Hong-Jie Dai
  - Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan
  - Interdisciplinary Program of Green and Information Technology, National Taitung University, Taitung, Taiwan
- Emily Chia-Yu Su
  - Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan
- Mohy Uddin
  - King Abdullah International Medical Research Center, King Saud Bin Abdulaziz University for Health Sciences, Publication Office, King Abdulaziz Medical City, Ministry of National Guard Health Affairs, Riyadh, Saudi Arabia
- Jitendra Jonnagaddala
  - School of Public Health and Community Medicine, UNSW Sydney, Australia
  - Prince of Wales Clinical School, UNSW Sydney, Australia
- Chi-Shin Wu
  - Department of Psychiatry, National Taiwan University Hospital and College of Medicine, National Taiwan University, Taipei, Taiwan
- Shabbir Syed-Abdul
  - Graduate Institute of Biomedical Informatics, Taipei Medical University, Taipei, Taiwan
|