1
|
Wang M, Vijayaraghavan A, Beck T, Posma JM. Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition. J Proteome Res 2024. [PMID: 38733346 DOI: 10.1021/acs.jproteome.3c00367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/13/2024]
Abstract
Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86) with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Codes for automated annotation and model generation are available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
Collapse
Affiliation(s)
- Meiqi Wang
- Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K
| | - Avish Vijayaraghavan
- Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K
- UKRI Centre for Doctoral Training in AI for Healthcare, Department of Computing, Imperial College London, London SW7 2AZ, U.K
| | - Tim Beck
- School of Medicine, University of Nottingham, Biodiscovery Institute, Nottingham NG7 2RD, U.K
- Health Data Research (HDR) U.K., London NW1 2BE, U.K
| | - Joram M Posma
- Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K
- Health Data Research (HDR) U.K., London NW1 2BE, U.K
| |
Collapse
|
2
|
Zhang G, Zhou Y, Hu Y, Xu H, Weng C, Peng Y. A span-based model for extracting overlapping PICO entities from randomized controlled trial publications. J Am Med Inform Assoc 2024; 31:1163-1171. [PMID: 38471120 PMCID: PMC11031223 DOI: 10.1093/jamia/ocae065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2023] [Revised: 02/20/2024] [Accepted: 03/11/2024] [Indexed: 03/14/2024] Open
Abstract
OBJECTIVES Extracting PICO (Populations, Interventions, Comparison, and Outcomes) entities is fundamental to evidence retrieval. We present a novel method, PICOX, to extract overlapping PICO entities. MATERIALS AND METHODS PICOX first identifies entities by assessing whether a word marks the beginning or conclusion of an entity. Then, it uses a multi-label classifier to assign one or more PICO labels to a span candidate. PICOX was evaluated using 1 of the best-performing baselines, EBM-NLP, and 3 more datasets, ie, PICO-Corpus and randomized controlled trial publications on Alzheimer's Disease (AD) or COVID-19, using entity-level precision, recall, and F1 scores. RESULTS PICOX achieved superior precision, recall, and F1 scores across the board, with the micro F1 score improving from 45.05 to 50.87 (P ≪.01). On the PICO-Corpus, PICOX obtained higher recall and F1 scores than the baseline and improved the micro recall score from 56.66 to 67.33. On the COVID-19 dataset, PICOX also outperformed the baseline and improved the micro F1 score from 77.10 to 80.32. On the AD dataset, PICOX demonstrated comparable F1 scores with higher precision when compared to the baseline. CONCLUSION PICOX excels in identifying overlapping entities and consistently surpasses a leading baseline across multiple datasets. Ablation studies reveal that its data augmentation strategy effectively minimizes false positives and improves precision.
Collapse
Affiliation(s)
- Gongbo Zhang
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
| | - Yiliang Zhou
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
| | - Yan Hu
- McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Hua Xu
- Section of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, CT 06510, United States
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
| | - Yifan Peng
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
| |
Collapse
|
3
|
Gérardin C, Xiong Y, Wajsbürt P, Carrat F, Tannier X. Impact of Translation on Biomedical Information Extraction: Experiment on Real-Life Clinical Notes. JMIR Med Inform 2024; 12:e49607. [PMID: 38596859 DOI: 10.2196/49607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/03/2023] [Revised: 01/07/2024] [Accepted: 01/10/2024] [Indexed: 03/03/2024] Open
Abstract
Background Biomedical natural language processing tasks are best performed with English models, and translation tools have undergone major improvements. On the other hand, building annotated biomedical data sets remains a challenge. Objective The aim of our study is to determine whether the use of English tools to extract and normalize French medical concepts based on translations provides comparable performance to that of French models trained on a set of annotated French clinical notes. Methods We compared 2 methods: 1 involving French-language models and 1 involving English-language models. For the native French method, the named entity recognition and normalization steps were performed separately. For the translated English method, after the first translation step, we compared a 2-step method and a terminology-oriented method that performs extraction and normalization at the same time. We used French, English, and bilingual annotated data sets to evaluate all stages (named entity recognition, normalization, and translation) of our algorithms. Results The native French method outperformed the translated English method, with an overall F1-score of 0.51 (95% CI 0.47-0.55), compared with 0.39 (95% CI 0.34-0.44) and 0.38 (95% CI 0.36-0.40) for the 2 English methods tested. Conclusions Despite recent improvements in translation models, there is a significant difference in performance between the 2 approaches in favor of the native French method, which is more effective on French medical texts, even with few annotated documents.
Collapse
Affiliation(s)
- Christel Gérardin
- Institut Pierre Louis d'Epidémiologie et de Santé Publique, Sorbonne Université, Institut National de la Santé et de la Recherche Médicale, Paris, France
| | - Yuhan Xiong
- Institut Pierre Louis d'Epidémiologie et de Santé Publique, Sorbonne Université, Institut National de la Santé et de la Recherche Médicale, Paris, France
- Shanghai Jiaotong University, Shanghai, China
| | - Perceval Wajsbürt
- Innovation and Data Unit, Assistance Publique Hôpitaux de Paris, Paris, France
| | - Fabrice Carrat
- Institut Pierre Louis d'Epidémiologie et de Santé Publique, Sorbonne Université, Institut National de la Santé et de la Recherche Médicale, Paris, France
- Department of Public Health, Assistance Publique Hôpitaux de Paris, Hôpital Saint-Antoine, Paris, France
| | - Xavier Tannier
- 5, Sorbonne Université, Institut National de la Santé et de la Recherche Médicale, Université Sorbonne Paris-Nord, Laboratoire d'Informatique Médicale et de Connaissance en e-Santé, Paris, France
| |
Collapse
|
4
|
Iscoe M, Socrates V, Gilson A, Chi L, Li H, Huang T, Kearns T, Perkins R, Khandjian L, Taylor RA. Identifying signs and symptoms of urinary tract infection from emergency department clinical notes using large language models. Acad Emerg Med 2024. [PMID: 38567658 DOI: 10.1111/acem.14883] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/12/2023] [Revised: 01/24/2024] [Accepted: 01/24/2024] [Indexed: 04/04/2024]
Abstract
BACKGROUND Natural language processing (NLP) tools including recently developed large language models (LLMs) have myriad potential applications in medical care and research, including the efficient labeling and classification of unstructured text such as electronic health record (EHR) notes. This opens the door to large-scale projects that rely on variables that are not typically recorded in a structured form, such as patient signs and symptoms. OBJECTIVES This study is designed to acquaint the emergency medicine research community with the foundational elements of NLP, highlighting essential terminology, annotation methodologies, and the intricacies involved in training and evaluating NLP models. Symptom characterization is critical to urinary tract infection (UTI) diagnosis, but identification of symptoms from the EHR has historically been challenging, limiting large-scale research, public health surveillance, and EHR-based clinical decision support. We therefore developed and compared two NLP models to identify UTI symptoms from unstructured emergency department (ED) notes. METHODS The study population consisted of patients aged ≥ 18 who presented to an ED in a northeastern U.S. health system between June 2013 and August 2021 and had a urinalysis performed. We annotated a random subset of 1250 ED clinician notes from these visits for a list of 17 UTI symptoms. We then developed two task-specific LLMs to perform the task of named entity recognition: a convolutional neural network-based model (SpaCy) and a transformer-based model designed to process longer documents (Clinical Longformer). Models were trained on 1000 notes and tested on a holdout set of 250 notes. We compared model performance (precision, recall, F1 measure) at identifying the presence or absence of UTI symptoms at the note level. RESULTS A total of 8135 entities were identified in 1250 notes; 83.6% of notes included at least one entity. Overall F1 measure for note-level symptom identification weighted by entity frequency was 0.84 for the SpaCy model and 0.88 for the Longformer model. F1 measure for identifying presence or absence of any UTI symptom in a clinical note was 0.96 (232/250 correctly classified) for the SpaCy model and 0.98 (240/250 correctly classified) for the Longformer model. CONCLUSIONS The study demonstrated the utility of LLMs and transformer-based models in particular for extracting UTI symptoms from unstructured ED clinical notes; models were highly accurate for detecting the presence or absence of any UTI symptom on the note level, with variable performance for individual symptoms.
Collapse
Affiliation(s)
- Mark Iscoe
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
| | - Vimig Socrates
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
| | - Aidan Gilson
- Yale School of Medicine, New Haven, Connecticut, USA
| | - Ling Chi
- Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
| | - Huan Li
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, USA
| | - Thomas Huang
- Yale School of Medicine, New Haven, Connecticut, USA
| | - Thomas Kearns
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
| | - Rachelle Perkins
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
| | - Laura Khandjian
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
| | - R Andrew Taylor
- Department of Emergency Medicine, Yale School of Medicine, New Haven, Connecticut, USA
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, Connecticut, USA
| |
Collapse
|
5
|
Li Z, Wei Q, Huang LC, Li J, Hu Y, Chuang YS, He J, Das A, Keloth VK, Yang Y, Diala CS, Roberts KE, Tao C, Jiang X, Zheng WJ, Xu H. Ensemble pretrained language models to extract biomedical knowledge from literature. J Am Med Inform Assoc 2024:ocae061. [PMID: 38520725 DOI: 10.1093/jamia/ocae061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2023] [Revised: 02/14/2024] [Accepted: 03/12/2024] [Indexed: 03/25/2024] Open
Abstract
OBJECTIVES The rapid expansion of biomedical literature necessitates automated techniques to discern relationships between biomedical concepts from extensive free text. Such techniques facilitate the development of detailed knowledge bases and highlight research deficiencies. The LitCoin Natural Language Processing (NLP) challenge, organized by the National Center for Advancing Translational Science, aims to evaluate such potential and provides a manually annotated corpus for methodology development and benchmarking. MATERIALS AND METHODS For the named entity recognition (NER) task, we utilized ensemble learning to merge predictions from three domain-specific models, namely BioBERT, PubMedBERT, and BioM-ELECTRA, devised a rule-driven detection method for cell line and taxonomy names and annotated 70 more abstracts as additional corpus. We further finetuned the T0pp model, with 11 billion parameters, to boost the performance on relation extraction and leveraged entites' location information (eg, title, background) to enhance novelty prediction performance in relation extraction (RE). RESULTS Our pioneering NLP system designed for this challenge secured first place in Phase I-NER and second place in Phase II-relation extraction and novelty prediction, outpacing over 200 teams. We tested OpenAI ChatGPT 3.5 and ChatGPT 4 in a Zero-Shot setting using the same test set, revealing that our finetuned model considerably surpasses these broad-spectrum large language models. DISCUSSION AND CONCLUSION Our outcomes depict a robust NLP system excelling in NER and RE across various biomedical entities, emphasizing that task-specific models remain superior to generic large ones. Such insights are valuable for endeavors like knowledge graph development and hypothesis formulation in biomedical research.
Collapse
Affiliation(s)
- Zhao Li
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Qiang Wei
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Liang-Chin Huang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Jianfu Li
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Yan Hu
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Yao-Shun Chuang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Jianping He
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Avisha Das
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Vipina Kuttichi Keloth
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States
| | - Yuntao Yang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Chiamaka S Diala
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Kirk E Roberts
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Cui Tao
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Xiaoqian Jiang
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - W Jim Zheng
- McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
| | - Hua Xu
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States
| |
Collapse
|
6
|
Truhn D, Loeffler CM, Müller-Franzes G, Nebelung S, Hewitt KJ, Brandner S, Bressem KK, Foersch S, Kather JN. Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4). J Pathol 2024; 262:310-319. [PMID: 38098169 DOI: 10.1002/path.6232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2023] [Revised: 09/16/2023] [Accepted: 11/03/2023] [Indexed: 02/06/2024]
Abstract
Deep learning applied to whole-slide histopathology images (WSIs) has the potential to enhance precision oncology and alleviate the workload of experts. However, developing these models necessitates large amounts of data with ground truth labels, which can be both time-consuming and expensive to obtain. Pathology reports are typically unstructured or poorly structured texts, and efforts to implement structured reporting templates have been unsuccessful, as these efforts lead to perceived extra workload. In this study, we hypothesised that large language models (LLMs), such as the generative pre-trained transformer 4 (GPT-4), can extract structured data from unstructured plain language reports using a zero-shot approach without requiring any re-training. We tested this hypothesis by utilising GPT-4 to extract information from histopathological reports, focusing on two extensive sets of pathology reports for colorectal cancer and glioblastoma. We found a high concordance between LLM-generated structured data and human-generated structured data. Consequently, LLMs could potentially be employed routinely to extract ground truth data for machine learning from unstructured pathology reports in the future. © 2023 The Authors. The Journal of Pathology published by John Wiley & Sons Ltd on behalf of The Pathological Society of Great Britain and Ireland.
Collapse
Affiliation(s)
- Daniel Truhn
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Chiara Ml Loeffler
- Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
- Department of Medicine I, University Hospital Dresden, Dresden, Germany
- Department of Medicine III, University Hospital RWTH Aachen, Aachen, Germany
| | - Gustav Müller-Franzes
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Sven Nebelung
- Department of Diagnostic and Interventional Radiology, University Hospital RWTH Aachen, Aachen, Germany
| | - Katherine J Hewitt
- Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
- Department of Medicine III, University Hospital RWTH Aachen, Aachen, Germany
| | - Sebastian Brandner
- Department of Neurosurgery, University Hospital Erlangen, Erlangen, Germany
| | - Keno K Bressem
- Department of Radiology, Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
| | - Sebastian Foersch
- Institute of Pathology, University Medical Center Mainz, Mainz, Germany
| | - Jakob Nikolas Kather
- Else Kroener Fresenius Center for Digital Health, Technical University Dresden, Dresden, Germany
- Department of Medicine I, University Hospital Dresden, Dresden, Germany
- Medical Oncology, National Center for Tumor Diseases (NCT), University Hospital Heidelberg, Heidelberg, Germany
- Pathology and Data Analytics, Leeds Institute of Medical Research at St James's, University of Leeds, Leeds, UK
| |
Collapse
|
7
|
Yao LFL, Liew K, Wakamiya S, Aramaki E. Extracting Spatio-Temporal Trends in Medical Research Prioritization Through Natural Language Processing of Case Report Abstracts. Stud Health Technol Inform 2024; 310:634-638. [PMID: 38269886 DOI: 10.3233/shti231042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2024]
Abstract
Medical research prioritization is an important aspect of decision-making by researchers and relevant stakeholders. The ever-increasing availability of technology and data has opened doors to new discoveries and new questions. This makes it difficult for researchers and relevant stakeholders to make well-informed decisions about the research areas they want to support and the nations they should look for collaborations. It is, therefore, useful to look at the spatio-temporal trends of medical research prioritization to gain insight into popular and neglected areas of research as well as the allocation of prioritization of each nation. In this study, we develop a system that collects, classifies, and summarizes case report abstracts according to the location, time, and disease category of the report. The additional classifications allow us to visualize and monitor the trends in medical research prioritization by location, time, and disease category.
Collapse
|
8
|
Neuraz A, Lerner I, Birot O, Arias C, Han L, Bonzel CL, Cai T, Huynh KT, Coulet A. TAXN: Translate Align Extract Normalize, a Multilingual Extraction Tool for Clinical Texts. Stud Health Technol Inform 2024; 310:649-653. [PMID: 38269889 DOI: 10.3233/shti231045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2024]
Abstract
Several studies have shown that about 80% of the medical information in an electronic health record is only available through unstructured data. Resources such as medical terminologies in languages other than English are limited and restrain the NLP tools. We propose here to leverage English based resources in other languages using a combination of translation, word alignment, entity extraction and term normalization (TAXN). We implement this extraction pipeline in an open-source library called "medkit". We demonstrate the interest of this approach through a specific use-case: enriching a phenotypic dictionary for post-acute sequelae in COVID-19 (PASC). TAXN proved to be efficient to propose new synonyms of UMLS terms using a corpus of 70 articles in French with 356 terms enriched with at least one validated new synonym. This study was based on freely available deep-learning models.
Collapse
Affiliation(s)
- Antoine Neuraz
- Heka Team, Inria, INSERM Centre de recherche des Cordeliers, Université Paris Cité, Paris, France
| | - Ivan Lerner
- Heka Team, Inria, INSERM Centre de recherche des Cordeliers, Université Paris Cité, Paris, France
- Department of Biomedical Informatics, Hôpital Européen Georges Pompidou, Hôpital Necker-Enfants Malades, APHP, Paris, France
| | - Olivier Birot
- Heka Team, Inria, INSERM Centre de recherche des Cordeliers, Université Paris Cité, Paris, France
| | - Camila Arias
- Heka Team, Inria, INSERM Centre de recherche des Cordeliers, Université Paris Cité, Paris, France
| | - Larry Han
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Clara Lea Bonzel
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Tianxi Cai
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Kim Tam Huynh
- Heka Team, Inria, INSERM Centre de recherche des Cordeliers, Université Paris Cité, Paris, France
| | - Adrien Coulet
- Heka Team, Inria, INSERM Centre de recherche des Cordeliers, Université Paris Cité, Paris, France
| |
Collapse
|
9
|
Guo Y, Ge Y, Sarker A. Detection of Medication Mentions and Medication Change Events in Clinical Notes Using Transformer-Based Models. Stud Health Technol Inform 2024; 310:685-689. [PMID: 38269896 DOI: 10.3233/shti231052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2024]
Abstract
In this paper, we address the related tasks of medication extraction, event classification, and context classification from clinical text. The data for the tasks were obtained from the National Natural Language Processing (NLP) Clinical Challenges (n2c2) Track 1. We developed a named entity recognition (NER) model based on BioClinicalBERT and applied a dictionary-based fuzzy matching mechanism to identify the medication mentions in clinical notes. We developed a unified model architecture for event classification and context classification. The model used two pre-trained models-BioClinicalBERT and RoBERTa to predict the class, separately. Additionally, we applied an ensemble mechanism to combine the predictions of BioClinicalBERT and RoBERTa. For event classification, our best model achieved 0.926 micro-averaged F1-score, 5% higher than the baseline model. The shared task released the data in different stages during the evaluation phase. Our system consistently ranked among the top 10 for Releases 1 and 2.
Collapse
Affiliation(s)
- Yuting Guo
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, United States
| | - Yao Ge
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, United States
| | - Abeed Sarker
- Department of Biomedical Informatics, School of Medicine, Emory University, Atlanta, United States
| |
Collapse
|
10
|
Stevens M, Kennedy G, Churches T. Applying and Improving a Publicly Available Medication NER Pipeline in a Clinical Cancer EMR. Stud Health Technol Inform 2024; 310:679-684. [PMID: 38269895 DOI: 10.3233/shti231051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2024]
Abstract
Clinical NLP can be applied to extract medication information from free-text notes in EMRs, using NER pipelines. Publicly available annotated data for clinical NLP are scarce, and research annotation budgets are often low. Fine-tuning pre-trained pipelines containing a Transformer layer can produce quality results with relatively small training corpora. We examine the transferability of a publicly available, pre-trained NER pipeline with a Transformer layer for medication targets. The pipeline performs poorly when directly validated but achieves an F1-score of 92% for drug names after fine-tuning with 1,565 annotated samples from a clinical cancer EMR - highlighting the benefits of the Transformer architecture in this setting. Performance was largely influenced by inconsistent annotation - reinforcing the need for innovative annotation processes in clinical NLP applications.
Collapse
Affiliation(s)
- Meg Stevens
- University of New South Wales, Sydney Australia
| | - Georgina Kennedy
- University of New South Wales, Sydney Australia
- Ingham Institute for Applied Medical Research, Sydney Australia
- Maridulu Budyari Gumal (SPHERE) Cancer Clinical Academic Group
| | - Timothy Churches
- University of New South Wales, Sydney Australia
- Ingham Institute for Applied Medical Research, Sydney Australia
| |
Collapse
|
11
|
Zhou H, Austin R, Lu SC, Silverman GM, Zhou Y, Kilicoglu H, Xu H, Zhang R. Complementary and Integrative Health Information in the literature: its lexicon and named entity recognition. J Am Med Inform Assoc 2024; 31:426-434. [PMID: 37952122 PMCID: PMC10797266 DOI: 10.1093/jamia/ocad216] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2023] [Revised: 10/20/2023] [Accepted: 11/08/2023] [Indexed: 11/14/2023] Open
Abstract
OBJECTIVE To construct an exhaustive Complementary and Integrative Health (CIH) Lexicon (CIHLex) to help better represent the often underrepresented physical and psychological CIH approaches in standard terminologies, and to also apply state-of-the-art natural language processing (NLP) techniques to help recognize them in the biomedical literature. MATERIALS AND METHODS We constructed the CIHLex by integrating various resources, compiling and integrating data from biomedical literature and relevant sources of knowledge. The Lexicon encompasses 724 unique concepts with 885 corresponding unique terms. We matched these concepts to the Unified Medical Language System (UMLS), and we developed and utilized BERT models comparing their efficiency in CIH named entity recognition to well-established models including MetaMap and CLAMP, as well as the large language model GPT3.5-turbo. RESULTS Of the 724 unique concepts in CIHLex, 27.2% could be matched to at least one term in the UMLS. About 74.9% of the mapped UMLS Concept Unique Identifiers were categorized as "Therapeutic or Preventive Procedure." Among the models applied to CIH named entity recognition, BLUEBERT delivered the highest macro-average F1-score of 0.91, surpassing other models. CONCLUSION Our CIHLex significantly augments representation of CIH approaches in biomedical literature. Demonstrating the utility of advanced NLP models, BERT notably excelled in CIH entity recognition. These results highlight promising strategies for enhancing standardization and recognition of CIH terminology in biomedical contexts.
Collapse
Affiliation(s)
- Huixue Zhou
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, United States
| | - Robin Austin
- School of Nursing, University of Minnesota, Minneapolis, MN, United States
| | - Sheng-Chieh Lu
- Department of Symptom Research, The University of Texas MD Anderson Cancer Center, Houston, TX, United States
| | - Greg Marc Silverman
- Department of Surgery, University of Minnesota, Minneapolis, MN, United States
| | - Yuqi Zhou
- Institute for Health Informatics, University of Minnesota, Minneapolis, MN, United States
- Department of Pharmaceutical Care & Health Systems, University of Minnesota, Minneapolis, MN, United States
| | - Halil Kilicoglu
- School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, United States
| | - Hua Xu
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, United States
| | - Rui Zhang
- Department of Surgery, University of Minnesota, Minneapolis, MN, United States
| |
Collapse
|
12
|
Yang J, Liu C, Deng W, Wu D, Weng C, Zhou Y, Wang K. Enhancing phenotype recognition in clinical notes using large language models: PhenoBCBERT and PhenoGPT. Patterns (N Y) 2024; 5:100887. [PMID: 38264716 PMCID: PMC10801236 DOI: 10.1016/j.patter.2023.100887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/14/2023] [Revised: 10/25/2023] [Accepted: 11/06/2023] [Indexed: 01/25/2024]
Abstract
To enhance phenotype recognition in clinical notes of genetic diseases, we developed two models-PhenoBCBERT and PhenoGPT-for expanding the vocabularies of Human Phenotype Ontology (HPO) terms. While HPO offers a standardized vocabulary for phenotypes, existing tools often fail to capture the full scope of phenotypes due to limitations from traditional heuristic or rule-based approaches. Our models leverage large language models to automate the detection of phenotype terms, including those not in the current HPO. We compare these models with PhenoTagger, another HPO recognition tool, and found that our models identify a wider range of phenotype concepts, including previously uncharacterized ones. Our models also show strong performance in case studies on biomedical literature. We evaluate the strengths and weaknesses of BERT- and GPT-based models in aspects such as architecture and accuracy. Overall, our models enhance automated phenotype detection from clinical texts, improving downstream analyses on human diseases.
Collapse
Affiliation(s)
- Jingye Yang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Cong Liu
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
| | - Wendy Deng
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Da Wu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
| | - Yunyun Zhou
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Biostatistics and Bioinformatics Facility, Fox Chase Cancer Center, Philadelphia, PA 19111, USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
Collapse
|
13
|
Sugimoto K, Wada S, Konishi S, Okada K, Manabe S, Matsumura Y, Takeda T. Extracting Clinical Information From Japanese Radiology Reports Using a 2-Stage Deep Learning Approach: Algorithm Development and Validation. JMIR Med Inform 2023; 11:e49041. [PMID: 37991979 PMCID: PMC10686535 DOI: 10.2196/49041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2023] [Revised: 09/25/2023] [Accepted: 10/03/2023] [Indexed: 11/24/2023] Open
Abstract
Background Radiology reports are usually written in a free-text format, which makes it challenging to reuse the reports. Objective For secondary use, we developed a 2-stage deep learning system for extracting clinical information and converting it into a structured format. Methods Our system mainly consists of 2 deep learning modules: entity extraction and relation extraction. For each module, state-of-the-art deep learning models were applied. We trained and evaluated the models using 1040 in-house Japanese computed tomography (CT) reports annotated by medical experts. We also evaluated the performance of the entire pipeline of our system. In addition, the ratio of annotated entities in the reports was measured to validate the coverage of the clinical information with our information model. Results The microaveraged F1-scores of our best-performing model for entity extraction and relation extraction were 96.1% and 97.4%, respectively. The microaveraged F1-score of the 2-stage system, which is a measure of the performance of the entire pipeline of our system, was 91.9%. Our system showed encouraging results for the conversion of free-text radiology reports into a structured format. The coverage of clinical information in the reports was 96.2% (6595/6853). Conclusions Our 2-stage deep system can extract clinical information from chest and abdomen CT reports accurately and comprehensively.
Collapse
Affiliation(s)
- Kento Sugimoto
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
| | - Shoya Wada
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
- Department of Transformative System for Medical Information, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
| | - Shozo Konishi
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
| | - Katsuki Okada
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
| | - Shirou Manabe
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
- Department of Transformative System for Medical Information, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
| | - Yasushi Matsumura
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
- National Hospital Organization Osaka National Hospital, Osaka, Japan
| | - Toshihiro Takeda
- Department of Medical Informatics, Graduate School of Medicine, Osaka University, Suita, Osaka, Japan
| |
Collapse
|
14
|
Yang J, Liu C, Deng W, Wu D, Weng C, Zhou Y, Wang K. Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT. ArXiv 2023:arXiv:2308.06294v2. [PMID: 37986722 PMCID: PMC10659449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 11/22/2023]
Abstract
To enhance phenotype recognition in clinical notes of genetic diseases, we developed two models - PhenoBCBERT and PhenoGPT - for expanding the vocabularies of Human Phenotype Ontology (HPO) terms. While HPO offers a standardized vocabulary for phenotypes, existing tools often fail to capture the full scope of phenotypes, due to limitations from traditional heuristic or rule-based approaches. Our models leverage large language models (LLMs) to automate the detection of phenotype terms, including those not in the current HPO. We compared these models to PhenoTagger, another HPO recognition tool, and found that our models identify a wider range of phenotype concepts, including previously uncharacterized ones. Our models also showed strong performance in case studies on biomedical literature. We evaluated the strengths and weaknesses of BERT-based and GPT-based models in aspects such as architecture and accuracy. Overall, our models enhance automated phenotype detection from clinical texts, improving downstream analyses on human diseases.
Collapse
Affiliation(s)
- Jingye Yang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Cong Liu
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
| | - Wendy Deng
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Da Wu
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
| | - Chunhua Weng
- Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA
| | - Yunyun Zhou
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Biostatistics and Bioinformatics facility, Fox Chase Cancer Center, Philadelphia, PA 19111, USA
| | - Kai Wang
- Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| |
Collapse
|
15
|
Chen S, Lan X, Yu H. A social network analysis: mental health scales used during the COVID-19 pandemic. Front Psychiatry 2023; 14:1199906. [PMID: 37706038 PMCID: PMC10495585 DOI: 10.3389/fpsyt.2023.1199906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 04/04/2023] [Accepted: 08/11/2023] [Indexed: 09/15/2023] Open
Abstract
Introduction The focus on psychological issues during COVID-19 has led to the development of large surveys that involve the use of mental health scales. Numerous mental health measurements are available; choosing the appropriate measurement is crucial. Methods A rule-based named entity recognition was used to recognize entities of mental health scales that occur in the articles from PubMed. The co-occurrence networks of mental health scales and Medical Subject Headings (MeSH) terms were constructed by Gephi. Results Five types of MeSH terms were filtered, including research objects, research topics, research methods, countries/regions, and factors. Seventy-eight mental health scales were discovered. Discussion The findings provide insights on the scales used most often during the pandemic, the key instruments used to measure healthcare workers' physical and mental health, the scales most often utilized for assessing maternal mental health, the tools used most commonly for assessing older adults' psychological resilience and loneliness, and new COVID-19 mental health scales. Future studies may use these findings as a guiding reference and compass.
Collapse
Affiliation(s)
| | - Xue Lan
- Department of Health Management, China Medical University, Shenyang, China
| | | |
Collapse
|
16
|
Jiang Y, Kavuluru R. End-to-End n-ary Relation Extraction for Combination Drug Therapies. IEEE Int Conf Healthc Inform 2023; 2023:72-80. [PMID: 38283165 PMCID: PMC10814995 DOI: 10.1109/ichi57859.2023.00021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/30/2024]
Abstract
Combination drug therapies are treatment regimens that involve two or more drugs, administered more commonly for patients with cancer, HIV, malaria, or tuberculosis. Currently there are over 350K articles in PubMed that use the combination drug therapy MeSH heading with at least 10K articles published per year over the past two decades. Extracting combination therapies from scientific literature inherently constitutes an n-ary relation extraction problem. Unlike in the general n-ary setting where n is fixed (e.g., drug-gene-mutation relations where n = 3), extracting combination therapies is a special setting where n ≥ 2 is dynamic, depending on each instance. Recently, Tiktinsky et al. (NAACL 2022) introduced a first of its kind dataset, CombDrugExt, for extracting such therapies from literature. Here, we use a sequence-to-sequence style end-to-end extraction method to achieve an F1-Score of 66.7% on the CombDrugExt test set for positive (or effective) combinations. This is an absolute ≈ 5% F1-score improvement even over the prior best relation classification score with spotted drug entities (hence, not end-to-end). Thus our effort introduces a state-of-the-art first model for end-to-end extraction that is already superior to the best prior non end-to-end model for this task. Our model seamlessly extracts all drug entities and relations in a single pass and is highly suitable for dynamic n-ary extraction scenarios.
Collapse
Affiliation(s)
- Yuhang Jiang
- Department of Computer Science, University of Kentucky, Lexington, KY USA
| | - Ramakanth Kavuluru
- Division of Biomedical Informatics, Dept. of Internal Medicine, Univ. of Kentucky, Lexington, KY USA
| |
Collapse
|
17
|
Sun Z, Tao C. Named Entity Recognition and Normalization for Alzheimer's Disease Eligibility Criteria. IEEE Int Conf Healthc Inform 2023; 2023:558-564. [PMID: 38283164 PMCID: PMC10815931 DOI: 10.1109/ichi57859.2023.00100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/30/2024]
Abstract
Alzheimer's Disease (AD) is a complex neurodegenerative disorder that affects millions of people worldwide. Finding effective treatments for this disease is crucial. Clinical trials play an essential role in developing and testing new treatments for AD. However, identifying eligible participants can be challenging, time-consuming, and costly. In recent years, the development of natural language processing (NLP) techniques, specifically named entity recognition (NER) and named entity normalization (NEN), have helped to automate the identification and extraction of relevant information from the eligibility criteria (EC) more efficiently, in order to facilitate semi-automatic patient recruitment and enable data FAIRness for clinical trial data. Nevertheless, most current biomedical NER models only provide annotations for a restricted set of entity types that may not be applicable to the clinical trial data. Additionally, accurately performing NEN on entities that are negated using a negative prefix currently lacks established techniques. In this paper, we introduce a pipeline designed for information extraction from AD clinical trial EC, which involves preprocessing of the EC data, clinical NER, and biomedical NEN to Unified Medical Language System (UMLS). Our NER model can identify named entities in seven pre-defined categories, while our NEN model employs a combination of exact match and partial match search strategies, as well as customized rules to accurately normalize entities with negative prefixes. To evaluate the performance of our pipeline, we measured the precision, recall, and F1 score for the NER component, and we manually reviewed the top five mapping results produced by the NEN component. Our evaluation of the pipeline's performance revealed that it can successfully normalize named entities in clinical trial ECs with optimal accuracies. The NER component achieved a overall F1 of 0.816, demonstrating its ability to accurately identify seven types of named entities in clinical text. The NEN component of the pipeline also demonstrated impressive performance, with customized rules and a combination of exact and partial match strategies leading to an accuracy of 0.940 for normalized entities.
Collapse
Affiliation(s)
- Zenan Sun
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas
| | - Cui Tao
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas
| |
Collapse
|
18
|
Wei J, Hu T, Dai J, Wang Z, Han P, Huang W. Research on named entity recognition of adverse drug reactions based on NLP and deep learning. Front Pharmacol 2023; 14:1121796. [PMID: 37332351 PMCID: PMC10270322 DOI: 10.3389/fphar.2023.1121796] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2022] [Accepted: 05/23/2023] [Indexed: 06/20/2023] Open
Abstract
Introduction: Adverse drug reactions (ADR) are directly related to public health and become the focus of public and media attention. At present, a large number of ADR events have been reported on the Internet, but the mining and utilization of such information resources is insufficient. Named entity recognition (NER) is the basic work of many natural language processing (NLP) tasks, which aims to identify entities with specific meanings from natural language texts. Methods: In order to identify entities from ADR event data resources more effectively, so as to provide valuable health knowledge for people, this paper introduces ALBERT in the input presentation layer on the basis of the classic BiLSTM-CRF model, and proposes a method of ADR named entity recognition based on the ALBERT-BiLSTM-CRF model. The textual information about ADR on the website "Chinese medical information query platform" (https://www.dayi.org.cn) was collected by the crawler and used as research data, and the BIO method was used to label three types of entities: drug name (DRN), drug component (COM), and adverse drug reactions (ADR) to build a corpus. Then, the words were mapped to the word vector by using the ALBERT module to obtain the character level semantic information, the context coding was performed by the BiLSTM module, and the label decoding was using the CRF module to predict the real label. Results: Based on the constructed corpus, experimental comparisons were made with two classical models, namely, BiLSTM-CRF and BERT-BiLSTM-CRF. The experimental results show that the F 1 of our method is 91.19% on the whole, which is 1.5% and 1.37% higher than the other two models respectively, and the performance of recognition of three types of entities is significantly improved, which proves the superiority of this method. Discussion: The method proposed can be used effectively in NER from ADR information on the Internet, which provides a basis for the extraction of drug-related entity relationships and the construction of knowledge graph, thus playing a role in practical health systems such as intelligent diagnosis, risk reasoning and automatic question answering.
Collapse
Affiliation(s)
- Jianxiang Wei
- School of Management, Nanjing University of Posts and Telecommunications, Nanjing, China
| | - Tianling Hu
- School of Cyber Science and Engineering, Southeast University, Nanjing, China
| | - Jimin Dai
- School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, China
| | - Ziren Wang
- School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, China
| | - Pu Han
- School of Management, Nanjing University of Posts and Telecommunications, Nanjing, China
| | - Weidong Huang
- School of Management, Nanjing University of Posts and Telecommunications, Nanjing, China
- Key Research Base of Philosophy and Social Sciences in Jiangsu-Information Industry Integration Innovation and Emergency Management Research Center, Nanjing, China
| |
Collapse
|
19
|
Xu Q, Zhou Y, Liao B, Xin Z, Xie W, Hu C, Luo A. Named Entity Recognition of Diabetes Online Health Community Data Using Multiple Machine Learning Models. Bioengineering (Basel) 2023; 10:659. [PMID: 37370590 DOI: 10.3390/bioengineering10060659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2023] [Revised: 05/19/2023] [Accepted: 05/25/2023] [Indexed: 06/29/2023] Open
Abstract
The rising prevalence of diabetes and the increasing awareness of self-health management have resulted in a surge in diabetes patients seeking health information and emotional support in online health communities. Consequently, there is a vast database of patient consultation information in these online health communities. However, due to the heterogeneity and incompleteness of the content, mining medical information and patient health data from these communities can be a challenge. To address this issue, we built the RoBERTa-BiLSTM-CRF (RBC) model for identifying entities in the online health community of diabetes. We selected 1889 question-answer texts from the most active online health community in China, Good Doctor Online, and used these public data to identify five types of entities. In addition, we conducted a comparative evaluation with three other commonly used models to validate the performance of our proposed model, including RoBERTa-CRF (RC), BilSTM-CRF (BC), and RoBERTa-Softmax (RS). The results showed that the RBC model achieved excellent performance on the test set, with an accuracy of 81.2% and an F1 score of 80.7%, outperforming the performance of traditional entity recognition models in named entity recognition in online medical communities for doctors and diabetes patients. The high performance of entity recognition in online health communities will provide a crucial knowledge source for constructing medical knowledge graphs. This integration would help alleviate the growing demand for medical consultations and the strain on healthcare resources, while assisting healthcare professionals in making informed decisions and providing personalized services to patients.
Collapse
Affiliation(s)
- Qian Xu
- Second Xiangya Hospital, Central South University, Changsha 410011, China
- School of Life Sciences, Central South University, Changsha 410013, China
- College of Computer Science and Engineering, Jishou University, Jishou 416000, China
- Key Laboratory of Medical Information Research, Central South University, College of Hunan Province, Changsha 410013, China
- Clinical Research Center for Cardiovascular Intelligent Healthcare in Hunan Province, Changsha 410011, China
| | - Yue Zhou
- Second Xiangya Hospital, Central South University, Changsha 410011, China
- School of Life Sciences, Central South University, Changsha 410013, China
- Key Laboratory of Medical Information Research, Central South University, College of Hunan Province, Changsha 410013, China
- Clinical Research Center for Cardiovascular Intelligent Healthcare in Hunan Province, Changsha 410011, China
| | - Bolin Liao
- College of Computer Science and Engineering, Jishou University, Jishou 416000, China
| | - Zirui Xin
- Second Xiangya Hospital, Central South University, Changsha 410011, China
- Key Laboratory of Medical Information Research, Central South University, College of Hunan Province, Changsha 410013, China
- Clinical Research Center for Cardiovascular Intelligent Healthcare in Hunan Province, Changsha 410011, China
| | - Wenzhao Xie
- Key Laboratory of Medical Information Research, Central South University, College of Hunan Province, Changsha 410013, China
- Clinical Research Center for Cardiovascular Intelligent Healthcare in Hunan Province, Changsha 410011, China
| | - Chao Hu
- Big Data Institute, Central South University, Changsha 410011, China
| | - Aijing Luo
- Second Xiangya Hospital, Central South University, Changsha 410011, China
- Key Laboratory of Medical Information Research, Central South University, College of Hunan Province, Changsha 410013, China
- Clinical Research Center for Cardiovascular Intelligent Healthcare in Hunan Province, Changsha 410011, China
| |
Collapse
|
20
|
Šuvalov H, Laur S, Kolde R. Information Extraction from Medical Texts with BERT Using Human-in-the-Loop Labeling. Stud Health Technol Inform 2023; 302:831-832. [PMID: 37203510 DOI: 10.3233/shti230281] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/20/2023]
Abstract
Neural network language models, such as BERT, can be used for information extraction from medical texts with unstructured free text. These models can be pre-trained on a large corpus to learn the language and characteristics of the relevant domain and then fine-tuned with labeled data for a specific task. We propose a pipeline using human-in-the-loop labeling to create annotated data for Estonian healthcare information extraction. This method is particularly useful for low-resource languages and is more accessible to those in the medical field than rule-based methods like regular expressions.
Collapse
|
21
|
Sezgin E, Hussain SA, Rust S, Huang Y. Extracting Medical Information From Free-Text and Unstructured Patient-Generated Health Data Using Natural Language Processing Methods: Feasibility Study With Real-world Data. JMIR Form Res 2023; 7:e43014. [PMID: 36881467 PMCID: PMC10031450 DOI: 10.2196/43014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2022] [Revised: 01/24/2023] [Accepted: 01/30/2023] [Indexed: 01/31/2023] Open
Abstract
BACKGROUND Patient-generated health data (PGHD) captured via smart devices or digital health technologies can reflect an individual health journey. PGHD enables tracking and monitoring of personal health conditions, symptoms, and medications out of the clinic, which is crucial for self-care and shared clinical decisions. In addition to self-reported measures and structured PGHD (eg, self-screening, sensor-based biometric data), free-text and unstructured PGHD (eg, patient care note, medical diary) can provide a broader view of a patient's journey and health condition. Natural language processing (NLP) is used to process and analyze unstructured data to create meaningful summaries and insights, showing promise to improve the utilization of PGHD. OBJECTIVE Our aim is to understand and demonstrate the feasibility of an NLP pipeline to extract medication and symptom information from real-world patient and caregiver data. METHODS We report a secondary data analysis, using a data set collected from 24 parents of children with special health care needs (CSHCN) who were recruited via a nonrandom sampling approach. Participants used a voice-interactive app for 2 weeks, generating free-text patient notes (audio transcription or text entry). We built an NLP pipeline using a zero-shot approach (adaptive to low-resource settings). We used named entity recognition (NER) and medical ontologies (RXNorm and SNOMED CT [Systematized Nomenclature of Medicine Clinical Terms]) to identify medication and symptoms. Sentence-level dependency parse trees and part-of-speech tags were used to extract additional entity information using the syntactic properties of a note. We assessed the data; evaluated the pipeline with the patient notes; and reported the precision, recall, and F1 scores. RESULTS In total, 87 patient notes are included (audio transcriptions n=78 and text entries n=9) from 24 parents who have at least one CSHCN. The participants were between the ages of 26 and 59 years. The majority were White (n=22, 92%), had more than one child (n=16, 67%), lived in Ohio (n=22, 92%), had mid- or upper-mid household income (n=15, 62.5%), and had higher level education (n=24, 58%). Out of 87 notes, 30 were drug and medication related, and 46 were symptom related. We captured medication instances (medication, unit, quantity, and date) and symptoms satisfactorily (precision >0.65, recall >0.77, F1>0.72). These results indicate the potential when using NER and dependency parsing through an NLP pipeline on information extraction from unstructured PGHD. CONCLUSIONS The proposed NLP pipeline was found to be feasible for use with real-world unstructured PGHD to accomplish medication and symptom extraction. Unstructured PGHD can be leveraged to inform clinical decision-making, remote monitoring, and self-care including medical adherence and chronic disease management. With customizable information extraction methods using NER and medical ontologies, NLP models can feasibly extract a broad range of clinical information from unstructured PGHD in low-resource settings (eg, a limited number of patient notes or training data).
Collapse
Affiliation(s)
- Emre Sezgin
- The Abigail Wexner Research Institute at Nationwide Children's Hospital, Columbus, OH, United States
- The Ohio State University College of Medicine, Columbus, OH, United States
| | - Syed-Amad Hussain
- The Abigail Wexner Research Institute at Nationwide Children's Hospital, Columbus, OH, United States
| | - Steve Rust
- The Abigail Wexner Research Institute at Nationwide Children's Hospital, Columbus, OH, United States
| | - Yungui Huang
- The Abigail Wexner Research Institute at Nationwide Children's Hospital, Columbus, OH, United States
| |
Collapse
|
22
|
Frei J, Kramer F. German Medical Named Entity Recognition Model and Data Set Creation Using Machine Translation and Word Alignment: Algorithm Development and Validation. JMIR Form Res 2023; 7:e39077. [PMID: 36853741 PMCID: PMC10015355 DOI: 10.2196/39077] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2022] [Revised: 09/11/2022] [Accepted: 11/03/2022] [Indexed: 11/06/2022] Open
Abstract
BACKGROUND Data mining in the field of medical data analysis often needs to rely solely on the processing of unstructured data to retrieve relevant data. For German natural language processing, few open medical neural named entity recognition (NER) models have been published before this work. A major issue can be attributed to the lack of German training data. OBJECTIVE We developed a synthetic data set and a novel German medical NER model for public access to demonstrate the feasibility of our approach. In order to bypass legal restrictions due to potential data leaks through model analysis, we did not make use of internal, proprietary data sets, which is a frequent veto factor for data set publication. METHODS The underlying German data set was retrieved by translation and word alignment of a public English data set. The data set served as a foundation for model training and evaluation. For demonstration purposes, our NER model follows a simple network architecture that is designed for low computational requirements. RESULTS The obtained data set consisted of 8599 sentences including 30,233 annotations. The model achieved a class frequency-averaged F1 score of 0.82 on the test set after training across 7 different NER types. Artifacts in the synthesized data set with regard to translation and alignment induced by the proposed method were exposed. The annotation performance was evaluated on an external data set and measured in comparison with an existing baseline model that has been trained on a dedicated German data set in a traditional fashion. We discussed the drop in annotation performance on an external data set for our simple NER model. Our model is publicly available. CONCLUSIONS We demonstrated the feasibility of obtaining a data set and training a German medical NER model by the exclusive use of public training data through our suggested method. The discussion on the limitations of our approach includes ways to further mitigate remaining problems in future work.
Collapse
Affiliation(s)
- Johann Frei
- IT Infrastructure for Translational Medical Research, University of Augsburg, Augsburg, Germany
| | - Frank Kramer
- IT Infrastructure for Translational Medical Research, University of Augsburg, Augsburg, Germany
| |
Collapse
|
23
|
Ma X, Yu R, Gao C, Wei Z, Xia Y, Wang X, Liu H. Research on named entity recognition method of marine natural products based on attention mechanism. Front Chem 2023; 11:958002. [PMID: 36846857 PMCID: PMC9944735 DOI: 10.3389/fchem.2023.958002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2022] [Accepted: 01/24/2023] [Indexed: 02/11/2023] Open
Abstract
Marine natural product (MNP) entity property information is the basis of marine drug development, and this entity property information can be obtained from the original literature. However, the traditional methods require several manual annotations, the accuracy of the model is low and slow, and the problem of inconsistent lexical contexts cannot be solved well. In order to solve the aforementioned problems, this study proposes a named entity recognition method based on the attention mechanism, inflated convolutional neural network (IDCNN), and conditional random field (CRF), combining the attention mechanism that can use the lexicality of words to make attention-weighted mentions of the extracted features, the ability of the inflated convolutional neural network to parallelize operations and long- and short-term memory, and the excellent learning ability. A named entity recognition algorithm model is developed for the automatic recognition of entity information in the MNP domain literature. Experiments demonstrate that the proposed model can properly identify entity information from the unstructured chapter-level literature and outperform the control model in several metrics. In addition, we construct an unstructured text dataset related to MNPs from an open-source dataset, which can be used for the research and development of resource scarcity scenarios.
Collapse
Affiliation(s)
- Xiaodong Ma
- College of Computer Science and Technology, Ocean University of China, Qingdao, China
| | - Rilei Yu
- College of Computer Science and Technology, Ocean University of China, Qingdao, China
| | - Chunxiao Gao
- College of Computer Science and Technology, Ocean University of China, Qingdao, China
| | - Zhiqiang Wei
- College of Computer Science and Technology, Ocean University of China, Qingdao, China,Pilot National Laboratory for Marine Science and Technology, Qingdao, China
| | - Yimin Xia
- College of Computer Science and Technology, Ocean University of China, Qingdao, China
| | - Xiaowei Wang
- College of Computer Science and Technology, Ocean University of China, Qingdao, China
| | - Hao Liu
- College of Computer Science and Technology, Ocean University of China, Qingdao, China,Pilot National Laboratory for Marine Science and Technology, Qingdao, China,*Correspondence: Hao Liu,
| |
Collapse
|
24
|
Li Y, Wehbe RM, Ahmad FS, Wang H, Luo Y. A comparative study of pretrained language models for long clinical text. J Am Med Inform Assoc 2023; 30:340-347. [PMID: 36451266 PMCID: PMC9846675 DOI: 10.1093/jamia/ocac225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/10/2022] [Revised: 11/06/2022] [Accepted: 11/14/2022] [Indexed: 12/03/2022] Open
Abstract
OBJECTIVE Clinical knowledge-enriched transformer models (eg, ClinicalBERT) have state-of-the-art results on clinical natural language processing (NLP) tasks. One of the core limitations of these transformer models is the substantial memory consumption due to their full self-attention mechanism, which leads to the performance degradation in long clinical texts. To overcome this, we propose to leverage long-sequence transformer models (eg, Longformer and BigBird), which extend the maximum input sequence length from 512 to 4096, to enhance the ability to model long-term dependencies in long clinical texts. MATERIALS AND METHODS Inspired by the success of long-sequence transformer models and the fact that clinical notes are mostly long, we introduce 2 domain-enriched language models, Clinical-Longformer and Clinical-BigBird, which are pretrained on a large-scale clinical corpus. We evaluate both language models using 10 baseline tasks including named entity recognition, question answering, natural language inference, and document classification tasks. RESULTS The results demonstrate that Clinical-Longformer and Clinical-BigBird consistently and significantly outperform ClinicalBERT and other short-sequence transformers in all 10 downstream tasks and achieve new state-of-the-art results. DISCUSSION Our pretrained language models provide the bedrock for clinical NLP using long texts. We have made our source code available at https://github.com/luoyuanlab/Clinical-Longformer, and the pretrained models available for public download at: https://huggingface.co/yikuan8/Clinical-Longformer. CONCLUSION This study demonstrates that clinical knowledge-enriched long-sequence transformers are able to learn long-term dependencies in long clinical text. Our methods can also inspire the development of other domain-enriched long-sequence transformers.
Collapse
Affiliation(s)
- Yikuan Li
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
| | - Ramsey M Wehbe
- Division of Cardiology, Department of Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, Illinois, USA
| | - Faraz S Ahmad
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
- Division of Cardiology, Department of Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
- Bluhm Cardiovascular Institute Center for Artificial Intelligence, Northwestern Medicine, Chicago, Illinois, USA
| | - Hanyin Wang
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
| | - Yuan Luo
- Division of Health and Biomedical Informatics, Department of Preventive Medicine, Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
| |
Collapse
|
25
|
Lee H, Jeong O. A Knowledge-Grounded Task-Oriented Dialogue System with Hierarchical Structure for Enhancing Knowledge Selection. Sensors (Basel) 2023; 23:685. [PMID: 36679481 PMCID: PMC9864774 DOI: 10.3390/s23020685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 11/20/2022] [Revised: 01/04/2023] [Accepted: 01/04/2023] [Indexed: 06/17/2023]
Abstract
For a task-oriented dialogue system to provide appropriate answers to and services for users' questions, it is necessary for it to be able to utilize knowledge related to the topic of the conversation. Therefore, the system should be able to select the most appropriate knowledge snippet from the knowledge base, where external unstructured knowledge is used to respond to user requests that cannot be solved by the internal knowledge addressed by the database or application programming interface. Therefore, this paper constructs a three-step knowledge-grounded task-oriented dialogue system with knowledge-seeking-turn detection, knowledge selection, and knowledge-grounded generation. In particular, we propose a hierarchical structure of domain-classification, entity-extraction, and snippet-ranking tasks by subdividing the knowledge selection step. Each task is performed through the pre-trained language model with advanced techniques to finally determine the knowledge snippet to be used to generate a response. Furthermore, the domain and entity information obtained because of the previous task is used as knowledge to reduce the search range of candidates, thereby improving the performance and efficiency of knowledge selection and proving it through experiments.
Collapse
|
26
|
Haisa G, Altenbek G. Multi-Task Learning Model for Kazakh Query Understanding. Sensors (Basel) 2022; 22:9810. [PMID: 36560177 PMCID: PMC9785505 DOI: 10.3390/s22249810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 10/15/2022] [Revised: 11/29/2022] [Accepted: 12/09/2022] [Indexed: 06/17/2023]
Abstract
Query understanding (QU) plays a vital role in natural language processing, particularly in regard to question answering and dialogue systems. QU finds the named entity and query intent in users' questions. Traditional pipeline approaches manage the two mentioned tasks, namely, the named entity recognition (NER) and the question classification (QC), separately. NER is seen as a sequence labeling task to predict a keyword, while QC is a semantic classification task to predict the user's intent. Considering the correlation between these two tasks, training them together could be of benefit to both of them. Kazakh is a low-resource language with wealthy lexical and agglutinative characteristics. We argue that current QU techniques restrict the power of the word-level and sentence-level features of agglutinative languages, especially the stem, suffixes, POS, and gazetteers. This paper proposes a new multi-task learning model for query understanding (MTQU). The MTQU model is designed to establish direct connections for QC and NER tasks to help them promote each other mutually, while we also designed a multi-feature input layer that significantly influenced the model's performance during training. In addition, we constructed new corpora for the Kazakh query understanding task, namely, the KQU. As a result, the MTQU model is simple and effective and obtains competitive results for the KQU.
Collapse
Affiliation(s)
- Gulizada Haisa
- College of Information Science and Engineering, Xinjiang University, Ürümqi 830017, China
- The Base of Kazakh and Kirghiz Language of National Language Resource Monitoring, Research Center on Minority Languages, Ürümqi 830017, China
- Xinjiang Laboratory of Multi-Language Information Technology, Ürümqi 830017, China
| | - Gulila Altenbek
- College of Information Science and Engineering, Xinjiang University, Ürümqi 830017, China
- The Base of Kazakh and Kirghiz Language of National Language Resource Monitoring, Research Center on Minority Languages, Ürümqi 830017, China
- Xinjiang Laboratory of Multi-Language Information Technology, Ürümqi 830017, China
| |
Collapse
|
27
|
Azizi S, Hier DB, Wunsch II DC. Enhanced neurologic concept recognition using a named entity recognition model based on transformers. Front Digit Health 2022; 4:1065581. [PMID: 36569804 PMCID: PMC9772022 DOI: 10.3389/fdgth.2022.1065581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2022] [Accepted: 11/21/2022] [Indexed: 12/12/2022] Open
Abstract
Although deep learning has been applied to the recognition of diseases and drugs in electronic health records and the biomedical literature, relatively little study has been devoted to the utility of deep learning for the recognition of signs and symptoms. The recognition of signs and symptoms is critical to the success of deep phenotyping and precision medicine. We have developed a named entity recognition model that uses deep learning to identify text spans containing neurological signs and symptoms and then maps these text spans to the clinical concepts of a neuro-ontology. We compared a model based on convolutional neural networks to one based on bidirectional encoder representation from transformers. Models were evaluated for accuracy of text span identification on three text corpora: physician notes from an electronic health record, case histories from neurologic textbooks, and clinical synopses from an online database of genetic diseases. Both models performed best on the professionally-written clinical synopses and worst on the physician-written clinical notes. Both models performed better when signs and symptoms were represented as shorter text spans. Consistent with prior studies that examined the recognition of diseases and drugs, the model based on bidirectional encoder representations from transformers outperformed the model based on convolutional neural networks for recognizing signs and symptoms. Recall for signs and symptoms ranged from 59.5% to 82.0% and precision ranged from 61.7% to 80.4%. With further advances in NLP, fully automated recognition of signs and symptoms in electronic health records and the medical literature should be feasible.
Collapse
Affiliation(s)
- Sima Azizi
- Applied Computational Intelligence Laboratory, Department of Electrical & Computer Engineering, Missouri University of Science & Technology, Rolla, MO, United States
| | - Daniel B. Hier
- Applied Computational Intelligence Laboratory, Department of Electrical & Computer Engineering, Missouri University of Science & Technology, Rolla, MO, United States
- Department of Neurology and Rehabilitation, University of Illinois at Chicago, Chicago, IL, United States
| | - Donald C. Wunsch II
- Applied Computational Intelligence Laboratory, Department of Electrical & Computer Engineering, Missouri University of Science & Technology, Rolla, MO, United States
- National Science Foundation, ECCS Division, Arlington, VA, United States
| |
Collapse
|
28
|
Ivanisenko TV, Demenkov PS, Kolchanov NA, Ivanisenko VA. The New Version of the ANDDigest Tool with Improved AI-Based Short Names Recognition. Int J Mol Sci 2022; 23:ijms232314934. [PMID: 36499269 PMCID: PMC9738852 DOI: 10.3390/ijms232314934] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2022] [Revised: 11/19/2022] [Accepted: 11/22/2022] [Indexed: 12/05/2022] Open
Abstract
The body of scientific literature continues to grow annually. Over 1.5 million abstracts of biomedical publications were added to the PubMed database in 2021. Therefore, developing cognitive systems that provide a specialized search for information in scientific publications based on subject area ontology and modern artificial intelligence methods is urgently needed. We previously developed a web-based information retrieval system, ANDDigest, designed to search and analyze information in the PubMed database using a customized domain ontology. This paper presents an improved ANDDigest version that uses fine-tuned PubMedBERT classifiers to enhance the quality of short name recognition for molecular-genetics entities in PubMed abstracts on eight biological object types: cell components, diseases, side effects, genes, proteins, pathways, drugs, and metabolites. This approach increased average short name recognition accuracy by 13%.
Collapse
Affiliation(s)
- Timofey V. Ivanisenko
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Correspondence:
| | - Pavel S. Demenkov
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
| | - Nikolay A. Kolchanov
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Faculty of Natural Sciences, Novosibirsk State University, St. Pirogova 1, Novosibirsk 630090, Russia
| | - Vladimir A. Ivanisenko
- Kurchatov Genomics Center, Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Institute of Cytology & Genetics, Siberian Branch, Russian Academy of Sciences, Prospekt Lavrentyeva 10, Novosibirsk 630090, Russia
- Faculty of Natural Sciences, Novosibirsk State University, St. Pirogova 1, Novosibirsk 630090, Russia
| |
Collapse
|
29
|
Liao T, Huang R, Zhang S, Duan S, Chen Y, Ma W, Chen X. Nested Named Entity Recognition Based on Dual Stream Feature Complementation. Entropy (Basel) 2022; 24:1454. [PMID: 37420474 DOI: 10.3390/e24101454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/21/2022] [Revised: 09/19/2022] [Accepted: 10/05/2022] [Indexed: 07/09/2023]
Abstract
Named entity recognition is a basic task in natural language processing, and there is a large number of nested structures in named entities. Nested named entities become the basis for solving many tasks in NLP. A nested named entity recognition model based on dual-flow features complementary is proposed for obtaining efficient feature information after text coding. Firstly, sentences are embedded at both the word level and the character level of the words, then sentence context information is obtained separately via the neural network Bi-LSTM; Afterward, two vectors perform low-level feature complementary to reinforce low-level semantic information; Sentence-local information is captured with the multi-head attention mechanism, then the feature vector is sent to the high-level feature complementary module to obtain deep semantic information; Finally, the entity word recognition module and the fine-grained division module are entered to obtain the internal entity. The experimental results show that the model has a great improvement in feature extraction compared to the classical model.
Collapse
Affiliation(s)
- Tao Liao
- College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China
| | - Rongmei Huang
- College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China
| | - Shunxiang Zhang
- College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China
| | - Songsong Duan
- College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China
| | - Yanjie Chen
- College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China
| | - Wenxiang Ma
- College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China
| | - Xinyuan Chen
- College of Computer Science and Engineering, Anhui University of Science and Technology, Huainan 232001, China
| |
Collapse
|
30
|
Luo L, Lai PT, Wei CH, Arighi CN, Lu Z. BioRED: a rich biomedical relation extraction dataset. Brief Bioinform 2022; 23:6645993. [PMID: 35849818 PMCID: PMC9487702 DOI: 10.1093/bib/bbac282] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2022] [Revised: 06/02/2022] [Accepted: 06/19/2022] [Indexed: 11/13/2022] Open
Abstract
Automated relation extraction (RE) from biomedical literature is critical for many downstream text mining applications in both research and real-world settings. However, most existing benchmarking datasets for biomedical RE only focus on relations of a single type (e.g. protein-protein interactions) at the sentence level, greatly limiting the development of RE systems in biomedicine. In this work, we first review commonly used named entity recognition (NER) and RE datasets. Then, we present a first-of-its-kind biomedical relation extraction dataset (BioRED) with multiple entity types (e.g. gene/protein, disease, chemical) and relation pairs (e.g. gene-disease; chemical-chemical) at the document level, on a set of 600 PubMed abstracts. Furthermore, we label each relation as describing either a novel finding or previously known background knowledge, enabling automated algorithms to differentiate between novel and background information. We assess the utility of BioRED by benchmarking several existing state-of-the-art methods, including Bidirectional Encoder Representations from Transformers (BERT)-based models, on the NER and RE tasks. Our results show that while existing approaches can reach high performance on the NER task (F-score of 89.3%), there is much room for improvement for the RE task, especially when extracting novel relations (F-score of 47.7%). Our experiments also demonstrate that such a rich dataset can successfully facilitate the development of more accurate, efficient and robust RE systems for biomedicine. Availability: The BioRED dataset and annotation guidelines are freely available at https://ftp.ncbi.nlm.nih.gov/pub/lu/BioRED/.
Collapse
Affiliation(s)
- Ling Luo
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Po-Ting Lai
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | - Chih-Hsuan Wei
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| | | | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
| |
Collapse
|
31
|
Doerstling SS, Akrobetu D, Engelhard MM, Chen F, Ubel PA. A Disease Identification Algorithm for Medical Crowdfunding Campaigns: Validation Study. J Med Internet Res 2022; 24:e32867. [PMID: 35727610 PMCID: PMC9257615 DOI: 10.2196/32867] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2021] [Revised: 03/11/2022] [Accepted: 04/20/2022] [Indexed: 11/13/2022] Open
Abstract
Background Web-based crowdfunding has become a popular method to raise money for medical expenses, and there is growing research interest in this topic. However, crowdfunding data are largely composed of unstructured text, thereby posing many challenges for researchers hoping to answer questions about specific medical conditions. Previous studies have used methods that either failed to address major challenges or were poorly scalable to large sample sizes. To enable further research on this emerging funding mechanism in health care, better methods are needed. Objective We sought to validate an algorithm for identifying 11 disease categories in web-based medical crowdfunding campaigns. We hypothesized that a disease identification algorithm combining a named entity recognition (NER) model and word search approach could identify disease categories with high precision and accuracy. Such an algorithm would facilitate further research using these data. Methods Web scraping was used to collect data on medical crowdfunding campaigns from GoFundMe (GoFundMe Inc). Using pretrained NER and entity resolution models from Spark NLP for Healthcare in combination with targeted keyword searches, we constructed an algorithm to identify conditions in the campaign descriptions, translate conditions to International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) codes, and predict the presence or absence of 11 disease categories in the campaigns. The classification performance of the algorithm was evaluated against 400 manually labeled campaigns. Results We collected data on 89,645 crowdfunding campaigns through web scraping. The interrater reliability for detecting the presence of broad disease categories in the campaign descriptions was high (Cohen κ: range 0.69-0.96). The NER and entity resolution models identified 6594 unique (276,020 total) ICD-10-CM codes among all of the crowdfunding campaigns in our sample. Through our word search, we identified 3261 additional campaigns for which a medical condition was not otherwise detected with the NER model. When averaged across all disease categories and weighted by the number of campaigns that mentioned each disease category, the algorithm demonstrated an overall precision of 0.83 (range 0.48-0.97), a recall of 0.77 (range 0.42-0.98), an F1 score of 0.78 (range 0.56-0.96), and an accuracy of 95% (range 90%-98%). Conclusions A disease identification algorithm combining pretrained natural language processing models and ICD-10-CM code–based disease categorization was able to detect 11 disease categories in medical crowdfunding campaigns with high precision and accuracy.
Collapse
Affiliation(s)
- Steven S Doerstling
- Duke University School of Medicine, Duke University, Durham, NC, United States.,Margolis Center for Health Policy, Duke University, Durham, NC, United States
| | - Dennis Akrobetu
- Duke University School of Medicine, Duke University, Durham, NC, United States.,Margolis Center for Health Policy, Duke University, Durham, NC, United States
| | - Matthew M Engelhard
- Department of Biostatistics & Bioinformatics, Duke University, Durham, NC, United States
| | | | - Peter A Ubel
- Fuqua School of Business, Duke University, Durham, NC, United States.,Sanford School of Public Policy, Duke University, Durham, NC, United States
| |
Collapse
|
32
|
McInnes BT, Downie JS, Hao Y, Jett J, Keating K, Nakum G, Ranjan S, Rodriguez NE, Tang J, Xiang D, Young EM, Nguyen MH. Discovering Content through Text Mining for a Synthetic Biology Knowledge System. ACS Synth Biol 2022; 11:2043-2054. [PMID: 35671034 DOI: 10.1021/acssynbio.1c00611] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
Scientific articles contain a wealth of information about experimental methods and results describing biological designs. Due to its unstructured nature and multiple sources of ambiguity and variability, extracting this information from text is a difficult task. In this paper, we describe the development of the synthetic biology knowledge system (SBKS) text processing pipeline. The pipeline uses natural language processing techniques to extract and correlate information from the literature for synthetic biology researchers. Specifically, we apply named entity recognition, relation extraction, concept grounding, and topic modeling to extract information from published literature to link articles to elements within our knowledge system. Our results show the efficacy of each of the components on synthetic biology literature and provide future directions for further advancement of the pipeline.
Collapse
Affiliation(s)
- Bridget T McInnes
- Virginia Commonwealth University, Richmond, Virginia 23284, United States
| | - J Stephen Downie
- University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| | - Yikai Hao
- University of California San Diego, La Jolla, California 92093, United States
| | - Jacob Jett
- University of Illinois at Urbana-Champaign, Urbana, Illinois 61801, United States
| | - Kevin Keating
- Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
| | - Gaurav Nakum
- University of California San Diego, La Jolla, California 92093, United States
| | - Sudhanshu Ranjan
- University of California San Diego, La Jolla, California 92093, United States
| | | | - Jiawei Tang
- University of California San Diego, La Jolla, California 92093, United States
| | - Du Xiang
- University of California San Diego, La Jolla, California 92093, United States
| | - Eric M Young
- Worcester Polytechnic Institute, Worcester, Massachusetts 01609, United States
| | - Mai H Nguyen
- University of California San Diego, La Jolla, California 92093, United States
| |
Collapse
|
33
|
Yeung CS, Beck T, Posma JM. MetaboListem and TABoLiSTM: Two Deep Learning Algorithms for Metabolite Named Entity Recognition. Metabolites 2022; 12. [PMID: 35448463 DOI: 10.3390/metabo12040276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2022] [Revised: 03/15/2022] [Accepted: 03/17/2022] [Indexed: 11/17/2022] Open
Abstract
Reviewing the metabolomics literature is becoming increasingly difficult because of the rapid expansion of relevant journal literature. Text-mining technologies are therefore needed to facilitate more efficient literature reviews. Here we contribute a standardised corpus of full-text publications from metabolomics studies and describe the development of two metabolite named entity recognition (NER) methods. These methods are based on Bidirectional Long Short-Term Memory (BiLSTM) networks and each incorporate different transfer learning techniques (for tokenisation and word embedding). Our first model (MetaboListem) follows prior methodology using GloVe word embeddings. Our second model exploits BERT and BioBERT for embedding and is named TABoLiSTM (Transformer-Affixed BiLSTM). The methods are trained on a novel corpus annotated using rule-based methods, and evaluated on manually annotated metabolomics articles. MetaboListem (F1-score 0.890, precision 0.892, recall 0.888) and TABoLiSTM (BioBERT version: F1-score 0.909, precision 0.926, recall 0.893) have achieved state-of-the-art performance on metabolite NER. A training corpus with full-text sentences from >1000 full-text Open Access metabolomics publications with 105,335 annotated metabolites was created, as well as a manually annotated test corpus (19,138 annotations). This work demonstrates that deep learning algorithms are capable of identifying metabolite names accurately and efficiently in text. The proposed corpus and NER algorithms can be used for metabolomics text-mining tasks such as information retrieval, document classification and literature-based discovery and are available from the omicsNLP GitHub repository.
Collapse
|
34
|
Liu HY, Han CJ, Xiong J, Li HY, Lei L, Liu BY. [Automatic labeling and extraction of terms in natural language processing in acupuncture clinical literature]. Zhongguo Zhen Jiu 2022; 42:327-331. [PMID: 35272414 DOI: 10.13703/j.0255-2930.20211107-k0002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 06/14/2023]
Abstract
The paper analyzes the specificity of term recognition in acupuncture clinical literature and compares the advantages and disadvantages of three named entity recognition (NER) methods adopted in the field of traditional Chinese medicine. It is believed that the bi-directional long short-term memory networks-conditional random fields (Bi LSTM-CRF) may communicate the context information and complete NER by using less feature rules. This model is suitable for term recognition in acupuncture clinical literature. Based on this model, it is proposed that the process of term recognition in acupuncture clinical literature should include 4 aspects, i.e. literature pretreatment, sequence labeling, model training and effect evaluation, which provides an approach to the terminological structurization in acupuncture clinical literature.
Collapse
Affiliation(s)
- Hua-Yun Liu
- Graduate School of Tianjin University of TCM, Tianjin 301617, China
| | - Chen-Jing Han
- Graduate School of Tianjin University of TCM, Tianjin 301617, China
| | - Jie Xiong
- Institute of Information on Traditional Chinese Medicine, China Academy of Chinese Medical Sciences
| | - Hai-Yan Li
- Institute of Information on Traditional Chinese Medicine, China Academy of Chinese Medical Sciences
| | - Lei Lei
- Institute of Information on Traditional Chinese Medicine, China Academy of Chinese Medical Sciences
| | - Bao-Yan Liu
- China Academy of Chinese Medical Sciences, Beijing 100700
| |
Collapse
|
35
|
Chew R, Wenger M, Guillory J, Nonnemaker J, Kim A. Identifying Electronic Nicotine Delivery System Brands and Flavors on Instagram: Natural Language Processing Analysis. J Med Internet Res 2022; 24:e30257. [PMID: 35040793 PMCID: PMC8808345 DOI: 10.2196/30257] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2021] [Revised: 11/01/2021] [Accepted: 11/21/2021] [Indexed: 01/30/2023] Open
Abstract
Background Electronic nicotine delivery system (ENDS) brands, such as JUUL, used social media as a key component of their marketing strategy, which led to massive sales growth from 2015 to 2018. During this time, ENDS use rapidly increased among youths and young adults, with flavored products being particularly popular among these groups. Objective The aim of our study is to develop a named entity recognition (NER) model to identify potential emerging vaping brands and flavors from Instagram post text. NER is a natural language processing task for identifying specific types of words (entities) in text based on the characteristics of the entity and surrounding words. Methods NER models were trained on a labeled data set of 2272 Instagram posts coded for ENDS brands and flavors. We compared three types of NER models—conditional random fields, a residual convolutional neural network, and a fine-tuned distilled bidirectional encoder representations from transformers (FTDB) network—to identify brands and flavors in Instagram posts with key model outcomes of precision, recall, and F1 scores. We used data from Nielsen scanner sales and Wikipedia to create benchmark dictionaries to determine whether brands from established ENDS brand and flavor lists were mentioned in the Instagram posts in our sample. To prevent overfitting, we performed 5-fold cross-validation and reported the mean and SD of the model validation metrics across the folds. Results For brands, the residual convolutional neural network exhibited the highest mean precision (0.797, SD 0.084), and the FTDB exhibited the highest mean recall (0.869, SD 0.103). For flavors, the FTDB exhibited both the highest mean precision (0.860, SD 0.055) and recall (0.801, SD 0.091). All NER models outperformed the benchmark brand and flavor dictionary look-ups on mean precision, recall, and F1. Comparing between the benchmark brand lists, the larger Wikipedia list outperformed the Nielsen list in both precision and recall. Conclusions Our findings suggest that NER models correctly identified ENDS brands and flavors in Instagram posts at rates competitive with, or better than, others in the published literature. Brands identified during manual annotation showed little overlap with those in Nielsen scanner data, suggesting that NER models may capture emerging brands with limited sales and distribution. NER models address the challenges of manual brand identification and can be used to support future infodemiology and infoveillance studies. Brands identified on social media should be cross-validated with Nielsen and other data sources to differentiate emerging brands that have become established from those with limited sales and distribution.
Collapse
Affiliation(s)
- Rob Chew
- Center for Data Science, RTI International, Research Triangle Park, NC, United States
| | - Michael Wenger
- Center for Data Science, RTI International, Research Triangle Park, NC, United States
| | - Jamie Guillory
- Center for Health Analytics, Media, and Policy, RTI International, Research Triangle Park, NC, United States
| | - James Nonnemaker
- Center for Health Analytics, Media, and Policy, RTI International, Research Triangle Park, NC, United States
| | - Annice Kim
- Center for Health Analytics, Media, and Policy, RTI International, Research Triangle Park, NC, United States
| |
Collapse
|
36
|
Wang J, Ren Y, Zhang Z, Xu H, Zhang Y. From Tokenization to Self-Supervision: Building a High-Performance Information Extraction System for Chemical Reactions in Patents. Front Res Metr Anal 2022; 6:691105. [PMID: 35005421 PMCID: PMC8727901 DOI: 10.3389/frma.2021.691105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2021] [Accepted: 11/02/2021] [Indexed: 11/28/2022] Open
Abstract
Chemical reactions and experimental conditions are fundamental information for chemical research and pharmaceutical applications. However, the latest information of chemical reactions is usually embedded in the free text of patents. The rapidly accumulating chemical patents urge automatic tools based on natural language processing (NLP) techniques for efficient and accurate information extraction. This work describes the participation of the Melax Tech team in the CLEF 2020—ChEMU Task of Chemical Reaction Extraction from Patent. The task consisted of two subtasks: (1) named entity recognition to identify compounds and different semantic roles in the chemical reaction and (2) event extraction to identify event triggers of chemical reaction and their relations with the semantic roles recognized in subtask 1. To build an end-to-end system with high performance, multiple strategies tailored to chemical patents were applied and evaluated, ranging from optimizing the tokenization, pre-training patent language models based on self-supervision, to domain knowledge-based rules. Our hybrid approaches combining different strategies achieved state-of-the-art results in both subtasks, with the top-ranked F1 of 0.957 for entity recognition and the top-ranked F1 of 0.9536 for event extraction, indicating that the proposed approaches are promising.
Collapse
Affiliation(s)
- Jingqi Wang
- Melax Technologies, Inc., Houston, TX, United States
| | - Yuankai Ren
- School of Medicine, Nantong University, Nantong, China
| | - Zhi Zhang
- School of Medicine, Nantong University, Nantong, China
| | - Hua Xu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, United States
| | - Yaoyun Zhang
- Melax Technologies, Inc., Houston, TX, United States
| |
Collapse
|
37
|
Wu Y, Liu Z, Wu L, Chen M, Tong W. BERT-Based Natural Language Processing of Drug Labeling Documents: A Case Study for Classifying Drug-Induced Liver Injury Risk. Front Artif Intell 2021; 4:729834. [PMID: 34939028 PMCID: PMC8685544 DOI: 10.3389/frai.2021.729834] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/23/2021] [Accepted: 11/17/2021] [Indexed: 11/16/2022] Open
Abstract
Background & Aims: The United States Food and Drug Administration (FDA) regulates a broad range of consumer products, which account for about 25% of the United States market. The FDA regulatory activities often involve producing and reading of a large number of documents, which is time consuming and labor intensive. To support regulatory science at FDA, we evaluated artificial intelligence (AI)-based natural language processing (NLP) of regulatory documents for text classification and compared deep learning-based models with a conventional keywords-based model. Methods: FDA drug labeling documents were used as a representative regulatory data source to classify drug-induced liver injury (DILI) risk by employing the state-of-the-art language model BERT. The resulting NLP-DILI classification model was statistically validated with both internal and external validation procedures and applied to the labeling data from the European Medicines Agency (EMA) for cross-agency application. Results: The NLP-DILI model developed using FDA labeling documents and evaluated by cross-validations in this study showed remarkable performance in DILI classification with a recall of 1 and a precision of 0.78. When cross-agency data were used to validate the model, the performance remained comparable, demonstrating that the model was portable across agencies. Results also suggested that the model was able to capture the semantic meanings of sentences in drug labeling. Conclusion: Deep learning-based NLP models performed well in DILI classification of drug labeling documents and learned the meanings of complex text in drug labeling. This proof-of-concept work demonstrated that using AI technologies to assist regulatory activities is a promising approach to modernize and advance regulatory science.
Collapse
Affiliation(s)
- Yue Wu
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, United States Food and Drug Administration, Jefferson, AR, United States
| | - Zhichao Liu
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, United States Food and Drug Administration, Jefferson, AR, United States
| | - Leihong Wu
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, United States Food and Drug Administration, Jefferson, AR, United States
| | - Minjun Chen
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, United States Food and Drug Administration, Jefferson, AR, United States
| | - Weida Tong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, United States Food and Drug Administration, Jefferson, AR, United States
| |
Collapse
|
38
|
Naderi N, Knafou J, Copara J, Ruch P, Teodoro D. Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora. Front Res Metr Anal 2021; 6:689803. [PMID: 34870074 PMCID: PMC8640190 DOI: 10.3389/frma.2021.689803] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2021] [Accepted: 10/11/2021] [Indexed: 11/13/2022] Open
Abstract
The health and life science domains are well known for their wealth of named entities found in large free text corpora, such as scientific literature and electronic health records. To unlock the value of such corpora, named entity recognition (NER) methods are proposed. Inspired by the success of transformer-based pretrained models for NER, we assess how individual and ensemble of deep masked language models perform across corpora of different health and life science domains-biology, chemistry, and medicine-available in different languages-English and French. Individual deep masked language models, pretrained on external corpora, are fined-tuned on task-specific domain and language corpora and ensembled using classical majority voting strategies. Experiments show statistically significant improvement of the ensemble models over an individual BERT-based baseline model, with an overall best performance of 77% macro F1-score. We further perform a detailed analysis of the ensemble results and show how their effectiveness changes according to entity properties, such as length, corpus frequency, and annotation consistency. The results suggest that the ensembles of deep masked language models are an effective strategy for tackling NER across corpora from the health and life science domains.
Collapse
Affiliation(s)
- Nona Naderi
- Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland.,Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Julien Knafou
- Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland.,Swiss Institute of Bioinformatics, Geneva, Switzerland.,Computer Science Department, University of Geneva, Geneva, Switzerland
| | - Jenny Copara
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland.,Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland.,Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Patrick Ruch
- Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland.,Swiss Institute of Bioinformatics, Geneva, Switzerland
| | - Douglas Teodoro
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland.,Information Science Department, University of Applied Sciences and Arts of Western Switzerland (HES-SO), Geneva, Switzerland.,Swiss Institute of Bioinformatics, Geneva, Switzerland
| |
Collapse
|
39
|
Wu H, Ji J, Tian H, Chen Y, Ge W, Zhang H, Yu F, Zou J, Nakamura M, Liao J. Chinese- Named Entity Recognition From Adverse Drug Event Records: Radical Embedding-Combined Dynamic Embedding-Based BERT in a Bidirectional Long Short-term Conditional Random Field (Bi-LSTM-CRF) Model. JMIR Med Inform 2021; 9:e26407. [PMID: 34855616 PMCID: PMC8686410 DOI: 10.2196/26407] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2020] [Revised: 04/22/2021] [Accepted: 10/05/2021] [Indexed: 12/17/2022] Open
Abstract
Background With the increasing variety of drugs, the incidence of adverse drug events (ADEs) is increasing year by year. Massive numbers of ADEs are recorded in electronic medical records and adverse drug reaction (ADR) reports, which are important sources of potential ADR information. Meanwhile, it is essential to make latent ADR information automatically available for better postmarketing drug safety reevaluation and pharmacovigilance. Objective This study describes how to identify ADR-related information from Chinese ADE reports. Methods Our study established an efficient automated tool, named BBC-Radical. BBC-Radical is a model that consists of 3 components: Bidirectional Encoder Representations from Transformers (BERT), bidirectional long short-term memory (bi-LSTM), and conditional random field (CRF). The model identifies ADR-related information from Chinese ADR reports. Token features and radical features of Chinese characters were used to represent the common meaning of a group of words. BERT and Bi-LSTM-CRF were novel models that combined these features to conduct named entity recognition (NER) tasks in the free-text section of 24,890 ADR reports from the Jiangsu Province Adverse Drug Reaction Monitoring Center from 2010 to 2016. Moreover, the man-machine comparison experiment on the ADE records from Drum Tower Hospital was designed to compare the NER performance between the BBC-Radical model and a manual method. Results The NER model achieved relatively high performance, with a precision of 96.4%, recall of 96.0%, and F1 score of 96.2%. This indicates that the performance of the BBC-Radical model (precision 87.2%, recall 85.7%, and F1 score 86.4%) is much better than that of the manual method (precision 86.1%, recall 73.8%, and F1 score 79.5%) in the recognition task of each kind of entity. Conclusions The proposed model was competitive in extracting ADR-related information from ADE reports, and the results suggest that the application of our method to extract ADR-related information is of great significance in improving the quality of ADR reports and postmarketing drug safety evaluation.
Collapse
Affiliation(s)
- Hong Wu
- School of Science, China Pharmaceutical University, Nanjing, China
| | - Jiatong Ji
- School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, China
| | - Haimei Tian
- School of Computer Engineering, Jinling Institute of Technology, Nanjing, China
| | - Yao Chen
- School of Science, China Pharmaceutical University, Nanjing, China
| | - Weihong Ge
- Department of Pharmacy, Nanjing Drum Tower Hospital, Nanjing, China
| | - Haixia Zhang
- Department of Pharmacy, Nanjing Drum Tower Hospital, Nanjing, China
| | - Feng Yu
- School of Basic Medicine and Clinical Pharmacy, China Pharmaceutical University, Nanjing, China
| | - Jianjun Zou
- Department of Clinical Pharmacology, Nanjing First Hospital, Nanjing Medical University, Nanjing, China
| | - Mitsuhiro Nakamura
- Laboratory of Drug Informatics, Gifu Pharmaceutical University, Gifu, Japan
| | - Jun Liao
- School of Science, China Pharmaceutical University, Nanjing, China
| |
Collapse
|
40
|
Larmande P, Liu Y, Yao X, Xia J. OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition. Genomics Inform 2021; 19:e27. [PMID: 34638174 PMCID: PMC8510865 DOI: 10.5808/gi.21015] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2021] [Accepted: 07/27/2021] [Indexed: 12/02/2022] Open
Abstract
Due to the rapid evolution of high-throughput technologies, a tremendous amount of data is being produced in the biological domain, which poses a challenging task for information extraction and natural language understanding. Biological named entity recognition (NER) and named entity normalisation (NEN) are two common tasks aiming at identifying and linking biologically important entities such as genes or gene products mentioned in the literature to biological databases. In this paper, we present an updated version of OryzaGP, a gene and protein dataset for rice species created to help natural language processing (NLP) tools in processing NER and NEN tasks. To create the dataset, we selected more than 15,000 abstracts associated with articles previously curated for rice genes. We developed four dictionaries of gene and protein names associated with database identifiers. We used these dictionaries to annotate the dataset. We also annotated the dataset using pre-trained NLP models. Finally, we analysed the annotation results and discussed how to improve OryzaGP.
Collapse
Affiliation(s)
- Pierre Larmande
- DIADE, Univ. Montpellier, IRD, CIRAD, 34394 Montpellier, France.,French Institute of Bioinformatics (IFB)-South Green Bioinformatics Platform, Bioversity, CIRAD, INRAE, IRD, Montpellier F-34398, France
| | - Yusha Liu
- Hubei Provincial Key Laboratory of Agricultural Bioinformatics, College of informatics, Huazhong Agricultural University, Wuhan 430070, Hubei Province, China
| | - Xinzhi Yao
- Hubei Provincial Key Laboratory of Agricultural Bioinformatics, College of informatics, Huazhong Agricultural University, Wuhan 430070, Hubei Province, China
| | - Jingbo Xia
- Hubei Provincial Key Laboratory of Agricultural Bioinformatics, College of informatics, Huazhong Agricultural University, Wuhan 430070, Hubei Province, China
| |
Collapse
|
41
|
Lovis C, Rayson P. Social Media Monitoring of the COVID-19 Pandemic and Influenza Epidemic With Adaptation for Informal Language in Arabic Twitter Data: Qualitative Study. JMIR Med Inform 2021; 9:e27670. [PMID: 34346892 PMCID: PMC8451962 DOI: 10.2196/27670] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2021] [Revised: 04/20/2021] [Accepted: 06/20/2021] [Indexed: 02/05/2023] Open
Abstract
BACKGROUND Twitter is a real-time messaging platform widely used by people and organizations to share information on many topics. Systematic monitoring of social media posts (infodemiology or infoveillance) could be useful to detect misinformation outbreaks as well as to reduce reporting lag time and to provide an independent complementary source of data compared with traditional surveillance approaches. However, such an analysis is currently not possible in the Arabic-speaking world owing to a lack of basic building blocks for research and dialectal variation. OBJECTIVE We collected around 4000 Arabic tweets related to COVID-19 and influenza. We cleaned and labeled the tweets relative to the Arabic Infectious Diseases Ontology, which includes nonstandard terminology, as well as 11 core concepts and 21 relations. The aim of this study was to analyze Arabic tweets to estimate their usefulness for health surveillance, understand the impact of the informal terms in the analysis, show the effect of deep learning methods in the classification process, and identify the locations where the infection is spreading. METHODS We applied the following multilabel classification techniques: binary relevance, classifier chains, label power set, adapted algorithm (multilabel adapted k-nearest neighbors [MLKNN]), support vector machine with naive Bayes features (NBSVM), bidirectional encoder representations from transformers (BERT), and AraBERT (transformer-based model for Arabic language understanding) to identify tweets appearing to be from infected individuals. We also used named entity recognition to predict the place names mentioned in the tweets. RESULTS We achieved an F1 score of up to 88% in the influenza case study and 94% in the COVID-19 one. Adapting for nonstandard terminology and informal language helped to improve accuracy by as much as 15%, with an average improvement of 8%. Deep learning methods achieved an F1 score of up to 94% during the classifying process. Our geolocation detection algorithm had an average accuracy of 54% for predicting the location of users according to tweet content. CONCLUSIONS This study identified two Arabic social media data sets for monitoring tweets related to influenza and COVID-19. It demonstrated the importance of including informal terms, which are regularly used by social media users, in the analysis. It also proved that BERT achieves good results when used with new terms in COVID-19 tweets. Finally, the tweet content may contain useful information to determine the location of disease spread.
Collapse
Affiliation(s)
| | - Paul Rayson
- School of Computing and Communications, Lancaster University, InfoLab21, Lancaster, GB
| |
Collapse
|
42
|
Noh J, Kavuluru R. Joint Learning for Biomedical NER and Entity Normalization: Encoding Schemes, Counterfactual Examples, and Zero-Shot Evaluation. ACM BCB 2021; 2021. [PMID: 34505115 DOI: 10.1145/3459930.3469533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 10/20/2022]
Abstract
Named entity recognition (NER) and normalization (EN) form an indispensable first step to many biomedical natural language processing applications. In biomedical information science, recognizing entities (e.g., genes, diseases, or drugs) and normalizing them to concepts in standard terminologies or thesauri (e.g., Entrez, ICD-10, or RxNorm) is crucial for identifying more informative relations among them that drive disease etiology, progression, and treatment. In this effort we pursue two high level strategies to improve biomedical ER and EN. The first is to decouple standard entity encoding tags (e.g., "B-Drug" for the beginning of a drug) into type tags (e.g., "Drug") and positional tags (e.g., "B"). A second strategy is to use additional counterfactual training examples to handle the issue of models learning spurious correlations between surrounding context and normalized concepts in training data. We conduct elaborate experiments using the MedMentions dataset, the largest dataset of its kind for ER and EN in biomedicine. We find that our first strategy performs better in entity normalization when compared with the standard coding scheme. The second data augmentation strategy uniformly improves performance in span detection, typing, and normalization. The gains from counterfactual examples are more prominent when evaluating in zero-shot settings, for concepts that have never been encountered during training.
Collapse
Affiliation(s)
- Jiho Noh
- Department of Computer Science, University of Kentucky, Lexington, Kentucky, USA
| | - Ramakanth Kavuluru
- Division of Biomedical Informatics (Internal Medicine), University of Kentucky, Lexington, Kentucky, USA
| |
Collapse
|
43
|
Hong Z, Pauloski JG, Ward L, Chard K, Blaiszik B, Foster I. Models and Processes to Extract Drug-like Molecules From Natural Language Text. Front Mol Biosci 2021; 8:636077. [PMID: 34527701 PMCID: PMC8435623 DOI: 10.3389/fmolb.2021.636077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/30/2020] [Accepted: 08/11/2021] [Indexed: 11/28/2022] Open
Abstract
Researchers worldwide are seeking to repurpose existing drugs or discover new drugs to counter the disease caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). A promising source of candidates for such studies is molecules that have been reported in the scientific literature to be drug-like in the context of viral research. However, this literature is too large for human review and features unusual vocabularies for which existing named entity recognition (NER) models are ineffective. We report here on a project that leverages both human and artificial intelligence to detect references to such molecules in free text. We present 1) a iterative model-in-the-loop method that makes judicious use of scarce human expertise in generating training data for a NER model, and 2) the application and evaluation of this method to the problem of identifying drug-like molecules in the COVID-19 Open Research Dataset Challenge (CORD-19) corpus of 198,875 papers. We show that by repeatedly presenting human labelers only with samples for which an evolving NER model is uncertain, our human-machine hybrid pipeline requires only modest amounts of non-expert human labeling time (tens of hours to label 1778 samples) to generate an NER model with an F-1 score of 80.5%-on par with that of non-expert humans-and when applied to CORD'19, identifies 10,912 putative drug-like molecules. This enriched the computational screening team's targets by 3,591 molecules, of which 18 ranked in the top 0.1% of all 6.6 million molecules screened for docking against the 3CLPro protein.
Collapse
Affiliation(s)
- Zhi Hong
- Department of Computer Science, University of Chicago, Chicago, IL, United States
| | - J. Gregory Pauloski
- Department of Computer Science, University of Chicago, Chicago, IL, United States
| | - Logan Ward
- Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, United States
| | - Kyle Chard
- Department of Computer Science, University of Chicago, Chicago, IL, United States
- Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, United States
- Globus, University of Chicago, Chicago, IL, United States
| | - Ben Blaiszik
- Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, United States
- Globus, University of Chicago, Chicago, IL, United States
| | - Ian Foster
- Department of Computer Science, University of Chicago, Chicago, IL, United States
- Data Science and Learning Division, Argonne National Laboratory, Lemont, IL, United States
- Globus, University of Chicago, Chicago, IL, United States
| |
Collapse
|
44
|
Wang H, Yeung WLK, Ng QX, Tung A, Tay JAM, Ryanputra D, Ong MEH, Feng M, Arulanandam S. A Weakly-Supervised Named Entity Recognition Machine Learning Approach for Emergency Medical Services Clinical Audit. Int J Environ Res Public Health 2021; 18:ijerph18157776. [PMID: 34360065 PMCID: PMC8345494 DOI: 10.3390/ijerph18157776] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/01/2021] [Revised: 07/19/2021] [Accepted: 07/21/2021] [Indexed: 11/16/2022]
Abstract
Clinical performance audits are routinely performed in Emergency Medical Services (EMS) to ensure adherence to treatment protocols, to identify individual areas of weakness for remediation, and to discover systemic deficiencies to guide the development of the training syllabus. At present, these audits are performed by manual chart review, which is time-consuming and laborious. In this paper, we report a weakly-supervised machine learning approach to train a named entity recognition model that can be used for automatic EMS clinical audits. The dataset used in this study contained 58,898 unlabeled ambulance incidents encountered by the Singapore Civil Defence Force from 1st April 2019 to 30th June 2019. With only 5% labeled data, we successfully trained three different models to perform the NER task, achieving F1 scores of around 0.981 under entity type matching evaluation and around 0.976 under strict evaluation. The BiLSTM-CRF model was 1~2 orders of magnitude lighter and faster than our BERT-based models. Our proposed proof-of-concept approach may improve the efficiency of clinical audits and can also help with EMS database research. Further external validation of this approach is needed.
Collapse
Affiliation(s)
- Han Wang
- Saw Swee Hock School of Public Health, National University of Singapore, Singapore 117549, Singapore;
| | - Wesley Lok Kin Yeung
- Singapore Civil Defence Force, Singapore 408827, Singapore; (W.L.K.Y.); (Q.X.N.); (A.T.); (J.A.M.T.); (D.R.); (S.A.)
- National University Hospital, National University Health System, Singapore 119074, Singapore
| | - Qin Xiang Ng
- Singapore Civil Defence Force, Singapore 408827, Singapore; (W.L.K.Y.); (Q.X.N.); (A.T.); (J.A.M.T.); (D.R.); (S.A.)
| | - Angeline Tung
- Singapore Civil Defence Force, Singapore 408827, Singapore; (W.L.K.Y.); (Q.X.N.); (A.T.); (J.A.M.T.); (D.R.); (S.A.)
- Home Team Science & Technology Agency, Singapore 329560, Singapore
| | - Joey Ai Meng Tay
- Singapore Civil Defence Force, Singapore 408827, Singapore; (W.L.K.Y.); (Q.X.N.); (A.T.); (J.A.M.T.); (D.R.); (S.A.)
| | - Davin Ryanputra
- Singapore Civil Defence Force, Singapore 408827, Singapore; (W.L.K.Y.); (Q.X.N.); (A.T.); (J.A.M.T.); (D.R.); (S.A.)
- National University Hospital, National University Health System, Singapore 119074, Singapore
| | - Marcus Eng Hock Ong
- Health Services Research Centre, Singapore Health Services, Singapore 169856, Singapore;
- Health Services and Systems Research, Duke-NUS Medical School, National University of Singapore, Singapore 169857, Singapore
- Department of Emergency Medicine, Singapore General Hospital, Singapore 169608, Singapore
| | - Mengling Feng
- Saw Swee Hock School of Public Health, National University of Singapore, Singapore 117549, Singapore;
- Institute of Data Science, National University of Singapore, Singapore 117602, Singapore
- Correspondence: ; Tel.: +65-985-038-11
| | - Shalini Arulanandam
- Singapore Civil Defence Force, Singapore 408827, Singapore; (W.L.K.Y.); (Q.X.N.); (A.T.); (J.A.M.T.); (D.R.); (S.A.)
| |
Collapse
|
45
|
Hu D, Zhang H, Li S, Wang Y, Wu N, Lu X. Automatic Extraction of Lung Cancer Staging Information From Computed Tomography Reports: Deep Learning Approach. JMIR Med Inform 2021; 9:e27955. [PMID: 34287213 PMCID: PMC8339987 DOI: 10.2196/27955] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2021] [Revised: 05/27/2021] [Accepted: 06/07/2021] [Indexed: 01/04/2023] Open
Abstract
BACKGROUND Lung cancer is the leading cause of cancer deaths worldwide. Clinical staging of lung cancer plays a crucial role in making treatment decisions and evaluating prognosis. However, in clinical practice, approximately one-half of the clinical stages of lung cancer patients are inconsistent with their pathological stages. As one of the most important diagnostic modalities for staging, chest computed tomography (CT) provides a wealth of information about cancer staging, but the free-text nature of the CT reports obstructs their computerization. OBJECTIVE We aimed to automatically extract the staging-related information from CT reports to support accurate clinical staging of lung cancer. METHODS In this study, we developed an information extraction (IE) system to extract the staging-related information from CT reports. The system consisted of the following three parts: named entity recognition (NER), relation classification (RC), and postprocessing (PP). We first summarized 22 questions about lung cancer staging based on the TNM staging guideline. Next, three state-of-the-art NER algorithms were implemented to recognize the entities of interest. Next, we designed a novel RC method using the relation sign constraint (RSC) to classify the relations between entities. Finally, a rule-based PP module was established to obtain the formatted answers using the results of NER and RC. RESULTS We evaluated the developed IE system on a clinical data set containing 392 chest CT reports collected from the Department of Thoracic Surgery II in the Peking University Cancer Hospital. The experimental results showed that the bidirectional encoder representation from transformers (BERT) model outperformed the iterated dilated convolutional neural networks-conditional random field (ID-CNN-CRF) and bidirectional long short-term memory networks-conditional random field (Bi-LSTM-CRF) for NER tasks with macro-F1 scores of 80.97% and 90.06% under the exact and inexact matching schemes, respectively. For the RC task, the proposed RSC showed better performance than the baseline methods. Further, the BERT-RSC model achieved the best performance with a macro-F1 score of 97.13% and a micro-F1 score of 98.37%. Moreover, the rule-based PP module could correctly obtain the formatted results using the extractions of NER and RC, achieving a macro-F1 score of 94.57% and a micro-F1 score of 96.74% for all the 22 questions. CONCLUSIONS We conclude that the developed IE system can effectively and accurately extract information about lung cancer staging from CT reports. Experimental results show that the extracted results have significant potential for further use in stage verification and prediction to facilitate accurate clinical staging.
Collapse
Affiliation(s)
- Danqing Hu
- College of Biomedical Engineering and Instrumental Science, Zhejiang University, Hangzhou, China
- Key Laboratory for Biomedical Engineering, Ministry of Education, Hangzhou, China
| | - Huanyao Zhang
- College of Biomedical Engineering and Instrumental Science, Zhejiang University, Hangzhou, China
- Key Laboratory for Biomedical Engineering, Ministry of Education, Hangzhou, China
| | - Shaolei Li
- Department of Thoracic Surgery II, Peking University Cancer Hospital & Institute, Beijing, China
| | - Yuhong Wang
- College of Biomedical Engineering and Instrumental Science, Zhejiang University, Hangzhou, China
- Key Laboratory for Biomedical Engineering, Ministry of Education, Hangzhou, China
| | - Nan Wu
- Department of Thoracic Surgery II, Peking University Cancer Hospital & Institute, Beijing, China
| | - Xudong Lu
- College of Biomedical Engineering and Instrumental Science, Zhejiang University, Hangzhou, China
- Key Laboratory for Biomedical Engineering, Ministry of Education, Hangzhou, China
| |
Collapse
|
46
|
Li J, Zhou Y, Jiang X, Natarajan K, Pakhomov SV, Liu H, Xu H. Are synthetic clinical notes useful for real natural language processing tasks: A case study on clinical entity recognition. J Am Med Inform Assoc 2021; 28:2193-2201. [PMID: 34272955 DOI: 10.1093/jamia/ocab112] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2020] [Revised: 05/09/2021] [Accepted: 06/07/2021] [Indexed: 11/13/2022] Open
Abstract
OBJECTIVE : Developing clinical natural language processing systems often requires access to many clinical documents, which are not widely available to the public due to privacy and security concerns. To address this challenge, we propose to develop methods to generate synthetic clinical notes and evaluate their utility in real clinical natural language processing tasks. MATERIALS AND METHODS : We implemented 4 state-of-the-art text generation models, namely CharRNN, SegGAN, GPT-2, and CTRL, to generate clinical text for the History and Present Illness section. We then manually annotated clinical entities for randomly selected 500 History and Present Illness notes generated from the best-performing algorithm. To compare the utility of natural and synthetic corpora, we trained named entity recognition (NER) models from all 3 corpora and evaluated their performance on 2 independent natural corpora. RESULTS : Our evaluation shows GPT-2 achieved the best BLEU (bilingual evaluation understudy) score (with a BLEU-2 of 0.92). NER models trained on synthetic corpus generated by GPT-2 showed slightly better performance on 2 independent corpora: strict F1 scores of 0.709 and 0.748, respectively, when compared with the NER models trained on natural corpus (F1 scores of 0.706 and 0.737, respectively), indicating the good utility of synthetic corpora in clinical NER model development. In addition, we also demonstrated that an augmented method that combines both natural and synthetic corpora achieved better performance than that uses the natural corpus only. CONCLUSIONS : Recent advances in text generation have made it possible to generate synthetic clinical notes that could be useful for training NER models for information extraction from natural clinical notes, thus lowering the privacy concern and increasing data availability. Further investigation is needed to apply this technology to practice.
Collapse
Affiliation(s)
- Jianfu Li
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Yujia Zhou
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Xiaoqian Jiang
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Karthik Natarajan
- Department of Biomedical Informatics, Columbia University, New York, USA
| | | | - Hongfang Liu
- Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
| | - Hua Xu
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, Texas, USA
| |
Collapse
|
47
|
Du J, Xiang Y, Sankaranarayanapillai M, Zhang M, Wang J, Si Y, Pham HA, Xu H, Chen Y, Tao C. Extracting postmarketing adverse events from safety reports in the vaccine adverse event reporting system (VAERS) using deep learning. J Am Med Inform Assoc 2021; 28:1393-1400. [PMID: 33647938 DOI: 10.1093/jamia/ocab014] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2020] [Revised: 01/14/2021] [Accepted: 01/20/2021] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE Automated analysis of vaccine postmarketing surveillance narrative reports is important to understand the progression of rare but severe vaccine adverse events (AEs). This study implemented and evaluated state-of-the-art deep learning algorithms for named entity recognition to extract nervous system disorder-related events from vaccine safety reports. MATERIALS AND METHODS We collected Guillain-Barré syndrome (GBS) related influenza vaccine safety reports from the Vaccine Adverse Event Reporting System (VAERS) from 1990 to 2016. VAERS reports were selected and manually annotated with major entities related to nervous system disorders, including, investigation, nervous_AE, other_AE, procedure, social_circumstance, and temporal_expression. A variety of conventional machine learning and deep learning algorithms were then evaluated for the extraction of the above entities. We further pretrained domain-specific BERT (Bidirectional Encoder Representations from Transformers) using VAERS reports (VAERS BERT) and compared its performance with existing models. RESULTS AND CONCLUSIONS Ninety-one VAERS reports were annotated, resulting in 2512 entities. The corpus was made publicly available to promote community efforts on vaccine AEs identification. Deep learning-based methods (eg, bi-long short-term memory and BERT models) outperformed conventional machine learning-based methods (ie, conditional random fields with extensive features). The BioBERT large model achieved the highest exact match F-1 scores on nervous_AE, procedure, social_circumstance, and temporal_expression; while VAERS BERT large models achieved the highest exact match F-1 scores on investigation and other_AE. An ensemble of these 2 models achieved the highest exact match microaveraged F-1 score at 0.6802 and the second highest lenient match microaveraged F-1 score at 0.8078 among peer models.
Collapse
Affiliation(s)
- Jingcheng Du
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Yang Xiang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | | | - Meng Zhang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Jingqi Wang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Yuqi Si
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Huy Anh Pham
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Yong Chen
- Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania, USA
| | - Cui Tao
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| |
Collapse
|
48
|
Mahendran D, Gurdin G, Lewinski N, Tang C, McInnes BT. Identifying Chemical Reactions and Their Associated Attributes in Patents. Front Res Metr Anal 2021; 6:688353. [PMID: 34322654 PMCID: PMC8312343 DOI: 10.3389/frma.2021.688353] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2021] [Accepted: 05/31/2021] [Indexed: 11/13/2022] Open
Abstract
Chemical patents are an essential source of information about novel chemicals and chemical reactions. However, with the increasing volume of such patents, mining information about these chemicals and chemical reactions has become a time-intensive and laborious endeavor. In this study, we present a system to extract chemical reaction events from patents automatically. Our approach consists of two steps: 1) named entity recognition (NER)-the automatic identification of chemical reaction parameters from the corresponding text, and 2) event extraction (EE)-the automatic classifying and linking of entities based on their relationships to each other. For our NER system, we evaluate bidirectional long short-term memory (BiLSTM)-based and bidirectional encoder representations from transformer (BERT)-based methods. For our EE system, we evaluate BERT-based, convolutional neural network (CNN)-based, and rule-based methods. We evaluate our NER and EE components independently and as an end-to-end system, reporting the precision, recall, and F 1 score. Our results show that the BiLSTM-based method performed best at identifying the entities, and the CNN-based method performed best at extracting events.
Collapse
Affiliation(s)
- Darshini Mahendran
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| | - Gabrielle Gurdin
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| | - Nastassja Lewinski
- Department of Life Science and Chemical Engineering, Virginia Commonwealth University, Richmond, VA, United States
| | - Christina Tang
- Department of Life Science and Chemical Engineering, Virginia Commonwealth University, Richmond, VA, United States
| | - Bridget T McInnes
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, United States
| |
Collapse
|
49
|
Zhang Y, Zhang Y, Qi P, Manning CD, Langlotz CP. Biomedical and clinical English model packages for the Stanza Python NLP library. J Am Med Inform Assoc 2021; 28:1892-1899. [PMID: 34157094 PMCID: PMC8363782 DOI: 10.1093/jamia/ocab090] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2021] [Revised: 04/05/2021] [Accepted: 05/03/2021] [Indexed: 11/13/2022] Open
Abstract
Objective The study sought to develop and evaluate neural natural language processing (NLP) packages for the syntactic analysis and named entity recognition of biomedical and clinical English text. Materials and Methods We implement and train biomedical and clinical English NLP pipelines by extending the widely used Stanza library originally designed for general NLP tasks. Our models are trained with a mix of public datasets such as the CRAFT treebank as well as with a private corpus of radiology reports annotated with 5 radiology-domain entities. The resulting pipelines are fully based on neural networks, and are able to perform tokenization, part-of-speech tagging, lemmatization, dependency parsing, and named entity recognition for both biomedical and clinical text. We compare our systems against popular open-source NLP libraries such as CoreNLP and scispaCy, state-of-the-art models such as the BioBERT models, and winning systems from the BioNLP CRAFT shared task. Results For syntactic analysis, our systems achieve much better performance compared with the released scispaCy models and CoreNLP models retrained on the same treebanks, and are on par with the winning system from the CRAFT shared task. For NER, our systems substantially outperform scispaCy, and are better or on par with the state-of-the-art performance from BioBERT, while being much more computationally efficient. Conclusions We introduce biomedical and clinical NLP packages built for the Stanza library. These packages offer performance that is similar to the state of the art, and are also optimized for ease of use. To facilitate research, we make all our models publicly available. We also provide an online demonstration (http://stanza.run/bio).
Collapse
Affiliation(s)
- Yuhao Zhang
- Biomedical Informatics Training Program, Stanford University, Stanford, California, USA
| | - Yuhui Zhang
- Computer Science Department, Stanford University, Stanford, California, USA
| | - Peng Qi
- Computer Science Department, Stanford University, Stanford, California, USA
| | - Christopher D Manning
- Computer Science and Linguistics Departments, Stanford University, Stanford, California, USA
| | - Curtis P Langlotz
- Department of Radiology, Stanford University, Stanford, California, USA
| |
Collapse
|
50
|
Jia Q, Zhang D, Xu H, Xie Y. Extraction of Traditional Chinese Medicine Entity: Design of a Novel Span-Level Named Entity Recognition Method With Distant Supervision. JMIR Med Inform 2021; 9:e28219. [PMID: 34125076 PMCID: PMC8240806 DOI: 10.2196/28219] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2021] [Revised: 04/14/2021] [Accepted: 04/19/2021] [Indexed: 11/23/2022] Open
Abstract
Background Traditional Chinese medicine (TCM) clinical records contain the symptoms of patients, diagnoses, and subsequent treatment of doctors. These records are important resources for research and analysis of TCM diagnosis knowledge. However, most of TCM clinical records are unstructured text. Therefore, a method to automatically extract medical entities from TCM clinical records is indispensable. Objective Training a medical entity extracting model needs a large number of annotated corpus. The cost of annotated corpus is very high and there is a lack of gold-standard data sets for supervised learning methods. Therefore, we utilized distantly supervised named entity recognition (NER) to respond to the challenge. Methods We propose a span-level distantly supervised NER approach to extract TCM medical entity. It utilizes the pretrained language model and a simple multilayer neural network as classifier to detect and classify entity. We also designed a negative sampling strategy for the span-level model. The strategy randomly selects negative samples in every epoch and filters the possible false-negative samples periodically. It reduces the bad influence from the false-negative samples. Results We compare our methods with other baseline methods to illustrate the effectiveness of our method on a gold-standard data set. The F1 score of our method is 77.34 and it remarkably outperforms the other baselines. Conclusions We developed a distantly supervised NER approach to extract medical entity from TCM clinical records. We estimated our approach on a TCM clinical record data set. Our experimental results indicate that the proposed approach achieves a better performance than other baselines.
Collapse
Affiliation(s)
- Qi Jia
- School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China.,Beijing Key Laboratory of Knowledge Engineering for Materials Science, Beijing, China
| | - Dezheng Zhang
- School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China.,Beijing Key Laboratory of Knowledge Engineering for Materials Science, Beijing, China
| | - Haifeng Xu
- School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China.,Beijing Key Laboratory of Knowledge Engineering for Materials Science, Beijing, China
| | - Yonghong Xie
- School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing, China.,Beijing Key Laboratory of Knowledge Engineering for Materials Science, Beijing, China
| |
Collapse
|