Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

Total Articles: 90 (from Reference Citation Analysis)

Article PDFs (28)

Cited by > 0 (52)

Searched Name: named entity recognition

Citation Analysis

Wang M, Vijayaraghavan A, Beck T, Posma JM. Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition. J Proteome Res 2024. [PMID: 38733346 DOI: 10.1021/acs.jproteome.3c00367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/13/2024]

Zhang G, Zhou Y, Hu Y, Xu H, Weng C, Peng Y. A span-based model for extracting overlapping PICO entities from randomized controlled trial publications. J Am Med Inform Assoc 2024;31:1163-1171. [PMID: 38471120 PMCID: PMC11031223 DOI: 10.1093/jamia/ocae065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 02/20/2024] [Accepted: 03/11/2024] [Indexed: 03/14/2024] Open

Gérardin C, Xiong Y, Wajsbürt P, Carrat F, Tannier X. Impact of Translation on Biomedical Information Extraction: Experiment on Real-Life Clinical Notes. JMIR Med Inform 2024;12:e49607. [PMID: 38596859 DOI: 10.2196/49607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/03/2023] [Revised: 01/07/2024] [Accepted: 01/10/2024] [Indexed: 03/03/2024] Open

Iscoe M, Socrates V, Gilson A, Chi L, Li H, Huang T, Kearns T, Perkins R, Khandjian L, Taylor RA. Identifying signs and symptoms of urinary tract infection from emergency department clinical notes using large language models. Acad Emerg Med 2024. [PMID: 38567658 DOI: 10.1111/acem.14883] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Revised: 01/24/2024] [Accepted: 01/24/2024] [Indexed: 04/04/2024]

Abstract

BACKGROUND

Natural language processing (NLP) tools including recently developed large language models (LLMs) have myriad potential applications in medical care and research, including the efficient labeling and classification of unstructured text such as electronic health record (EHR) notes. This opens the door to large-scale projects that rely on variables that are not typically recorded in a structured form, such as patient signs and symptoms.

OBJECTIVES

This study is designed to acquaint the emergency medicine research community with the foundational elements of NLP, highlighting essential terminology, annotation methodologies, and the intricacies involved in training and evaluating NLP models. Symptom characterization is critical to urinary tract infection (UTI) diagnosis, but identification of symptoms from the EHR has historically been challenging, limiting large-scale research, public health surveillance, and EHR-based clinical decision support. We therefore developed and compared two NLP models to identify UTI symptoms from unstructured emergency department (ED) notes.

METHODS

The study population consisted of patients aged ≥ 18 who presented to an ED in a northeastern U.S. health system between June 2013 and August 2021 and had a urinalysis performed. We annotated a random subset of 1250 ED clinician notes from these visits for a list of 17 UTI symptoms. We then developed two task-specific LLMs to perform the task of named entity recognition: a convolutional neural network-based model (SpaCy) and a transformer-based model designed to process longer documents (Clinical Longformer). Models were trained on 1000 notes and tested on a holdout set of 250 notes. We compared model performance (precision, recall, F1 measure) at identifying the presence or absence of UTI symptoms at the note level.

RESULTS

A total of 8135 entities were identified in 1250 notes; 83.6% of notes included at least one entity. Overall F1 measure for note-level symptom identification weighted by entity frequency was 0.84 for the SpaCy model and 0.88 for the Longformer model. F1 measure for identifying presence or absence of any UTI symptom in a clinical note was 0.96 (232/250 correctly classified) for the SpaCy model and 0.98 (240/250 correctly classified) for the Longformer model.

CONCLUSIONS

The study demonstrated the utility of LLMs and transformer-based models in particular for extracting UTI symptoms from unstructured ED clinical notes; models were highly accurate for detecting the presence or absence of any UTI symptom on the note level, with variable performance for individual symptoms.
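
The note-level evaluation described above (precision, recall, and F1 per symptom, then an overall F1 weighted by frequency) can be illustrated with a minimal sketch. The data below are invented, and weighting each symptom by the number of gold-standard notes that contain it is an assumption about how the weighting was done, not a detail taken from the abstract.

```python
def prf(gold: set, pred: set):
    """Precision, recall, and F1 for one symptom over sets of note IDs."""
    tp = len(gold & pred)  # notes where the symptom is both annotated and predicted
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def weighted_f1(gold_by_symptom: dict, pred_by_symptom: dict) -> float:
    """Per-symptom F1 averaged with weights proportional to gold frequency (assumed scheme)."""
    total = sum(len(notes) for notes in gold_by_symptom.values())
    score = 0.0
    for symptom, gold in gold_by_symptom.items():
        _, _, f1 = prf(gold, pred_by_symptom.get(symptom, set()))
        score += f1 * len(gold) / total
    return score


# Toy data (invented): IDs of notes in which each symptom is present.
gold = {"dysuria": {1, 2, 3, 4}, "fever": {2, 5}}
pred = {"dysuria": {1, 2, 3}, "fever": {2, 5, 6}}

for symptom in gold:
    print(symptom, prf(gold[symptom], pred[symptom]))
print("weighted F1:", weighted_f1(gold, pred))
```

Weighting by frequency means common symptoms dominate the overall score, which is one reason the abstract can report a high overall F1 alongside variable performance for individual symptoms.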


Li Z, Wei Q, Huang LC, Li J, Hu Y, Chuang YS, He J, Das A, Keloth VK, Yang Y, Diala CS, Roberts KE, Tao C, Jiang X, Zheng WJ, Xu H. Ensemble pretrained language models to extract biomedical knowledge from literature. J Am Med Inform Assoc 2024:ocae061. [PMID: 38520725 DOI: 10.1093/jamia/ocae061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 02/14/2024] [Accepted: 03/12/2024] [Indexed: 03/25/2024] Open

Affiliation(s)

Zhao Li McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Qiang Wei McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Liang-Chin Huang McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Jianfu Li McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Yan Hu McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Yao-Shun Chuang McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Jianping He McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Avisha Das McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Vipina Kuttichi Keloth Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States
Yuntao Yang McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Chiamaka S Diala McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Kirk E Roberts McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Cui Tao McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Xiaoqian Jiang McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
W Jim Zheng McWilliams School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX 77030, United States
Hua Xu Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT 06510, United States


Truhn D, Loeffler CM, Müller-Franzes G, Nebelung S, Hewitt KJ, Brandner S, Bressem KK, Foersch S, Kather JN. Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4). J Pathol 2024;262:310-319. [PMID: 38098169 DOI: 10.1002/path.6232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2023] [Revised: 09/16/2023] [Accepted: 11/03/2023] [Indexed: 02/06/2024]

Yao LFL, Liew K, Wakamiya S, Aramaki E. Extracting Spatio-Temporal Trends in Medical Research Prioritization Through Natural Language Processing of Case Report Abstracts. Stud Health Technol Inform 2024;310:634-638. [PMID: 38269886 DOI: 10.3233/shti231042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]

Neuraz A, Lerner I, Birot O, Arias C, Han L, Bonzel CL, Cai T, Huynh KT, Coulet A. TAXN: Translate Align Extract Normalize, a Multilingual Extraction Tool for Clinical Texts. Stud Health Technol Inform 2024;310:649-653. [PMID: 38269889 DOI: 10.3233/shti231045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]

Guo Y, Ge Y, Sarker A. Detection of Medication Mentions and Medication Change Events in Clinical Notes Using Transformer-Based Models. Stud Health Technol Inform 2024;310:685-689. [PMID: 38269896 DOI: 10.3233/shti231052] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]

Stevens M, Kennedy G, Churches T. Applying and Improving a Publicly Available Medication NER Pipeline in a Clinical Cancer EMR. Stud Health Technol Inform 2024;310:679-684. [PMID: 38269895 DOI: 10.3233/shti231051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/26/2024]

Zhou H, Austin R, Lu SC, Silverman GM, Zhou Y, Kilicoglu H, Xu H, Zhang R. Complementary and Integrative Health Information in the literature: its lexicon and named entity recognition. J Am Med Inform Assoc 2024;31:426-434. [PMID: 37952122 PMCID: PMC10797266 DOI: 10.1093/jamia/ocad216] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/15/2023] [Revised: 10/20/2023] [Accepted: 11/08/2023] [Indexed: 11/14/2023] Open

Yang J, Liu C, Deng W, Wu D, Weng C, Zhou Y, Wang K. Enhancing phenotype recognition in clinical notes using large language models: PhenoBCBERT and PhenoGPT. Patterns (N Y) 2024;5:100887. [PMID: 38264716 PMCID: PMC10801236 DOI: 10.1016/j.patter.2023.100887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Revised: 10/25/2023] [Accepted: 11/06/2023] [Indexed: 01/25/2024]

Sugimoto K, Wada S, Konishi S, Okada K, Manabe S, Matsumura Y, Takeda T. Extracting Clinical Information From Japanese Radiology Reports Using a 2-Stage Deep Learning Approach: Algorithm Development and Validation. JMIR Med Inform 2023;11:e49041. [PMID: 37991979 PMCID: PMC10686535 DOI: 10.2196/49041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Revised: 09/25/2023] [Accepted: 10/03/2023] [Indexed: 11/24/2023] Open

Yang J, Liu C, Deng W, Wu D, Weng C, Zhou Y, Wang K. Enhancing Phenotype Recognition in Clinical Notes Using Large Language Models: PhenoBCBERT and PhenoGPT. ArXiv 2023:arXiv:2308.06294v2. [PMID: 37986722 PMCID: PMC10659449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/22/2023]

Chen S, Lan X, Yu H. A social network analysis: mental health scales used during the COVID-19 pandemic. Front Psychiatry 2023;14:1199906. [PMID: 37706038 PMCID: PMC10495585 DOI: 10.3389/fpsyt.2023.1199906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2023] [Accepted: 08/11/2023] [Indexed: 09/15/2023] Open

Jiang Y, Kavuluru R. End-to-End n-ary Relation Extraction for Combination Drug Therapies. IEEE Int Conf Healthc Inform 2023;2023:72-80. [PMID: 38283165 PMCID: PMC10814995 DOI: 10.1109/ichi57859.2023.00021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/30/2024]

Sun Z, Tao C. Named Entity Recognition and Normalization for Alzheimer's Disease Eligibility Criteria. IEEE Int Conf Healthc Inform 2023;2023:558-564. [PMID: 38283164 PMCID: PMC10815931 DOI: 10.1109/ichi57859.2023.00100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/30/2024]

Abstract

Alzheimer's Disease (AD) is a complex neurodegenerative disorder that affects millions of people worldwide. Finding effective treatments for this disease is crucial. Clinical trials play an essential role in developing and testing new treatments for AD. However, identifying eligible participants can be challenging, time-consuming, and costly. In recent years, the development of natural language processing (NLP) techniques, specifically named entity recognition (NER) and named entity normalization (NEN), has helped to automate the identification and extraction of relevant information from the eligibility criteria (EC) more efficiently, in order to facilitate semi-automatic patient recruitment and enable data FAIRness for clinical trial data. Nevertheless, most current biomedical NER models only provide annotations for a restricted set of entity types that may not be applicable to the clinical trial data. Additionally, there are currently no established techniques for accurately performing NEN on entities that are negated using a negative prefix. In this paper, we introduce a pipeline designed for information extraction from AD clinical trial EC, which involves preprocessing of the EC data, clinical NER, and biomedical NEN to the Unified Medical Language System (UMLS). Our NER model can identify named entities in seven pre-defined categories, while our NEN model employs a combination of exact match and partial match search strategies, as well as customized rules to accurately normalize entities with negative prefixes. To evaluate the performance of our pipeline, we measured the precision, recall, and F1 score for the NER component, and we manually reviewed the top five mapping results produced by the NEN component. Our evaluation of the pipeline's performance revealed that it can successfully normalize named entities in clinical trial ECs with optimal accuracies.
The NER component achieved an overall F1 of 0.816, demonstrating its ability to accurately identify seven types of named entities in clinical text. The NEN component of the pipeline also demonstrated impressive performance, with customized rules and a combination of exact and partial match strategies leading to an accuracy of 0.940 for normalized entities.
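
The normalization strategy described above (exact match, partial match, and rules for negative prefixes) can be sketched as follows. This is only an illustration under stated assumptions: the vocabulary, the `TOY:` identifiers, the prefix list, and the token-overlap scoring are all hypothetical stand-ins, not the UMLS lookup or the rules used by the authors.

```python
# Hypothetical mini-vocabulary standing in for a UMLS lookup; IDs are invented.
VOCAB = {
    "smoker": "TOY:001",
    "controlled hypertension": "TOY:002",
    "dementia": "TOY:003",
}
NEGATIVE_PREFIXES = ("non-", "non", "un")  # assumed rule set, checked in order


def normalize(mention: str, vocab: dict = VOCAB):
    """Return (concept_id, negated, strategy) for an entity mention."""
    term = mention.lower().strip()
    # 1. Exact match against the vocabulary.
    if term in vocab:
        return vocab[term], False, "exact"
    # 2. Negative-prefix rule: strip the prefix and require an exact match on the rest.
    for prefix in NEGATIVE_PREFIXES:
        stem = term[len(prefix):]
        if term.startswith(prefix) and stem in vocab:
            return vocab[stem], True, "prefix"
    # 3. Partial match: pick the concept sharing the most tokens with the mention.
    tokens = set(term.split())
    best = max(vocab, key=lambda c: len(tokens & set(c.split())), default=None)
    if best is not None and tokens & set(best.split()):
        return vocab[best], False, "partial"
    return None, False, "none"


for mention in ["Smoker", "Non-smoker", "uncontrolled hypertension", "mild dementia", "aspirin"]:
    print(mention, "->", normalize(mention))
```

Requiring an exact match on the stripped stem keeps the prefix rule from firing on words that merely begin with the same letters (for example, a term starting with "un" that is not a negation).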


Wei J, Hu T, Dai J, Wang Z, Han P, Huang W. Research on named entity recognition of adverse drug reactions based on NLP and deep learning. Front Pharmacol 2023;14:1121796. [PMID: 37332351 PMCID: PMC10270322 DOI: 10.3389/fphar.2023.1121796] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/15/2022] [Accepted: 05/23/2023] [Indexed: 06/20/2023] Open

Abstract

Introduction: Adverse drug reactions (ADR) are directly related to public health and have become a focus of public and media attention. At present, a large number of ADR events have been reported on the Internet, but the mining and utilization of such information resources is insufficient. Named entity recognition (NER) underlies many natural language processing (NLP) tasks and aims to identify entities with specific meanings in natural language texts.
Methods: To identify entities in ADR event data more effectively, and so provide valuable health knowledge, this paper introduces ALBERT in the input presentation layer of the classic BiLSTM-CRF model and proposes an ADR named entity recognition method based on the ALBERT-BiLSTM-CRF model. Textual information about ADR on the website "Chinese medical information query platform" (https://www.dayi.org.cn) was collected by a crawler and used as research data, and the BIO method was used to label three types of entities, drug name (DRN), drug component (COM), and adverse drug reaction (ADR), to build a corpus. The words were then mapped to word vectors by the ALBERT module to obtain character-level semantic information, context coding was performed by the BiLSTM module, and label decoding was performed by the CRF module to predict the real label.
Results: Based on the constructed corpus, experimental comparisons were made with two classical models, namely BiLSTM-CRF and BERT-BiLSTM-CRF. The overall F₁ of our method is 91.19%, which is 1.5% and 1.37% higher than the other two models, respectively, and recognition performance for all three entity types is significantly improved, which proves the superiority of this method.
Discussion: The proposed method can be used effectively for NER on ADR information from the Internet, which provides a basis for extracting drug-related entity relationships and constructing a knowledge graph, thus playing a role in practical health systems such as intelligent diagnosis, risk reasoning, and automatic question answering.
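
The BIO labeling scheme mentioned in the Methods can be illustrated with a small sketch. The entity types DRN and ADR come from the abstract; the English example sentence, the token-level spans, and the helper name `to_bio` are assumptions made for illustration (the original corpus is Chinese and labeled at the character level).

```python
def to_bio(tokens, spans):
    """Convert entity spans to BIO tags.

    spans: list of (start, end_exclusive, label) over token indices.
    The first token of an entity gets B-<label>, the rest I-<label>, all others O.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Invented example sentence with one drug name and one adverse reaction.
tokens = ["Aspirin", "may", "cause", "stomach", "bleeding"]
spans = [(0, 1, "DRN"), (3, 5, "ADR")]

for token, tag in zip(tokens, to_bio(tokens, spans)):
    print(f"{token}\t{tag}")
```

A sequence tagger such as a BiLSTM-CRF is then trained to predict one of these tags per token, and the CRF layer can learn that an `I-` tag should not follow `O` or a `B-` tag of a different type.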


Xu Q, Zhou Y, Liao B, Xin Z, Xie W, Hu C, Luo A. Named Entity Recognition of Diabetes Online Health Community Data Using Multiple Machine Learning Models. Bioengineering (Basel) 2023;10:659. [PMID: 37370590 DOI: 10.3390/bioengineering10060659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Revised: 05/19/2023] [Accepted: 05/25/2023] [Indexed: 06/29/2023] Open

Affiliation(s)

Qian Xu Second Xiangya Hospital, Central South University, Changsha 410011, China School of Life Sciences, Central South University, Changsha 410013, China College of Computer Science and Engineering, Jishou University, Jishou 416000, China Key Laboratory of Medical Information Research, Central South University, College of Hunan Province, Changsha 410013, China Clinical Research Center for Cardiovascular Intelligent Healthcare in Hunan Province, Changsha 410011, China
Yue Zhou Second Xiangya Hospital, Central South University, Changsha 410011, China School of Life Sciences, Central South University, Changsha 410013, China Key Laboratory of Medical Information Research, Central South University, College of Hunan Province, Changsha 410013, China Clinical Research Center for Cardiovascular Intelligent Healthcare in Hunan Province, Changsha 410011, China
Bolin Liao College of Computer Science and Engineering, Jishou University, Jishou 416000, China
Zirui Xin Second Xiangya Hospital, Central South University, Changsha 410011, China Key Laboratory of Medical Information Research, Central South University, College of Hunan Province, Changsha 410013, China Clinical Research Center for Cardiovascular Intelligent Healthcare in Hunan Province, Changsha 410011, China
Wenzhao Xie Key Laboratory of Medical Information Research, Central South University, College of Hunan Province, Changsha 410013, China Clinical Research Center for Cardiovascular Intelligent Healthcare in Hunan Province, Changsha 410011, China
Chao Hu Big Data Institute, Central South University, Changsha 410011, China
Aijing Luo Second Xiangya Hospital, Central South University, Changsha 410011, China Key Laboratory of Medical Information Research, Central South University, College of Hunan Province, Changsha 410013, China Clinical Research Center for Cardiovascular Intelligent Healthcare in Hunan Province, Changsha 410011, China


Šuvalov H, Laur S, Kolde R. Information Extraction from Medical Texts with BERT Using Human-in-the-Loop Labeling. Stud Health Technol Inform 2023;302:831-832. [PMID: 37203510 DOI: 10.3233/shti230281] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/20/2023]

Sezgin E, Hussain SA, Rust S, Huang Y. Extracting Medical Information From Free-Text and Unstructured Patient-Generated Health Data Using Natural Language Processing Methods: Feasibility Study With Real-world Data. JMIR Form Res 2023;7:e43014. [PMID: 36881467 PMCID: PMC10031450 DOI: 10.2196/43014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Revised: 01/24/2023] [Accepted: 01/30/2023] [Indexed: 01/31/2023] Open

Abstract

BACKGROUND

Patient-generated health data (PGHD) captured via smart devices or digital health technologies can reflect an individual health journey. PGHD enables tracking and monitoring of personal health conditions, symptoms, and medications out of the clinic, which is crucial for self-care and shared clinical decisions. In addition to self-reported measures and structured PGHD (eg, self-screening, sensor-based biometric data), free-text and unstructured PGHD (eg, patient care note, medical diary) can provide a broader view of a patient's journey and health condition. Natural language processing (NLP) is used to process and analyze unstructured data to create meaningful summaries and insights, showing promise to improve the utilization of PGHD.

OBJECTIVE

Our aim is to understand and demonstrate the feasibility of an NLP pipeline to extract medication and symptom information from real-world patient and caregiver data.

METHODS

We report a secondary data analysis, using a data set collected from 24 parents of children with special health care needs (CSHCN) who were recruited via a nonrandom sampling approach. Participants used a voice-interactive app for 2 weeks, generating free-text patient notes (audio transcription or text entry). We built an NLP pipeline using a zero-shot approach (adaptive to low-resource settings). We used named entity recognition (NER) and medical ontologies (RXNorm and SNOMED CT [Systematized Nomenclature of Medicine Clinical Terms]) to identify medication and symptoms. Sentence-level dependency parse trees and part-of-speech tags were used to extract additional entity information using the syntactic properties of a note. We assessed the data; evaluated the pipeline with the patient notes; and reported the precision, recall, and F₁ scores.

RESULTS

In total, 87 patient notes are included (audio transcriptions n=78 and text entries n=9) from 24 parents who have at least one CSHCN. The participants were between the ages of 26 and 59 years. The majority were White (n=22, 92%), had more than one child (n=16, 67%), lived in Ohio (n=22, 92%), had mid- or upper-mid household income (n=15, 62.5%), and had higher level education (n=24, 58%). Out of 87 notes, 30 were drug and medication related, and 46 were symptom related. We captured medication instances (medication, unit, quantity, and date) and symptoms satisfactorily (precision >0.65, recall >0.77, F₁>0.72). These results indicate the potential when using NER and dependency parsing through an NLP pipeline on information extraction from unstructured PGHD.

CONCLUSIONS

The proposed NLP pipeline was found to be feasible for use with real-world unstructured PGHD to accomplish medication and symptom extraction. Unstructured PGHD can be leveraged to inform clinical decision-making, remote monitoring, and self-care including medical adherence and chronic disease management. With customizable information extraction methods using NER and medical ontologies, NLP models can feasibly extract a broad range of clinical information from unstructured PGHD in low-resource settings (eg, a limited number of patient notes or training data).
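
As a rough illustration of extracting medication instances and symptoms from a free-text note without task-specific training data, the sketch below uses a toy lexicon and a regular expression. It is a deliberately simplified stand-in: the word lists replace the RXNorm and SNOMED CT lookups, the regex replaces the dependency-parse step, and none of the names or patterns come from the study.

```python
import re

MEDICATIONS = {"ibuprofen", "amoxicillin"}  # toy stand-in for an RxNorm lookup
SYMPTOMS = {"fever", "cough", "rash"}       # toy stand-in for a SNOMED CT lookup
# Quantity followed by a unit, e.g. "200 mg" or "2.5 ml" (assumed unit list).
DOSE = re.compile(r"(\d+(?:\.\d+)?)\s*(mg|ml|mcg|g)\b", re.IGNORECASE)


def extract(note: str) -> dict:
    """Find lexicon hits and dose expressions in a single note."""
    words = re.findall(r"[a-z]+", note.lower())
    return {
        "medications": [w for w in words if w in MEDICATIONS],
        "symptoms": [w for w in words if w in SYMPTOMS],
        "doses": [(float(q), u.lower()) for q, u in DOSE.findall(note)],
    }


note = "Gave ibuprofen 200 mg after the fever came back."
print(extract(note))
```

A lexicon lookup like this cannot link a dose to the right medication when a note mentions several; that attachment problem is what syntactic information such as a dependency parse is meant to address.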


Frei J, Kramer F. German Medical Named Entity Recognition Model and Data Set Creation Using Machine Translation and Word Alignment: Algorithm Development and Validation. JMIR Form Res 2023;7:e39077. [PMID: 36853741 PMCID: PMC10015355 DOI: 10.2196/39077] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 04/27/2022] [Revised: 09/11/2022] [Accepted: 11/03/2022] [Indexed: 11/06/2022] Open

Abstract

BACKGROUND

Data mining in the field of medical data analysis often needs to rely solely on the processing of unstructured data to retrieve relevant data. For German natural language processing, few open medical neural named entity recognition (NER) models have been published before this work. A major issue can be attributed to the lack of German training data.

OBJECTIVE

We developed a synthetic data set and a novel German medical NER model for public access to demonstrate the feasibility of our approach. In order to bypass legal restrictions due to potential data leaks through model analysis, we did not make use of internal, proprietary data sets, which is a frequent veto factor for data set publication.

METHODS

The underlying German data set was retrieved by translation and word alignment of a public English data set. The data set served as a foundation for model training and evaluation. For demonstration purposes, our NER model follows a simple network architecture that is designed for low computational requirements.

RESULTS

The obtained data set consisted of 8599 sentences including 30,233 annotations. The model achieved a class frequency-averaged F₁ score of 0.82 on the test set after training across 7 different NER types. Artifacts in the synthesized data set with regard to translation and alignment induced by the proposed method were exposed. The annotation performance was evaluated on an external data set and measured in comparison with an existing baseline model that has been trained on a dedicated German data set in a traditional fashion. We discussed the drop in annotation performance on an external data set for our simple NER model. Our model is publicly available.

CONCLUSIONS

We demonstrated the feasibility of obtaining a data set and training a German medical NER model by the exclusive use of public training data through our suggested method. The discussion on the limitations of our approach includes ways to further mitigate remaining problems in future work.
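
The idea of building a German data set by translating an English one and carrying annotations across a word alignment can be sketched as below. The sentence pair, the alignment, the `Drug` label, and the function name `project_labels` are invented for illustration; a real pipeline would obtain the translation and alignment from dedicated tools.

```python
def project_labels(src_labels, alignment, tgt_len):
    """Project token labels from a source sentence onto its translation.

    alignment: list of (source_index, target_index) pairs.
    Target tokens with no aligned labeled source token stay "O".
    """
    tgt_labels = ["O"] * tgt_len
    for src_idx, tgt_idx in alignment:
        if src_labels[src_idx] != "O":
            tgt_labels[tgt_idx] = src_labels[src_idx]
    return tgt_labels


# Invented example: word order differs between the two languages.
src_tokens = ["The", "patient", "takes", "aspirin", "daily"]
src_labels = ["O", "O", "O", "Drug", "O"]
tgt_tokens = ["Der", "Patient", "nimmt", "täglich", "Aspirin"]
alignment = [(0, 0), (1, 1), (2, 2), (3, 4), (4, 3)]

tgt_labels = project_labels(src_labels, alignment, len(tgt_tokens))
print(list(zip(tgt_tokens, tgt_labels)))
```

Any wrong or missing alignment pair silently drops or misplaces a label, which is consistent with the translation and alignment artifacts the abstract reports in the synthesized data set.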


Ma X, Yu R, Gao C, Wei Z, Xia Y, Wang X, Liu H. Research on named entity recognition method of marine natural products based on attention mechanism. Front Chem 2023;11:958002. [PMID: 36846857 PMCID: PMC9944735 DOI: 10.3389/fchem.2023.958002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2022] [Accepted: 01/24/2023] [Indexed: 02/11/2023] Open

Li Y, Wehbe RM, Ahmad FS, Wang H, Luo Y. A comparative study of pretrained language models for long clinical text. J Am Med Inform Assoc 2023;30:340-347. [PMID: 36451266 PMCID: PMC9846675 DOI: 10.1093/jamia/ocac225] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2022] [Revised: 11/06/2022] [Accepted: 11/14/2022] [Indexed: 12/03/2022] Open

Abstract

OBJECTIVE

Clinical knowledge-enriched transformer models (eg, ClinicalBERT) have state-of-the-art results on clinical natural language processing (NLP) tasks. One of the core limitations of these transformer models is the substantial memory consumption of their full self-attention mechanism, which leads to performance degradation on long clinical texts. To overcome this, we propose to leverage long-sequence transformer models (eg, Longformer and BigBird), which extend the maximum input sequence length from 512 to 4096 tokens, to enhance the ability to model long-term dependencies in long clinical texts.

MATERIALS AND METHODS

Inspired by the success of long-sequence transformer models and the fact that clinical notes are mostly long, we introduce 2 domain-enriched language models, Clinical-Longformer and Clinical-BigBird, which are pretrained on a large-scale clinical corpus. We evaluate both language models using 10 baseline tasks including named entity recognition, question answering, natural language inference, and document classification tasks.

RESULTS

The results demonstrate that Clinical-Longformer and Clinical-BigBird consistently and significantly outperform ClinicalBERT and other short-sequence transformers in all 10 downstream tasks and achieve new state-of-the-art results.

DISCUSSION

Our pretrained language models provide the bedrock for clinical NLP using long texts. We have made our source code available at https://github.com/luoyuanlab/Clinical-Longformer, and the pretrained models available for public download at: https://huggingface.co/yikuan8/Clinical-Longformer.

CONCLUSION

This study demonstrates that clinical knowledge-enriched long-sequence transformers are able to learn long-term dependencies in long clinical text. Our methods can also inspire the development of other domain-enriched long-sequence transformers.
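The memory argument in the abstract above can be illustrated with a little arithmetic. The sketch below is not taken from the paper: it simply counts attention-score entries per head and layer, and the window size of 512 is a hypothetical value used as a simplified stand-in for the sparse attention patterns that long-sequence models rely on.

```python
def full_attention_entries(seq_len: int) -> int:
    """Scores computed by full self-attention: one per (query, key) token pair."""
    return seq_len * seq_len


def windowed_attention_entries(seq_len: int, window: int) -> int:
    """Upper bound when each token attends to at most `window` positions."""
    return seq_len * min(window, seq_len)


short_len, long_len = 512, 4096  # sequence lengths quoted in the abstract
window = 512                     # hypothetical window size (assumption)

# Growth in score entries when full attention is stretched from 512 to 4096 tokens.
growth = full_attention_entries(long_len) // full_attention_entries(short_len)

# Savings of a windowed pattern over full attention at 4096 tokens.
savings = full_attention_entries(long_len) // windowed_attention_entries(long_len, window)

print(growth, savings)
```

Under these assumptions, full attention grows quadratically with sequence length while the windowed count grows linearly, which is the motivation the abstract gives for moving to long-sequence architectures.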

Lee H, Jeong O. A Knowledge-Grounded Task-Oriented Dialogue System with Hierarchical Structure for Enhancing Knowledge Selection. Sensors (Basel) 2023;23:685. [PMID: 36679481 PMCID: PMC9864774 DOI: 10.3390/s23020685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/20/2022] [Revised: 01/04/2023] [Accepted: 01/04/2023] [Indexed: 06/17/2023]

Haisa G, Altenbek G. Multi-Task Learning Model for Kazakh Query Understanding. Sensors (Basel) 2022;22:9810. [PMID: 36560177 PMCID: PMC9785505 DOI: 10.3390/s22249810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2022] [Revised: 11/29/2022] [Accepted: 12/09/2022] [Indexed: 06/17/2023]

Azizi S, Hier DB, Wunsch II DC. Enhanced neurologic concept recognition using a named entity recognition model based on transformers. Front Digit Health 2022;4:1065581. [PMID: 36569804 PMCID: PMC9772022 DOI: 10.3389/fdgth.2022.1065581] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2022] [Accepted: 11/21/2022] [Indexed: 12/12/2022] Open

Ivanisenko TV, Demenkov PS, Kolchanov NA, Ivanisenko VA. The New Version of the ANDDigest Tool with Improved AI-Based Short Names Recognition. Int J Mol Sci 2022;23:14934. [PMID: 36499269 PMCID: PMC9738852 DOI: 10.3390/ijms232314934] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/14/2022] [Revised: 11/19/2022] [Accepted: 11/22/2022] [Indexed: 12/05/2022] Open

Liao T, Huang R, Zhang S, Duan S, Chen Y, Ma W, Chen X. Nested Named Entity Recognition Based on Dual Stream Feature Complementation. Entropy (Basel) 2022;24:1454. [PMID: 37420474 DOI: 10.3390/e24101454] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/21/2022] [Revised: 09/19/2022] [Accepted: 10/05/2022] [Indexed: 07/09/2023]

Luo L, Lai PT, Wei CH, Arighi CN, Lu Z. BioRED: a rich biomedical relation extraction dataset. Brief Bioinform 2022;23:6645993. [PMID: 35849818 PMCID: PMC9487702 DOI: 10.1093/bib/bbac282] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 04/17/2022] [Revised: 06/02/2022] [Accepted: 06/19/2022] [Indexed: 11/13/2022] Open

Doerstling SS, Akrobetu D, Engelhard MM, Chen F, Ubel PA. A Disease Identification Algorithm for Medical Crowdfunding Campaigns: Validation Study. J Med Internet Res 2022;24:e32867. [PMID: 35727610 PMCID: PMC9257615 DOI: 10.2196/32867] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2021] [Revised: 03/11/2022] [Accepted: 04/20/2022] [Indexed: 11/13/2022] Open

Abstract

Background

Web-based crowdfunding has become a popular method to raise money for medical expenses, and there is growing research interest in this topic. However, crowdfunding data are largely composed of unstructured text, thereby posing many challenges for researchers hoping to answer questions about specific medical conditions. Previous studies have used methods that either failed to address major challenges or were poorly scalable to large sample sizes. To enable further research on this emerging funding mechanism in health care, better methods are needed.

Objective

We sought to validate an algorithm for identifying 11 disease categories in web-based medical crowdfunding campaigns. We hypothesized that a disease identification algorithm combining a named entity recognition (NER) model and word search approach could identify disease categories with high precision and accuracy. Such an algorithm would facilitate further research using these data.

Methods

Web scraping was used to collect data on medical crowdfunding campaigns from GoFundMe (GoFundMe Inc). Using pretrained NER and entity resolution models from Spark NLP for Healthcare in combination with targeted keyword searches, we constructed an algorithm to identify conditions in the campaign descriptions, translate conditions to International Classification of Diseases, 10th Revision, Clinical Modification (ICD-10-CM) codes, and predict the presence or absence of 11 disease categories in the campaigns. The classification performance of the algorithm was evaluated against 400 manually labeled campaigns.

Results

We collected data on 89,645 crowdfunding campaigns through web scraping. The interrater reliability for detecting the presence of broad disease categories in the campaign descriptions was high (Cohen κ: range 0.69-0.96). The NER and entity resolution models identified 6594 unique (276,020 total) ICD-10-CM codes among all of the crowdfunding campaigns in our sample. Through our word search, we identified 3261 additional campaigns for which a medical condition was not otherwise detected with the NER model. When averaged across all disease categories and weighted by the number of campaigns that mentioned each disease category, the algorithm demonstrated an overall precision of 0.83 (range 0.48-0.97), a recall of 0.77 (range 0.42-0.98), an F₁ score of 0.78 (range 0.56-0.96), and an accuracy of 95% (range 90%-98%).

Conclusions

A disease identification algorithm combining pretrained natural language processing models and ICD-10-CM code–based disease categorization was able to detect 11 disease categories in medical crowdfunding campaigns with high precision and accuracy.
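The support-weighted averaging described in the Results above can be sketched as follows. This is an illustration only, not the authors' code: the category names and counts are invented toy values, and taking "the number of campaigns that mentioned each disease category" to mean true positives plus false negatives in the manual labels is an assumption.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Per-category precision, recall, and F1 from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def support_weighted_average(counts: dict[str, tuple[int, int, int]]) -> tuple[float, float, float]:
    """Average each metric across categories, weighted by support (tp + fn)."""
    total_support = sum(tp + fn for tp, _, fn in counts.values())
    totals = [0.0, 0.0, 0.0]
    for tp, fp, fn in counts.values():
        weight = (tp + fn) / total_support
        for i, value in enumerate(precision_recall_f1(tp, fp, fn)):
            totals[i] += weight * value
    return totals[0], totals[1], totals[2]


# Hypothetical counts per category: (true positives, false positives, false negatives).
toy_counts = {"cancer": (80, 20, 20), "injury": (10, 10, 30)}
avg_precision, avg_recall, avg_f1 = support_weighted_average(toy_counts)
print(round(avg_precision, 3), round(avg_recall, 3), round(avg_f1, 3))
```

Weighting by support lets frequently mentioned categories dominate the summary figure, which is why the abstract also reports the per-category ranges.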

McInnes BT, Downie JS, Hao Y, Jett J, Keating K, Nakum G, Ranjan S, Rodriguez NE, Tang J, Xiang D, Young EM, Nguyen MH. Discovering Content through Text Mining for a Synthetic Biology Knowledge System. ACS Synth Biol 2022;11:2043-2054. [PMID: 35671034 DOI: 10.1021/acssynbio.1c00611] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 11/29/2022]

Yeung CS, Beck T, Posma JM. MetaboListem and TABoLiSTM: Two Deep Learning Algorithms for Metabolite Named Entity Recognition. Metabolites 2022;12:276. [PMID: 35448463 DOI: 10.3390/metabo12040276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2022] [Revised: 03/15/2022] [Accepted: 03/17/2022] [Indexed: 11/17/2022] Open

Liu HY, Han CJ, Xiong J, Li HY, Lei L, Liu BY. [Automatic labeling and extraction of terms in natural language processing in acupuncture clinical literature]. Zhongguo Zhen Jiu 2022;42:327-331. [PMID: 35272414 DOI: 10.13703/j.0255-2930.20211107-k0002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]

Chew R, Wenger M, Guillory J, Nonnemaker J, Kim A. Identifying Electronic Nicotine Delivery System Brands and Flavors on Instagram: Natural Language Processing Analysis. J Med Internet Res 2022;24:e30257. [PMID: 35040793 PMCID: PMC8808345 DOI: 10.2196/30257] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 05/09/2021] [Revised: 11/01/2021] [Accepted: 11/21/2021] [Indexed: 01/30/2023] Open

Abstract

Background

Electronic nicotine delivery system (ENDS) brands, such as JUUL, used social media as a key component of their marketing strategy, which led to massive sales growth from 2015 to 2018. During this time, ENDS use rapidly increased among youths and young adults, with flavored products being particularly popular among these groups.

Objective

The aim of our study is to develop a named entity recognition (NER) model to identify potential emerging vaping brands and flavors from Instagram post text. NER is a natural language processing task for identifying specific types of words (entities) in text based on the characteristics of the entity and surrounding words.

Methods

NER models were trained on a labeled data set of 2272 Instagram posts coded for ENDS brands and flavors. We compared three types of NER models—conditional random fields, a residual convolutional neural network, and a fine-tuned distilled bidirectional encoder representations from transformers (FTDB) network—to identify brands and flavors in Instagram posts with key model outcomes of precision, recall, and F1 scores. We used data from Nielsen scanner sales and Wikipedia to create benchmark dictionaries to determine whether brands from established ENDS brand and flavor lists were mentioned in the Instagram posts in our sample. To prevent overfitting, we performed 5-fold cross-validation and reported the mean and SD of the model validation metrics across the folds.

Results

For brands, the residual convolutional neural network exhibited the highest mean precision (0.797, SD 0.084), and the FTDB exhibited the highest mean recall (0.869, SD 0.103). For flavors, the FTDB exhibited both the highest mean precision (0.860, SD 0.055) and recall (0.801, SD 0.091). All NER models outperformed the benchmark brand and flavor dictionary look-ups on mean precision, recall, and F1. Comparing between the benchmark brand lists, the larger Wikipedia list outperformed the Nielsen list in both precision and recall.

Conclusions

Our findings suggest that NER models correctly identified ENDS brands and flavors in Instagram posts at rates competitive with, or better than, others in the published literature. Brands identified during manual annotation showed little overlap with those in Nielsen scanner data, suggesting that NER models may capture emerging brands with limited sales and distribution. NER models address the challenges of manual brand identification and can be used to support future infodemiology and infoveillance studies. Brands identified on social media should be cross-validated with Nielsen and other data sources to differentiate emerging brands that have become established from those with limited sales and distribution.
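The 5-fold cross-validation protocol in the Methods above, with mean and SD reported across folds, can be sketched in plain Python. Everything here is a hypothetical illustration: the splitting helper, the fixed seed, and the choice of the sample standard deviation are assumptions, since the abstract does not specify them.

```python
import random
import statistics


def k_fold_indices(n_items: int, k: int = 5, seed: int = 0):
    """Yield (training, validation) index lists for k shuffled, disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # round-robin split after shuffling
    for held_out in range(k):
        training = [idx for j, fold in enumerate(folds) if j != held_out for idx in fold]
        yield training, folds[held_out]


def summarize(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation of per-fold validation scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# 2272 labeled posts, as in the abstract; each post is held out exactly once.
splits = list(k_fold_indices(2272, k=5))
print([len(validation) for _, validation in splits])
```

In use, a model would be trained on each `training` list and scored on the matching `validation` list, and `summarize` would then be applied to the five resulting precision, recall, or F1 values.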

Wang J, Ren Y, Zhang Z, Xu H, Zhang Y. From Tokenization to Self-Supervision: Building a High-Performance Information Extraction System for Chemical Reactions in Patents. Front Res Metr Anal 2022;6:691105. [PMID: 35005421 PMCID: PMC8727901 DOI: 10.3389/frma.2021.691105] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2021] [Accepted: 11/02/2021] [Indexed: 11/28/2022] Open

Wu Y, Liu Z, Wu L, Chen M, Tong W. BERT-Based Natural Language Processing of Drug Labeling Documents: A Case Study for Classifying Drug-Induced Liver Injury Risk. Front Artif Intell 2021;4:729834. [PMID: 34939028 PMCID: PMC8685544 DOI: 10.3389/frai.2021.729834] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 06/23/2021] [Accepted: 11/17/2021] [Indexed: 11/16/2022] Open

Naderi N, Knafou J, Copara J, Ruch P, Teodoro D. Ensemble of Deep Masked Language Models for Effective Named Entity Recognition in Health and Life Science Corpora. Front Res Metr Anal 2021;6:689803. [PMID: 34870074 PMCID: PMC8640190 DOI: 10.3389/frma.2021.689803] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/01/2021] [Accepted: 10/11/2021] [Indexed: 11/13/2022] Open

Wu H, Ji J, Tian H, Chen Y, Ge W, Zhang H, Yu F, Zou J, Nakamura M, Liao J. Chinese-Named Entity Recognition From Adverse Drug Event Records: Radical Embedding-Combined Dynamic Embedding-Based BERT in a Bidirectional Long Short-term Conditional Random Field (Bi-LSTM-CRF) Model. JMIR Med Inform 2021;9:e26407. [PMID: 34855616 PMCID: PMC8686410 DOI: 10.2196/26407] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/10/2020] [Revised: 04/22/2021] [Accepted: 10/05/2021] [Indexed: 12/17/2022] Open

Abstract

Background

With the increasing variety of drugs, the incidence of adverse drug events (ADEs) is increasing year by year. Massive numbers of ADEs are recorded in electronic medical records and adverse drug reaction (ADR) reports, which are important sources of potential ADR information. Meanwhile, it is essential to make latent ADR information automatically available for better postmarketing drug safety reevaluation and pharmacovigilance.

Objective

This study describes how to identify ADR-related information from Chinese ADE reports.

Methods

Our study established an efficient automated tool, named BBC-Radical. BBC-Radical is a model that consists of 3 components: Bidirectional Encoder Representations from Transformers (BERT), bidirectional long short-term memory (bi-LSTM), and conditional random field (CRF). The model identifies ADR-related information from Chinese ADR reports. Token features and radical features of Chinese characters were used to represent the common meaning of a group of words. BERT and Bi-LSTM-CRF were novel models that combined these features to conduct named entity recognition (NER) tasks in the free-text section of 24,890 ADR reports from the Jiangsu Province Adverse Drug Reaction Monitoring Center from 2010 to 2016. Moreover, the man-machine comparison experiment on the ADE records from Drum Tower Hospital was designed to compare the NER performance between the BBC-Radical model and a manual method.

Results

The NER model achieved relatively high performance, with a precision of 96.4%, recall of 96.0%, and F1 score of 96.2%. In the man-machine comparison experiment, the BBC-Radical model (precision 87.2%, recall 85.7%, and F1 score 86.4%) performed much better than the manual method (precision 86.1%, recall 73.8%, and F1 score 79.5%) in the recognition task of each kind of entity.

Conclusions

The proposed model was competitive in extracting ADR-related information from ADE reports, and the results suggest that the application of our method to extract ADR-related information is of great significance in improving the quality of ADR reports and postmarketing drug safety evaluation.

Larmande P, Liu Y, Yao X, Xia J. OryzaGP 2021 update: a rice gene and protein dataset for named-entity recognition. Genomics Inform 2021;19:e27. [PMID: 34638174 PMCID: PMC8510865 DOI: 10.5808/gi.21015] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/17/2021] [Accepted: 07/27/2021] [Indexed: 12/02/2022] Open

Lovis C, Rayson P. Social Media Monitoring of the COVID-19 Pandemic and Influenza Epidemic With Adaptation for Informal Language in Arabic Twitter Data: Qualitative Study. JMIR Med Inform 2021;9:e27670. [PMID: 34346892 PMCID: PMC8451962 DOI: 10.2196/27670] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/02/2021] [Revised: 04/20/2021] [Accepted: 06/20/2021] [Indexed: 02/05/2023] Open

Abstract

BACKGROUND

Twitter is a real-time messaging platform widely used by people and organizations to share information on many topics. Systematic monitoring of social media posts (infodemiology or infoveillance) could be useful to detect misinformation outbreaks as well as to reduce reporting lag time and to provide an independent complementary source of data compared with traditional surveillance approaches. However, such an analysis is currently not possible in the Arabic-speaking world owing to a lack of basic building blocks for research and dialectal variation.

OBJECTIVE

We collected around 4000 Arabic tweets related to COVID-19 and influenza. We cleaned and labeled the tweets relative to the Arabic Infectious Diseases Ontology, which includes nonstandard terminology, as well as 11 core concepts and 21 relations. The aim of this study was to analyze Arabic tweets to estimate their usefulness for health surveillance, understand the impact of the informal terms in the analysis, show the effect of deep learning methods in the classification process, and identify the locations where the infection is spreading.

METHODS

We applied the following multilabel classification techniques: binary relevance, classifier chains, label power set, adapted algorithm (multilabel adapted k-nearest neighbors [MLKNN]), support vector machine with naive Bayes features (NBSVM), bidirectional encoder representations from transformers (BERT), and AraBERT (transformer-based model for Arabic language understanding) to identify tweets appearing to be from infected individuals. We also used named entity recognition to predict the place names mentioned in the tweets.

RESULTS

We achieved an F1 score of up to 88% in the influenza case study and 94% in the COVID-19 one. Adapting for nonstandard terminology and informal language helped to improve accuracy by as much as 15%, with an average improvement of 8%. Deep learning methods achieved an F1 score of up to 94% during the classifying process. Our geolocation detection algorithm had an average accuracy of 54% for predicting the location of users according to tweet content.

CONCLUSIONS

This study identified two Arabic social media data sets for monitoring tweets related to influenza and COVID-19. It demonstrated the importance of including informal terms, which are regularly used by social media users, in the analysis. It also proved that BERT achieves good results when used with new terms in COVID-19 tweets. Finally, the tweet content may contain useful information to determine the location of disease spread.

Noh J, Kavuluru R. Joint Learning for Biomedical NER and Entity Normalization: Encoding Schemes, Counterfactual Examples, and Zero-Shot Evaluation. ACM BCB 2021;2021. [PMID: 34505115 DOI: 10.1145/3459930.3469533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]

Hong Z, Pauloski JG, Ward L, Chard K, Blaiszik B, Foster I. Models and Processes to Extract Drug-like Molecules From Natural Language Text. Front Mol Biosci 2021;8:636077. [PMID: 34527701 PMCID: PMC8435623 DOI: 10.3389/fmolb.2021.636077] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2020] [Accepted: 08/11/2021] [Indexed: 11/28/2022] Open

Wang H, Yeung WLK, Ng QX, Tung A, Tay JAM, Ryanputra D, Ong MEH, Feng M, Arulanandam S. A Weakly-Supervised Named Entity Recognition Machine Learning Approach for Emergency Medical Services Clinical Audit. Int J Environ Res Public Health 2021;18:7776. [PMID: 34360065 PMCID: PMC8345494 DOI: 10.3390/ijerph18157776] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2021] [Revised: 07/19/2021] [Accepted: 07/21/2021] [Indexed: 11/16/2022]

Hu D, Zhang H, Li S, Wang Y, Wu N, Lu X. Automatic Extraction of Lung Cancer Staging Information From Computed Tomography Reports: Deep Learning Approach. JMIR Med Inform 2021;9:e27955. [PMID: 34287213 PMCID: PMC8339987 DOI: 10.2196/27955] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/15/2021] [Revised: 05/27/2021] [Accepted: 06/07/2021] [Indexed: 01/04/2023] Open

Abstract

BACKGROUND

Lung cancer is the leading cause of cancer deaths worldwide. Clinical staging of lung cancer plays a crucial role in making treatment decisions and evaluating prognosis. However, in clinical practice, approximately one-half of the clinical stages of lung cancer patients are inconsistent with their pathological stages. As one of the most important diagnostic modalities for staging, chest computed tomography (CT) provides a wealth of information about cancer staging, but the free-text nature of the CT reports obstructs their computerization.

OBJECTIVE

We aimed to automatically extract the staging-related information from CT reports to support accurate clinical staging of lung cancer.

METHODS

In this study, we developed an information extraction (IE) system to extract the staging-related information from CT reports. The system consisted of the following three parts: named entity recognition (NER), relation classification (RC), and postprocessing (PP). We first summarized 22 questions about lung cancer staging based on the TNM staging guideline. Next, three state-of-the-art NER algorithms were implemented to recognize the entities of interest. Next, we designed a novel RC method using the relation sign constraint (RSC) to classify the relations between entities. Finally, a rule-based PP module was established to obtain the formatted answers using the results of NER and RC.

RESULTS

We evaluated the developed IE system on a clinical data set containing 392 chest CT reports collected from the Department of Thoracic Surgery II in the Peking University Cancer Hospital. The experimental results showed that the bidirectional encoder representation from transformers (BERT) model outperformed the iterated dilated convolutional neural networks-conditional random field (ID-CNN-CRF) and bidirectional long short-term memory networks-conditional random field (Bi-LSTM-CRF) for NER tasks with macro-F1 scores of 80.97% and 90.06% under the exact and inexact matching schemes, respectively. For the RC task, the proposed RSC showed better performance than the baseline methods. Further, the BERT-RSC model achieved the best performance with a macro-F1 score of 97.13% and a micro-F1 score of 98.37%. Moreover, the rule-based PP module could correctly obtain the formatted results using the extractions of NER and RC, achieving a macro-F1 score of 94.57% and a micro-F1 score of 96.74% for all the 22 questions.

CONCLUSIONS

We conclude that the developed IE system can effectively and accurately extract information about lung cancer staging from CT reports. Experimental results show that the extracted results have significant potential for further use in stage verification and prediction to facilitate accurate clinical staging.
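The Results above report both macro-F1 and micro-F1, which can diverge when entity types differ in frequency. A minimal sketch of the two averages follows; the entity type names and counts are invented for illustration and are not from the study.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 expressed directly in counts: 2*TP / (2*TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(counts: dict[str, tuple[int, int, int]]) -> float:
    """Unweighted mean of per-type F1; every entity type counts equally."""
    return sum(f1_from_counts(*c) for c in counts.values()) / len(counts)


def micro_f1(counts: dict[str, tuple[int, int, int]]) -> float:
    """F1 over pooled counts; frequent entity types dominate."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    return f1_from_counts(tp, fp, fn)


# Hypothetical (TP, FP, FN) per entity type: one frequent and easy, one rare and hard.
toy_counts = {"tumor_size": (90, 5, 5), "lymph_node": (5, 5, 10)}
print(round(macro_f1(toy_counts), 3), round(micro_f1(toy_counts), 3))
```

With counts like these the micro average sits well above the macro average, because the rare, poorly recognized type barely affects the pooled totals.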

Li J, Zhou Y, Jiang X, Natarajan K, Pakhomov SV, Liu H, Xu H. Are synthetic clinical notes useful for real natural language processing tasks: A case study on clinical entity recognition. J Am Med Inform Assoc 2021;28:2193-2201. [PMID: 34272955 DOI: 10.1093/jamia/ocab112] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 12/10/2020] [Revised: 05/09/2021] [Accepted: 06/07/2021] [Indexed: 11/13/2022] Open

Abstract

OBJECTIVE

Developing clinical natural language processing systems often requires access to many clinical documents, which are not widely available to the public due to privacy and security concerns. To address this challenge, we propose to develop methods to generate synthetic clinical notes and evaluate their utility in real clinical natural language processing tasks.

MATERIALS AND METHODS

We implemented 4 state-of-the-art text generation models, namely CharRNN, SegGAN, GPT-2, and CTRL, to generate clinical text for the History and Present Illness section. We then manually annotated clinical entities in 500 randomly selected History and Present Illness notes generated by the best-performing algorithm. To compare the utility of natural and synthetic corpora, we trained named entity recognition (NER) models from all 3 corpora and evaluated their performance on 2 independent natural corpora.

RESULTS

Our evaluation shows GPT-2 achieved the best BLEU (bilingual evaluation understudy) score (with a BLEU-2 of 0.92). NER models trained on the synthetic corpus generated by GPT-2 showed slightly better performance on 2 independent corpora (strict F1 scores of 0.709 and 0.748, respectively) than the NER models trained on the natural corpus (F1 scores of 0.706 and 0.737, respectively), indicating the good utility of synthetic corpora in clinical NER model development. In addition, we demonstrated that an augmented method that combines both natural and synthetic corpora achieved better performance than one that uses the natural corpus only.

CONCLUSIONS

Recent advances in text generation have made it possible to generate synthetic clinical notes that could be useful for training NER models for information extraction from natural clinical notes, thus lowering the privacy concern and increasing data availability. Further investigation is needed to apply this technology to practice.

Du J, Xiang Y, Sankaranarayanapillai M, Zhang M, Wang J, Si Y, Pham HA, Xu H, Chen Y, Tao C. Extracting postmarketing adverse events from safety reports in the vaccine adverse event reporting system (VAERS) using deep learning. J Am Med Inform Assoc 2021;28:1393-1400. [PMID: 33647938 DOI: 10.1093/jamia/ocab014] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 08/26/2020] [Revised: 01/14/2021] [Accepted: 01/20/2021] [Indexed: 11/14/2022] Open

Abstract

OBJECTIVE

Automated analysis of vaccine postmarketing surveillance narrative reports is important to understand the progression of rare but severe vaccine adverse events (AEs). This study implemented and evaluated state-of-the-art deep learning algorithms for named entity recognition to extract nervous system disorder-related events from vaccine safety reports.

MATERIALS AND METHODS

We collected Guillain-Barré syndrome (GBS) related influenza vaccine safety reports from the Vaccine Adverse Event Reporting System (VAERS) from 1990 to 2016. VAERS reports were selected and manually annotated with major entities related to nervous system disorders, including investigation, nervous_AE, other_AE, procedure, social_circumstance, and temporal_expression. A variety of conventional machine learning and deep learning algorithms were then evaluated for the extraction of the above entities. We further pretrained domain-specific BERT (Bidirectional Encoder Representations from Transformers) using VAERS reports (VAERS BERT) and compared its performance with existing models.

RESULTS AND CONCLUSIONS

Ninety-one VAERS reports were annotated, resulting in 2512 entities. The corpus was made publicly available to promote community efforts on vaccine AEs identification. Deep learning-based methods (eg, bi-long short-term memory and BERT models) outperformed conventional machine learning-based methods (ie, conditional random fields with extensive features). The BioBERT large model achieved the highest exact match F-1 scores on nervous_AE, procedure, social_circumstance, and temporal_expression; while VAERS BERT large models achieved the highest exact match F-1 scores on investigation and other_AE. An ensemble of these 2 models achieved the highest exact match microaveraged F-1 score at 0.6802 and the second highest lenient match microaveraged F-1 score at 0.8078 among peer models.
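The gap between the exact-match and lenient-match F-1 scores above comes from how a predicted span is counted as correct. The sketch below shows one common convention and is not the authors' evaluation script: treating "lenient" as any character overlap with the same label is an assumption, and the spans are invented.

```python
from typing import Callable, NamedTuple


class Span(NamedTuple):
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def exact_match(pred: Span, gold: Span) -> bool:
    """Same label and identical boundaries."""
    return pred == gold


def lenient_match(pred: Span, gold: Span) -> bool:
    """Same label and at least one overlapping character (assumed definition)."""
    return pred.label == gold.label and pred.start < gold.end and gold.start < pred.end


def span_f1(predicted: list[Span], gold: list[Span], match: Callable[[Span, Span], bool]) -> float:
    """F1 where a span is correct if it matches any span on the other side."""
    hit_pred = sum(any(match(p, g) for g in gold) for p in predicted)
    hit_gold = sum(any(match(p, g) for p in predicted) for g in gold)
    precision = hit_pred / len(predicted) if predicted else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0


# Toy example: the first prediction clips the end of the gold span.
gold_spans = [Span(0, 3, "nervous_AE"), Span(10, 12, "procedure")]
pred_spans = [Span(0, 2, "nervous_AE"), Span(10, 12, "procedure")]
exact_score = span_f1(pred_spans, gold_spans, exact_match)
lenient_score = span_f1(pred_spans, gold_spans, lenient_match)
print(exact_score, lenient_score)
```

A boundary error that costs a full point under exact matching is forgiven under lenient matching, so lenient scores are never lower than exact scores for the same predictions.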

Mahendran D, Gurdin G, Lewinski N, Tang C, McInnes BT. Identifying Chemical Reactions and Their Associated Attributes in Patents. Front Res Metr Anal 2021;6:688353. [PMID: 34322654 PMCID: PMC8312343 DOI: 10.3389/frma.2021.688353] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/30/2021] [Accepted: 05/31/2021] [Indexed: 11/13/2022] Open

Zhang Y, Zhang Y, Qi P, Manning CD, Langlotz CP. Biomedical and clinical English model packages for the Stanza Python NLP library. J Am Med Inform Assoc 2021;28:1892-1899. [PMID: 34157094 PMCID: PMC8363782 DOI: 10.1093/jamia/ocab090] [Citation(s) in RCA: 31] [Impact Index Per Article: 10.3] [Received: 01/05/2021] [Revised: 04/05/2021] [Accepted: 05/03/2021] [Indexed: 11/13/2022] Open

Jia Q, Zhang D, Xu H, Xie Y. Extraction of Traditional Chinese Medicine Entity: Design of a Novel Span-Level Named Entity Recognition Method With Distant Supervision. JMIR Med Inform 2021;9:e28219. [PMID: 34125076 PMCID: PMC8240806 DOI: 10.2196/28219] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/25/2021] [Revised: 04/14/2021] [Accepted: 04/19/2021] [Indexed: 11/23/2022] Open