1
Campillos-Llanos L, Valverde-Mateos A, Capllonch-Carrión A. Hybrid natural language processing tool for semantic annotation of medical texts in Spanish. BMC Bioinformatics 2025; 26:7. [PMID: 39780059 PMCID: PMC11708069 DOI: 10.1186/s12859-024-05949-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2024] [Accepted: 09/30/2024] [Indexed: 01/11/2025] Open
Abstract
BACKGROUND Natural language processing (NLP) enables the extraction of information embedded within unstructured texts, such as clinical case reports and trial eligibility criteria. By identifying relevant medical concepts, NLP facilitates the generation of structured and actionable data, supporting complex tasks like cohort identification and the analysis of clinical records. To accomplish those tasks, we introduce a deep learning-based and lexicon-based named entity recognition (NER) tool for texts in Spanish. It performs medical NER and normalization, medication information extraction and detection of temporal entities, negation and speculation, and temporality or experiencer attributes (Age, Contraindicated, Negated, Speculated, Hypothetical, Future, Family_member, Patient and Other). We built the tool with a dedicated lexicon and rules adapted from NegEx and HeidelTime. Using these resources, we annotated a corpus of 1200 texts, with high inter-annotator agreement (average F1 = 0.841 ± 0.045 for entities, and average F1 = 0.881 ± 0.032 for attributes). We used this corpus to train Transformer-based models (RoBERTa-based models, mBERT and mDeBERTa). We integrated them with the dictionary-based system in a hybrid tool, and distribute the models via the Hugging Face hub. For an internal validation, we used a held-out test set and conducted an error analysis. For an external validation, eight medical professionals evaluated the system by revising the annotation of 200 new texts not used in development. RESULTS In the internal validation, the models yielded F1 values up to 0.915. In the external validation with 100 clinical trials, the tool achieved an average F1 score of 0.858 (± 0.032); and in 100 anonymized clinical cases, it achieved an average F1 score of 0.910 (± 0.019). CONCLUSIONS The tool is available at https://claramed.csic.es/medspaner. We also release the code (https://github.com/lcampillos/medspaner) and the annotated corpus to train the models.
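The abstract above mentions rules adapted from NegEx for detecting negation. As a rough illustration of that general idea only, the following Python sketch marks an entity as negated when it appears within a short token window after a negation trigger. The trigger list, the scope terminators, the window size, and all function names are hypothetical assumptions chosen for this example; they are not the lexicon or rules of the published tool.

```python
import re

# Hypothetical Spanish negation triggers (illustrative, not the tool's lexicon).
NEGATION_TRIGGERS = ("no", "sin", "niega", "ausencia de")
# Tokens assumed to close a negation scope (illustrative assumption).
SCOPE_TERMINATORS = ("pero", "aunque", ".", ";")
# Assumed maximum number of tokens a trigger can reach forward.
WINDOW = 5


def tokenize(text):
    """Lowercase the text and split it into word tokens and punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())


def contains(tokens, sub):
    """Return True if the token list `sub` occurs contiguously inside `tokens`."""
    n = len(sub)
    return n > 0 and any(tokens[k:k + n] == sub for k in range(len(tokens) - n + 1))


def negated_entities(text, entities):
    """Return the set of entity strings found inside a negation scope."""
    tokens = tokenize(text)
    negated = set()
    for i in range(len(tokens)):
        for trigger in NEGATION_TRIGGERS:
            trigger_tokens = trigger.split()
            if tokens[i:i + len(trigger_tokens)] != trigger_tokens:
                continue
            # Collect the scope: tokens after the trigger, up to WINDOW tokens
            # or the first terminator, whichever comes first.
            start = i + len(trigger_tokens)
            scope = []
            for tok in tokens[start:start + WINDOW]:
                if tok in SCOPE_TERMINATORS:
                    break
                scope.append(tok)
            for entity in entities:
                if contains(scope, tokenize(entity)):
                    negated.add(entity)
    return negated
```

For example, `negated_entities("Paciente sin fiebre, pero con tos persistente.", ["fiebre", "tos"])` flags only "fiebre", because "pero" closes the scope opened by "sin". A real system would also need pre-entity and post-entity triggers, pseudo-negations, and speculation cues.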
Affiliation(s)
- Ana Valverde-Mateos
- Medical Terminology Unit, Spanish Royal Academy of Medicine, C/Arrieta 12, 28013, Madrid, Spain
- Adrián Capllonch-Carrión
- Centro de Salud Retiro, Hospital Universitario Gregorio Marañon, C/Lope de Rueda, 43, 28009, Madrid, Spain
2
Yada S, Nakamura Y, Wakamiya S, Aramaki E. Cross-lingual Natural Language Processing on Limited Annotated Case/Radiology Reports in English and Japanese: Insights from the Real-MedNLP Workshop. Methods Inf Med 2024. [PMID: 39209296 DOI: 10.1055/a-2405-2489] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/04/2024]
Abstract
BACKGROUND Textual datasets (corpora) are crucial for the application of natural language processing (NLP) models. However, corpus creation in the medical field is challenging, primarily because of privacy issues with raw clinical data such as health records. Thus, the existing clinical corpora are generally small and scarce. Medical NLP (MedNLP) methodologies perform well with limited data availability. OBJECTIVES We present the outcomes of the Real-MedNLP workshop, which was conducted using limited and parallel medical corpora. Real-MedNLP exhibits three distinct characteristics: (1) limited annotated documents: the training data comprise only a small set (∼100) of case reports (CRs) and radiology reports (RRs) that have been annotated. (2) Bilingually parallel: the constructed corpora are parallel in Japanese and English. (3) Practical tasks: the workshop addresses fundamental tasks, such as named entity recognition (NER) and applied practical tasks. METHODS We propose three tasks: NER of ∼100 available documents (Task 1), NER based only on annotation guidelines for humans (Task 2), and clinical applications (Task 3) consisting of adverse drug effect (ADE) detection for CRs and identical case identification (CI) for RRs. RESULTS Nine teams participated in this study. The best systems achieved 0.65 and 0.89 F1-scores for CRs and RRs in Task 1, whereas the top scores in Task 2 decreased by 50 to 70%. In Task 3, ADE reports were detected by up to 0.64 F1-score, and CI scored up to 0.96 binary accuracy. CONCLUSION Most systems adopt medical-domain-specific pretrained language models using data augmentation methods. Despite the challenge of limited corpus size in Tasks 1 and 2, recent approaches are promising because the partial match scores reached ∼0.8-0.9 F1-scores. Task 3 applications revealed that the different availabilities of external language resources affected the performance per language.
Affiliation(s)
- Shuntaro Yada
- Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan
- Yuta Nakamura
- 22nd Century Medical and Research Center, The University of Tokyo Hospital, Tokyo, Japan
- Shoko Wakamiya
- Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan
- Eiji Aramaki
- Graduate School of Science and Technology, Nara Institute of Science and Technology, Nara, Japan
3
Zhu E, Sheng Q, Yang H, Liu Y, Cai T, Li J. A unified framework of medical information annotation and extraction for Chinese clinical text. Artif Intell Med 2023; 142:102573. [PMID: 37316096 DOI: 10.1016/j.artmed.2023.102573] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 03/17/2023] [Accepted: 04/27/2023] [Indexed: 06/16/2023]
Abstract
Medical information extraction consists of a group of natural language processing (NLP) tasks that together convert clinical text into pre-defined structured formats. This is a critical step in exploiting electronic medical records (EMRs). Given the recent rapid progress in NLP technologies, model implementation and performance no longer seem to be an obstacle; the bottleneck instead lies in obtaining a high-quality annotated corpus and in the overall engineering workflow. This study presents an engineering framework consisting of three tasks, i.e., medical entity recognition, relation extraction and attribute extraction. Within this framework, the whole workflow is demonstrated, from EMR data collection through model performance evaluation. Our annotation scheme is designed to be comprehensive and compatible across the multiple tasks. With EMRs from a general hospital in Ningbo, China, and manual annotation by experienced physicians, our corpus is of large scale and high quality. Built upon this Chinese clinical corpus, the medical information extraction system shows performance that approaches human annotation. The annotation scheme, (a subset of) the annotated corpus, and the code are all publicly released to facilitate further research.
Affiliation(s)
- Enwei Zhu
- Ningbo No. 2 Hospital, Ningbo 315010, Zhejiang Province, PR China; Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo 315016, Zhejiang Province, PR China.
- Qilin Sheng
- Ningbo No. 2 Hospital, Ningbo 315010, Zhejiang Province, PR China.
- Huanwan Yang
- Ningbo No. 2 Hospital, Ningbo 315010, Zhejiang Province, PR China.
- Yiyang Liu
- Ningbo No. 2 Hospital, Ningbo 315010, Zhejiang Province, PR China; Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo 315016, Zhejiang Province, PR China.
- Ting Cai
- Ningbo No. 2 Hospital, Ningbo 315010, Zhejiang Province, PR China; Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo 315016, Zhejiang Province, PR China.
- Jinpeng Li
- Ningbo No. 2 Hospital, Ningbo 315010, Zhejiang Province, PR China; Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo 315016, Zhejiang Province, PR China.
4
Chang H, Zan H, Zhang S, Zhao B, Zhang K. Construction of cardiovascular information extraction corpus based on electronic medical records. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:13379-13397. [PMID: 37501492 DOI: 10.3934/mbe.2023596] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/29/2023]
Abstract
Cardiovascular disease has a significant impact on both society and patients, making it necessary to conduct knowledge-based research such as research that utilizes knowledge graphs and automated question answering. However, the existing research on corpus construction for cardiovascular disease is relatively limited, which has hindered further knowledge-based research on this disease. Electronic medical records contain patient data that span the entire diagnosis and treatment process and include a large amount of reliable medical information. Therefore, we collected electronic medical record data related to cardiovascular disease, combined the data with relevant work experience and developed a standard for labeling cardiovascular electronic medical record entities and entity relations. By building a sentence-level labeling result dictionary through the use of a rule-based semi-automatic method, a cardiovascular electronic medical record entity and entity relationship labeling corpus (CVDEMRC) was constructed. The CVDEMRC contains 7691 entities and 11,185 entity relation triples, and the results of consistency examination were 93.51% and 84.02% for entities and entity-relationship annotations, respectively, demonstrating good consistency results. The CVDEMRC constructed in this study is expected to provide a database for information extraction research related to cardiovascular diseases.
Affiliation(s)
- Hongyang Chang
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China
- Hongying Zan
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China
- Peng Cheng Laboratory, Shenzhen, China
- Shuai Zhang
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China
- Bingfei Zhao
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China
- Kunli Zhang
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou, China
- Peng Cheng Laboratory, Shenzhen, China
5
Richter-Pechanski P, Wiesenbach P, Schwab DM, Kiriakou C, He M, Allers MM, Tiefenbacher AS, Kunz N, Martynova A, Spiller N, Mierisch J, Borchert F, Schwind C, Frey N, Dieterich C, Geis NA. A distributable German clinical corpus containing cardiovascular clinical routine doctor's letters. Sci Data 2023; 10:207. [PMID: 37059736 PMCID: PMC10104831 DOI: 10.1038/s41597-023-02128-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2022] [Accepted: 03/31/2023] [Indexed: 04/16/2023] Open
Abstract
We present CARDIO:DE, the first freely available and distributable large German clinical corpus from the cardiovascular domain. CARDIO:DE encompasses 500 clinical routine German doctor's letters from Heidelberg University Hospital, which were manually annotated. Our prospective study design complies well with current data protection regulations and allows us to keep the original structure of clinical documents consistent. In order to ease access to our corpus, we manually de-identified all letters. To enable various information extraction tasks the temporal information in the documents was preserved. We added two high-quality manual annotation layers to CARDIO:DE, (1) medication information and (2) CDA-compliant section classes. To the best of our knowledge, CARDIO:DE is the first freely available and distributable German clinical corpus in the cardiovascular domain. In summary, our corpus offers unique opportunities for collaborative and reproducible research on natural language processing models for German clinical texts.
Affiliation(s)
- Phillip Richter-Pechanski
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim, Heidelberg, DE, Germany
- Informatics for Life, Heidelberg, DE, Germany
- Philipp Wiesenbach
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Informatics for Life, Heidelberg, DE, Germany
- Dominic M Schwab
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Christina Kiriakou
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Mingyang He
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Michael M Allers
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Anna S Tiefenbacher
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Nicola Kunz
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Anna Martynova
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Noemie Spiller
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Julian Mierisch
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Florian Borchert
- Digital Health Center, Hasso Plattner Institute, University of Potsdam, Potsdam, DE, Germany
- Charlotte Schwind
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Norbert Frey
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim, Heidelberg, DE, Germany
- Informatics for Life, Heidelberg, DE, Germany
- Christoph Dieterich
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim, Heidelberg, DE, Germany
- Informatics for Life, Heidelberg, DE, Germany
- Nicolas A Geis
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Informatics for Life, Heidelberg, DE, Germany
6
Chen JS, Lin WC, Yang S, Chiang MF, Hribar MR. Development of an Open-Source Annotated Glaucoma Medication Dataset From Clinical Notes in the Electronic Health Record. Transl Vis Sci Technol 2022; 11:20. [PMID: 36441131 PMCID: PMC9710490 DOI: 10.1167/tvst.11.11.20] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 06/05/2022] [Accepted: 10/21/2022] [Indexed: 11/30/2022] Open
Abstract
Purpose To describe the methods involved in processing and characteristics of an open dataset of annotated clinical notes from the electronic health record (EHR) annotated for glaucoma medications. Methods In this study, 480 clinical notes from office visits, medical record numbers (MRNs), visit identification numbers, provider names, and billing codes were extracted for 480 patients seen for glaucoma by a comprehensive or glaucoma ophthalmologist from January 1, 2019, to August 31, 2020. MRNs and all visit data were de-identified using a hash function with salt from the deidentifyr package. All progress notes were annotated for glaucoma medication name, route, frequency, dosage, and drug use using an open-source annotation tool, Doccano. Annotations were saved separately. All protected health information (PHI) in progress notes and annotated files were de-identified using the published de-identifying algorithm Philter. All progress notes and annotations were manually validated by two ophthalmologists to ensure complete de-identification. Results The final dataset contained 5520 annotated sentences, including those with and without medications, for 480 clinical notes. Manual validation revealed 10 instances of remaining PHI which were manually corrected. Conclusions Annotated free-text clinical notes can be de-identified for upload as an open dataset. As data availability increases with the adoption of EHRs, free-text open datasets will become increasingly valuable for "big data" research and artificial intelligence development. This dataset is published online and publicly available at https://github.com/jche253/Glaucoma_Med_Dataset. Translational Relevance This open access medication dataset may be a source of raw data for future research involving big data and artificial intelligence research using free-text.
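The de-identification step described above hashes identifiers such as MRNs together with a salt. The authors used the deidentifyr package; the Python sketch below is not that package's implementation, only a minimal illustration of salted hashing under stated assumptions: SHA-256 as the hash function, a random hex salt kept secret by the data holder, and hypothetical function names.

```python
import hashlib
import secrets


def make_salt(n_bytes=16):
    """Generate a random hex salt; it must be stored securely and never published."""
    return secrets.token_hex(n_bytes)


def pseudonymize(identifier, salt):
    """Map an identifier to a fixed-length pseudonym using salted SHA-256.

    The same identifier and salt always give the same pseudonym, so records
    belonging to one patient remain linkable after de-identification.
    """
    payload = (salt + str(identifier)).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


# Illustrative usage with made-up record numbers.
salt = make_salt()
mrns = ["000123", "000456", "000123"]
pseudonyms = [pseudonymize(mrn, salt) for mrn in mrns]
```

The salt matters because record numbers come from a small, guessable space: without it, anyone could hash every candidate number and reverse the mapping. In this sketch the first and third pseudonyms are identical, since they derive from the same made-up record number.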
Affiliation(s)
- Jimmy S. Chen
- Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR, USA
- Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, La Jolla, CA, USA
- Wei-Chun Lin
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA
- Sen Yang
- Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR, USA
- Michael F. Chiang
- National Eye Institute, National Institutes of Health, Bethesda, MD, USA
- Michelle R. Hribar
- Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR, USA
- Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA
7
Shinohara E, Shibata D, Kawazoe Y. Development of comprehensive annotation criteria for patients' states from clinical texts. J Biomed Inform 2022; 134:104200. [PMID: 36089198 DOI: 10.1016/j.jbi.2022.104200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2022] [Revised: 08/17/2022] [Accepted: 09/04/2022] [Indexed: 11/18/2022]
Abstract
In clinical records, much of the clinical information is recorded as free text, thus necessitating the use of advanced automatic information extraction technology. The development of practical technologies requires a corpus with finer-granularity annotations that describe the information in the corpus, but such annotation criteria have not been researched enough thus far. This study aimed to develop fine-grained annotation criteria that exhaustively cover patients' states in case reports. We collected 362 case reports, written in Japanese, of intractable diseases that were expected to contain a broad range of patients' states. Criteria were developed by repeatedly revising and annotating the clinical case report text. A set of annotation criteria for patients' states, consisting of 46 entity types, 9 attributes, and 36 relations, was obtained. It allows more detailed information to be expressed than in previous studies by covering a broader range of concept types, including treatment, and it captures clinical information based on a combination of causality and judgment, which could not be expressed before.
Affiliation(s)
- Emiko Shinohara
- Artificial Intelligence in Healthcare, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Daisaku Shibata
- Artificial Intelligence in Healthcare, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
- Yoshimasa Kawazoe
- Artificial Intelligence in Healthcare, Graduate School of Medicine, The University of Tokyo, Tokyo, Japan
8
The Hmong Medical Corpus: a biomedical corpus for a minority language. LANG RESOUR EVAL 2022. [DOI: 10.1007/s10579-022-09596-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
Abstract
Biomedical communication is an area that increasingly benefits from natural language processing (NLP) work. Biomedical named entity recognition (NER) in particular provides a foundation for advanced NLP applications, such as automated medical question-answering and translation services. However, while a large body of biomedical documents is available in an array of languages, most work in biomedical NER remains in English, with the remainder in official national or regional languages. Minority languages so far remain an underexplored area. The Hmong language, a minority language with sizable populations in several countries and without official status anywhere, represents an exceptional challenge for effective communication in medical contexts. Taking advantage of the large number of government-produced medical information documents in Hmong, we have developed the first named entity-annotated biomedical corpus for a resource-poor minority language. The Hmong Medical Corpus contains 100,535 tokens with 4554 named entities (NEs) of three UMLS semantic types: diseases/syndromes, signs/symptoms, and body parts/organs/organ components. Furthermore, a subset of the corpus is annotated for word position and parts of speech, representing the first such gold-standard dataset publicly available for Hmong. The methodology presented provides a readily reproducible approach for the creation of biomedical NE-annotated corpora for other resource-poor languages.
9
Oliveira LESE, Peters AC, da Silva AMP, Gebeluca CP, Gumiel YB, Cintho LMM, Carvalho DR, Al Hasan S, Moro CMC. SemClinBr - a multi-institutional and multi-specialty semantically annotated corpus for Portuguese clinical NLP tasks. J Biomed Semantics 2022; 13:13. [PMID: 35527259 PMCID: PMC9080187 DOI: 10.1186/s13326-022-00269-1] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/18/2020] [Accepted: 04/12/2022] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND The high volume of research focusing on extracting patient information from electronic health records (EHRs) has led to an increase in the demand for annotated corpora, which are a precious resource for both the development and evaluation of natural language processing (NLP) algorithms. The absence of a multipurpose clinical corpus outside the scope of the English language, especially in Brazilian Portuguese, is glaring and severely impacts scientific progress in the biomedical NLP field. METHODS In this study, a semantically annotated corpus was developed using clinical text from multiple medical specialties, document types, and institutions. In addition, we present, (1) a survey listing common aspects, differences, and lessons learned from previous research, (2) a fine-grained annotation schema that can be replicated to guide other annotation initiatives, (3) a web-based annotation tool focusing on an annotation suggestion feature, and (4) both intrinsic and extrinsic evaluation of the annotations. RESULTS This study resulted in SemClinBr, a corpus that has 1000 clinical notes, labeled with 65,117 entities and 11,263 relations. In addition, both negation cues and medical abbreviation dictionaries were generated from the annotations. The average annotator agreement score varied from 0.71 (applying strict match) to 0.92 (considering a relaxed match) while accepting partial overlaps and hierarchically related semantic types. The extrinsic evaluation, when applying the corpus to two downstream NLP tasks, demonstrated the reliability and usefulness of annotations, with the systems achieving results that were consistent with the agreement scores. CONCLUSION The SemClinBr corpus and other resources produced in this work can support clinical NLP studies, providing a common development and evaluation resource for the research community, boosting the utilization of EHRs in both clinical practice and biomedical research. 
To the best of our knowledge, SemClinBr is the first available Portuguese clinical corpus.
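The abstract above reports agreement under a strict match and under a relaxed match that accepts partial overlaps. The Python sketch below illustrates how such pairwise agreement could be computed as an F1 score; it is a simplified assumption rather than the authors' evaluation code. Annotations are modeled as hypothetical `(start, end, label)` tuples, and the relaxed mode here accepts overlapping spans with identical labels only, leaving out the hierarchically related semantic types that the study also accepted.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def matches(a, b, relaxed):
    """Strict: same boundaries and label. Relaxed: overlapping spans, same label."""
    if a[2] != b[2]:
        return False
    if relaxed:
        return overlaps(a, b)
    return (a[0], a[1]) == (b[0], b[1])


def agreement_f1(ann_a, ann_b, relaxed=False):
    """F1 agreement between two annotators, treating annotator A as the reference."""
    if not ann_a or not ann_b:
        return 0.0
    # Precision: share of B's annotations that find a match in A.
    matched_b = sum(any(matches(a, b, relaxed) for a in ann_a) for b in ann_b)
    # Recall: share of A's annotations that find a match in B.
    matched_a = sum(any(matches(a, b, relaxed) for b in ann_b) for a in ann_a)
    precision = matched_b / len(ann_b)
    recall = matched_a / len(ann_a)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Made-up annotations from two annotators over the same note.
annotator_a = [(0, 5, "Disorder"), (10, 20, "Procedure"), (30, 35, "Drug")]
annotator_b = [(0, 5, "Disorder"), (12, 20, "Procedure"), (40, 45, "Drug")]
strict_score = agreement_f1(annotator_a, annotator_b)
relaxed_score = agreement_f1(annotator_a, annotator_b, relaxed=True)
```

In this made-up example the strict score is 1/3 and the relaxed score is 2/3: the "Procedure" spans differ by two characters at the start, so they count only under relaxed matching. This is the kind of gap the abstract reports between 0.71 and 0.92.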
Affiliation(s)
- Lucas Emanuel Silva e Oliveira
- Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901 Brazil
- Ana Carolina Peters
- Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901 Brazil
- Adalniza Moura Pucca da Silva
- Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901 Brazil
- Caroline Pilatti Gebeluca
- Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901 Brazil
- Yohan Bonescki Gumiel
- Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901 Brazil
- Lilian Mie Mukai Cintho
- Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901 Brazil
- Deborah Ribeiro Carvalho
- Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901 Brazil
- Sadid Al Hasan
- AI Lab, Philips Research North America, Cambridge, MA USA
- Claudia Maria Cabral Moro
- Health Technology Program, Pontifical Catholic University of Paraná, Rua Imaculada Conceição, 1155 - Curitiba, Paraná, 80215-901 Brazil
10
Constructing novel datasets for intent detection and NER in a Korean healthcare advice system: guidelines and empirical results. APPL INTELL 2022. [DOI: 10.1007/s10489-022-03400-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
11
Privacy-Preserving Mimic Models for clinical Named Entity Recognition in French. J Biomed Inform 2022; 130:104073. [DOI: 10.1016/j.jbi.2022.104073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/10/2021] [Revised: 02/09/2022] [Accepted: 04/07/2022] [Indexed: 11/18/2022]
12
Giachelle F, Irrera O, Silvello G. MedTAG: a portable and customizable annotation tool for biomedical documents. BMC Med Inform Decis Mak 2021; 21:352. [PMID: 34922517 PMCID: PMC8684237 DOI: 10.1186/s12911-021-01706-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/27/2021] [Accepted: 12/01/2021] [Indexed: 01/08/2023] Open
Abstract
BACKGROUND Semantic annotators and Natural Language Processing (NLP) methods for Named Entity Recognition and Linking (NER+L) require plenty of training and test data, especially in the biomedical domain. Despite the abundance of unstructured biomedical data, the lack of richly annotated biomedical datasets hinders the further development of NER+L algorithms for any effective secondary use. In addition, manual annotation of biomedical documents performed by physicians and experts is a costly and time-consuming task. To support, organize and speed up the annotation process, we introduce MedTAG, a collaborative biomedical annotation tool that is open-source, platform-independent, and free to use/distribute. RESULTS We present the main features of MedTAG and how it has been employed in the histopathology domain by physicians and experts to annotate more than seven thousand clinical reports manually. We compare MedTAG with a set of well-established biomedical annotation tools, including BioQRator, ezTag, MyMiner, and tagtog, weighing their pros and cons against those of MedTAG. We highlight that MedTAG is one of the very few open-source tools provided with an open license and a straightforward installation procedure supporting cross-platform use. CONCLUSIONS MedTAG has been designed according to five requirements (i.e. available, distributable, installable, workable and schematic) defined in a recent extensive review of manual annotation tools. Moreover, MedTAG satisfies 20 of the 22 criteria specified in the same study.
Affiliation(s)
- Fabio Giachelle
- Department of Information Engineering, University of Padua, Padua, Italy
- Ornella Irrera
- Department of Information Engineering, University of Padua, Padua, Italy
- Gianmaria Silvello
- Department of Information Engineering, University of Padua, Padua, Italy
13
Clinical Concept Extraction with Lexical Semantics to Support Automatic Annotation. INTERNATIONAL JOURNAL OF ENVIRONMENTAL RESEARCH AND PUBLIC HEALTH 2021; 18:10564. [PMID: 34682315 PMCID: PMC8535468 DOI: 10.3390/ijerph182010564] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/01/2021] [Revised: 09/30/2021] [Accepted: 10/01/2021] [Indexed: 12/22/2022]
Abstract
Extracting clinical concepts, such as problems, diagnosis, and treatment, from unstructured clinical narrative documents enables data-driven approaches such as machine and deep learning to support advanced applications such as clinical decision-support systems, the assessment of disease progression, and the intelligent analysis of treatment efficacy. Various tools such as cTAKES, Sophia, MetaMap, and other rule-based approaches and algorithms have been used for automatic concept extraction. Recently, machine- and deep-learning approaches have been used to extract, classify, and accurately annotate terms and phrases. However, the requirement of an annotated dataset, which is labor-intensive to produce, impedes the success of data-driven approaches. A rule-based mechanism could support the process of annotation, but existing rule-based approaches fail to adequately capture contextual, syntactic, and semantic patterns. This study intends to introduce a comprehensive rule-based system that automatically extracts clinical concepts from unstructured narratives with higher accuracy and transparency. The proposed system is a pipelined approach, capable of recognizing clinical concepts of three types (problem, treatment, and test) in the dataset collected from a published repository as part of the 2010 i2b2 challenge. The system’s performance is compared with that of three existing systems: QuickUMLS, BIO-CRF, and the Rules (i2b2) model. The system's average F1-score of 72.94% was found to be 13% better than QuickUMLS, 3% better than BIO-CRF, and 30.1% better than the Rules (i2b2) model. Individually, the system performance was noticeably higher for problem-related concepts, with an F1-score of 80.45%, followed by treatment-related concepts and test-related concepts, with F1-scores of 76.06% and 55.3%, respectively.
The proposed methodology significantly improves the performance of concept extraction from unstructured clinical narratives by exploiting the linguistic and lexical semantic features. The approach can ease the automatic annotation process of clinical data, which ultimately improves the performance of supervised data-driven applications trained with these data.
|
14
|
Park J, You SC, Jeong E, Weng C, Park D, Roh J, Lee DY, Cheong JY, Choi JW, Kang M, Park RW. A Framework (SOCRATex) for Hierarchical Annotation of Unstructured Electronic Health Records and Integration Into a Standardized Medical Database: Development and Usability Study. JMIR Med Inform 2021; 9:e23983. [PMID: 33783361 PMCID: PMC8044740 DOI: 10.2196/23983] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 08/31/2020] [Revised: 11/14/2020] [Accepted: 01/23/2021] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND Although electronic health records (EHRs) have been widely used in secondary assessments, clinical documents are relatively less utilized owing to the lack of standardized clinical text frameworks across different institutions. OBJECTIVE This study aimed to develop a framework for processing unstructured clinical documents of EHRs and integration with standardized structured data. METHODS We developed a framework known as Staged Optimization of Curation, Regularization, and Annotation of clinical text (SOCRATex). SOCRATex has the following four aspects: (1) extracting clinical notes for the target population and preprocessing the data, (2) defining the annotation schema with a hierarchical structure, (3) performing document-level hierarchical annotation using the annotation schema, and (4) indexing annotations for a search engine system. To test the usability of the proposed framework, proof-of-concept studies were performed on EHRs. We defined three distinctive patient groups and extracted their clinical documents (ie, pathology reports, radiology reports, and admission notes). The documents were annotated and integrated into the Observational Medical Outcomes Partnership (OMOP)-common data model (CDM) database. The annotations were used for creating Cox proportional hazard models with different settings of clinical analyses to measure (1) all-cause mortality, (2) thyroid cancer recurrence, and (3) 30-day hospital readmission. RESULTS Overall, 1055 clinical documents of 953 patients were extracted and annotated using the defined annotation schemas. The generated annotations were indexed into an unstructured textual data repository. Using the annotations of pathology reports, we identified that node metastasis and lymphovascular tumor invasion were associated with all-cause mortality among colon and rectum cancer patients (both P=.02). 
The other analyses involving measuring thyroid cancer recurrence using radiology reports and 30-day hospital readmission using admission notes in depressive disorder patients also showed results consistent with previous findings. CONCLUSIONS We propose a framework for hierarchical annotation of textual data and integration into a standardized OMOP-CDM medical database. The proof-of-concept studies demonstrated that our framework can effectively process and integrate diverse clinical documents with standardized structured data for clinical research.
Affiliation(s)
- Jimyung Park: Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea
- Seng Chan You: Department of Preventive Medicine and Public Health, Yonsei University College of Medicine, Seoul, Republic of Korea
- Eugene Jeong: Department of Biomedical Informatics, Vanderbilt University School of Medicine, Nashville, TN, United States
- Chunhua Weng: Department of Biomedical Informatics, Columbia University, New York, NY, United States
- Dongsu Park: Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea
- Jin Roh: Department of Pathology, Ajou University Hospital, Suwon, Republic of Korea
- Dong Yun Lee: Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea
- Jae Youn Cheong: Department of Gastroenterology, Ajou University School of Medicine, Suwon, Republic of Korea
- Jin Wook Choi: Department of Radiology, Ajou University School of Medicine, Suwon, Republic of Korea
- Mira Kang: Department of Digital Health, Samsung Advanced Institute for Health Sciences & Technology, Sungkyunkwan University, Seoul, Republic of Korea
- Rae Woong Park: Department of Biomedical Sciences, Ajou University Graduate School of Medicine, Suwon, Republic of Korea; Department of Biomedical Informatics, Ajou University School of Medicine, Suwon, Republic of Korea
|
15
|
Campillos-Llanos L, Valverde-Mateos A, Capllonch-Carrión A, Moreno-Sandoval A. A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine. BMC Med Inform Decis Mak 2021; 21:69. [PMID: 33618727 PMCID: PMC7898014 DOI: 10.1186/s12911-021-01395-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 09/29/2020] [Accepted: 01/12/2021] [Indexed: 11/28/2022] Open
Abstract
BACKGROUND The large volume of medical literature makes it difficult for healthcare professionals to keep abreast of the latest studies that support Evidence-Based Medicine. Natural language processing enhances the access to relevant information, and gold standard corpora are required to improve systems. To contribute a new dataset for this domain, we collected the Clinical Trials for Evidence-Based Medicine in Spanish (CT-EBM-SP) corpus. METHODS We annotated 1200 texts about clinical trials with entities from the Unified Medical Language System semantic groups: anatomy (ANAT), pharmacological and chemical substances (CHEM), pathologies (DISO), and lab tests, diagnostic or therapeutic procedures (PROC). We doubly annotated 10% of the corpus and measured inter-annotator agreement (IAA) using F-measure. As a use case, we ran medical entity recognition experiments with neural network models. RESULTS This resource contains 500 abstracts of journal articles about clinical trials and 700 announcements of trial protocols (292 173 tokens). We annotated 46 699 entities (13.98% are nested entities). Regarding IAA, we obtained an average F-measure of 85.65% (±4.79, strict match) and 93.94% (±3.31, relaxed match). In the use case experiments, we achieved recognition results ranging from 80.28% (±0.99) to 86.74% (±0.19) of average F-measure. CONCLUSIONS Our results show that this resource is adequate for experiments with state-of-the-art approaches to biomedical named entity recognition. It is freely distributed at: http://www.lllf.uam.es/ESP/nlpmedterm_en.html . The methods are generalizable to other languages with similar available sources.
Affiliation(s)
- Leonardo Campillos-Llanos: Computational Linguistics Laboratory, Universidad Autónoma de Madrid, C/Francisco Tomás y Valiente 1, Cantoblanco Campus, 28049 Madrid, Spain
- Ana Valverde-Mateos: Medical Terminology Unit, Spanish Royal Academy of Medicine, C/Arrieta 12, 28013 Madrid, Spain
- Antonio Moreno-Sandoval: Computational Linguistics Laboratory, Universidad Autónoma de Madrid, C/Francisco Tomás y Valiente 1, Cantoblanco Campus, 28049 Madrid, Spain
|
16
|
Terminologies augmented recurrent neural network model for clinical named entity recognition. J Biomed Inform 2020; 102:103356. [DOI: 10.1016/j.jbi.2019.103356] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 05/15/2019] [Revised: 11/14/2019] [Accepted: 12/10/2019] [Indexed: 11/19/2022]
|
17
|
Deléger L, Campillos L, Ligozat AL, Névéol A. Design of an extensive information representation scheme for clinical narratives. J Biomed Semantics 2017; 8:37. [PMID: 28893314 PMCID: PMC5594525 DOI: 10.1186/s13326-017-0135-z] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 11/10/2016] [Accepted: 07/26/2017] [Indexed: 12/26/2022] Open
Abstract
BACKGROUND Knowledge representation frameworks are essential to the understanding of complex biomedical processes, and to the analysis of biomedical texts that describe them. Combined with natural language processing (NLP), they have the potential to contribute to retrospective studies by unlocking important phenotyping information contained in the narrative content of electronic health records (EHRs). This work aims to develop an extensive information representation scheme for clinical information contained in EHR narratives, and to support secondary use of EHR narrative data to answer clinical questions. METHODS We review recent work that proposed information representation schemes and applied them to the analysis of clinical narratives. We then propose a unifying scheme that supports the extraction of information to address a large variety of clinical questions. RESULTS We devised a new information representation scheme for clinical narratives that comprises 13 entities, 11 attributes and 37 relations. The associated annotation guidelines can be used to consistently apply the scheme to clinical narratives and are available at https://cabernet.limsi.fr/annotation_guide_for_the_merlot_french_clinical_corpus-Sept2016.pdf . CONCLUSION The information scheme includes many elements of the major schemes described in the clinical natural language processing literature, as well as a uniquely detailed set of relations.
Affiliation(s)
- Louise Deléger: French National Institute for Agricultural Research (INRA), Domaine de Vilvert, Jouy en Josas, Paris, 78352, France; LIMSI, CNRS, Université Paris-Saclay, Rue John von Neumann, Orsay, 91405, France
- Leonardo Campillos: LIMSI, CNRS, Université Paris-Saclay, Rue John von Neumann, Orsay, 91405, France
- Anne-Laure Ligozat: LIMSI, CNRS, Université Paris-Saclay, Rue John von Neumann, Orsay, 91405, France; ENSIIE, 1 square de la résistance, Évry Cedex, 91025, France
- Aurélie Névéol: LIMSI, CNRS, Université Paris-Saclay, Rue John von Neumann, Orsay, 91405, France
|