1
Garda S, Weber-Genzel L, Martin R, Leser U. BELB: a biomedical entity linking benchmark. Bioinformatics 2023; 39:btad698. [PMID: 37975879 PMCID: PMC10681865 DOI: 10.1093/bioinformatics/btad698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2023] [Revised: 10/30/2023] [Accepted: 11/16/2023] [Indexed: 11/19/2023] [Open access]
Abstract
MOTIVATION: Biomedical entity linking (BEL) is the task of grounding entity mentions to a knowledge base (KB). It plays a vital role in information extraction pipelines for the life sciences literature. We review recent work in the field and find that, as the task is absent from existing benchmarks for biomedical text mining, different studies adopt different experimental setups, making comparisons based on published numbers problematic. Furthermore, neural systems are tested primarily on instances linked to the broad-coverage KB UMLS, leaving their performance on more specialized KBs, e.g. for genes or variants, understudied.
RESULTS: We therefore developed BELB, a biomedical entity linking benchmark, providing access in a unified format to 11 corpora linked to 7 KBs and spanning six entity types: gene, disease, chemical, species, cell line, and variant. BELB greatly reduces preprocessing overhead in testing BEL systems on multiple corpora, offering a standardized testbed for reproducible experiments. Using BELB, we perform an extensive evaluation of six rule-based entity-specific systems and three recent neural approaches leveraging pre-trained language models. Our results reveal a mixed picture, showing that neural approaches fail to perform consistently across entity types and highlighting the need for further studies towards entity-agnostic models.
AVAILABILITY AND IMPLEMENTATION: The source code of BELB is available at: https://github.com/sg-wbi/belb. The code to reproduce our experiments can be found at: https://github.com/sg-wbi/belb-exp.
Affiliation(s)
- Samuele Garda
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Leon Weber-Genzel
- Center for Information and Language Processing, Ludwig-Maximilians-Universität München, München 80539, Germany
- Robert Martin
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
- Ulf Leser
- Computer Science Department, Humboldt-Universität zu Berlin, Berlin 10099, Germany
2
Frei J, Frei-Stuber L, Kramer F. GERNERMED++: Semantic annotation in German medical NLP through transfer-learning, translation and word alignment. J Biomed Inform 2023; 147:104513. [PMID: 37838290 DOI: 10.1016/j.jbi.2023.104513] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/14/2022] [Revised: 09/27/2023] [Accepted: 10/04/2023] [Indexed: 10/16/2023]
Abstract
We present a statistical model, GERNERMED++, for German medical natural language processing trained for named entity recognition (NER) as an open, publicly available model. We demonstrate the effectiveness of combining multiple techniques to achieve strong entity recognition performance by means of transfer-learning on pre-trained deep language models (LM), word-alignment and neural machine translation, outperforming a pre-existing baseline model on several datasets. Because open, public medical entity recognition models for German texts are scarce, this work offers benefits to the German research community on medical NLP as a baseline model. The work serves as a refined successor to our first GERNERMED model. Similar to our previous work, our trained model is publicly available to other researchers. The sample code and the statistical model are available at: https://github.com/frankkramer-lab/GERNERMED-pp.
Affiliation(s)
- Johann Frei
- IT-Infrastructure for Translational Medical Research, University of Augsburg, Alter Postweg 101, 86159 Augsburg, Germany.
- Ludwig Frei-Stuber
- Institute and Outpatient Clinic for Occupational, Social and Environmental Medicine, 80336 Munich, Germany.
- Frank Kramer
- IT-Infrastructure for Translational Medical Research, University of Augsburg, Alter Postweg 101, 86159 Augsburg, Germany.
3
Frei J, Kramer F. Annotated dataset creation through large language models for non-english medical NLP. J Biomed Inform 2023; 145:104478. [PMID: 37625508 DOI: 10.1016/j.jbi.2023.104478] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/05/2023] [Revised: 08/01/2023] [Accepted: 08/21/2023] [Indexed: 08/27/2023]
Abstract
Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts often requires custom-designed datasets to address NLP tasks in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several minor and major, interconnected problems such as the lack of task-matching datasets as well as task-specific pre-trained models. In our work, we suggest leveraging pre-trained large language models for training data acquisition in order to retrieve sufficiently large datasets for training smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of our approach, we create a custom dataset that we use to train a medical NER model for German texts, GPTNERMED, yet our method remains language-independent in principle. Our obtained dataset as well as our pre-trained models are publicly available at https://github.com/frankkramer-lab/GPTNERMED.
Affiliation(s)
- Johann Frei
- IT-Infrastructure for Translational Medical Research, University of Augsburg, Alter Postweg 101, 86159 Augsburg, Germany.
- Frank Kramer
- IT-Infrastructure for Translational Medical Research, University of Augsburg, Alter Postweg 101, 86159 Augsburg, Germany.
4
Solarte-Pabón O, Montenegro O, García-Barragán A, Torrente M, Provencio M, Menasalvas E, Robles V. Transformers for extracting breast cancer information from Spanish clinical narratives. Artif Intell Med 2023; 143:102625. [PMID: 37673566 DOI: 10.1016/j.artmed.2023.102625] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2022] [Revised: 05/11/2023] [Accepted: 07/08/2023] [Indexed: 09/08/2023]
Abstract
The wide adoption of electronic health records (EHRs) offers immense potential as a source of support for clinical research. However, previous studies focused on extracting only a limited set of medical concepts to support information extraction in the cancer domain for the Spanish language. Building on the success of deep learning for processing natural language texts, this paper proposes a transformer-based approach to extract named entities from breast cancer clinical notes written in Spanish and compares several language models. To facilitate this approach, a schema for annotating clinical notes with breast cancer concepts is presented, and a corpus for breast cancer is developed. Results indicate that both BERT-based and RoBERTa-based language models demonstrate competitive performance in clinical Named Entity Recognition (NER). Specifically, BETO and multilingual BERT achieve F-scores of 93.71% and 94.63%, respectively. Additionally, RoBERTa Biomedical attains an F-score of 95.01%, while RoBERTa BNE achieves an F-score of 94.54%. The findings suggest that transformers can feasibly extract information in the clinical domain in the Spanish language, with the use of models trained on biomedical texts contributing to enhanced results. The proposed approach takes advantage of transfer learning techniques by fine-tuning language models to automatically represent text features and avoiding the time-consuming feature engineering process.
Affiliation(s)
- Oswaldo Solarte-Pabón
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Madrid, Spain; Escuela de Ingeniería de Sistemas, Universidad del Valle, Cali, Colombia.
- Orlando Montenegro
- Escuela de Ingeniería de Sistemas, Universidad del Valle, Cali, Colombia
- Maria Torrente
- Hospital Universitario Puerta de Hierro de Madrid, Madrid, Spain
- Ernestina Menasalvas
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Madrid, Spain
- Víctor Robles
- Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Madrid, Spain
5
Shaitarova A, Zaghir J, Lavelli A, Krauthammer M, Rinaldi F. Exploring the Latest Highlights in Medical Natural Language Processing across Multiple Languages: A Survey. Yearb Med Inform 2023; 32:230-243. [PMID: 38147865 PMCID: PMC10751112 DOI: 10.1055/s-0043-1768726] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/28/2023] [Open access]
Abstract
OBJECTIVES: This survey aims to provide an overview of the current state of biomedical and clinical Natural Language Processing (NLP) research and practice in Languages other than English (LoE). We pay special attention to data resources, language models, and popular NLP downstream tasks.
METHODS: We explore the literature on clinical and biomedical NLP from the years 2020-2022, focusing on the challenges of multilinguality and LoE. We query online databases and manually select relevant publications. We also use recent NLP review papers to identify the possible information lacunae.
RESULTS: Our work confirms the recent trend towards the use of transformer-based language models for a variety of NLP tasks in medical domains. In addition, there has been an increase in the availability of annotated datasets for clinical NLP in LoE, particularly in European languages such as Spanish, German and French. Common NLP tasks addressed in medical NLP research in LoE include information extraction, named entity recognition, normalization, linking, and negation detection. However, there is still a need for the development of annotated datasets and models specifically tailored to the unique characteristics and challenges of medical text in some of these languages, especially low-resource ones. Lastly, this survey highlights the progress of medical NLP in LoE and helps identify opportunities for future research and development in this field.
Affiliation(s)
- Jamil Zaghir
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
- Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland
- Alberto Lavelli
- Natural Language Processing Research Unit, Center for Digital Health and Wellbeing, Fondazione Bruno Kessler, Trento, Italy
- Michael Krauthammer
- Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
- Biomedical Informatics, University Hospital Zurich, Zurich, Switzerland
- Fabio Rinaldi
- Natural Language Processing Research Unit, Center for Digital Health and Wellbeing, Fondazione Bruno Kessler, Trento, Italy
- Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland
- Dalle Molle Institute for Artificial Intelligence Research, Lugano, Switzerland
- Swiss Institute of Bioinformatics
6
Richter-Pechanski P, Wiesenbach P, Schwab DM, Kiriakou C, He M, Allers MM, Tiefenbacher AS, Kunz N, Martynova A, Spiller N, Mierisch J, Borchert F, Schwind C, Frey N, Dieterich C, Geis NA. A distributable German clinical corpus containing cardiovascular clinical routine doctor's letters. Sci Data 2023; 10:207. [PMID: 37059736 PMCID: PMC10104831 DOI: 10.1038/s41597-023-02128-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2022] [Accepted: 03/31/2023] [Indexed: 04/16/2023] [Open access]
Abstract
We present CARDIO:DE, the first freely available and distributable large German clinical corpus from the cardiovascular domain. CARDIO:DE encompasses 500 clinical routine German doctor's letters from Heidelberg University Hospital, which were manually annotated. Our prospective study design complies well with current data protection regulations and allows us to keep the original structure of clinical documents consistent. In order to ease access to our corpus, we manually de-identified all letters. To enable various information extraction tasks the temporal information in the documents was preserved. We added two high-quality manual annotation layers to CARDIO:DE, (1) medication information and (2) CDA-compliant section classes. To the best of our knowledge, CARDIO:DE is the first freely available and distributable German clinical corpus in the cardiovascular domain. In summary, our corpus offers unique opportunities for collaborative and reproducible research on natural language processing models for German clinical texts.
Affiliation(s)
- Phillip Richter-Pechanski
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim, Heidelberg, DE, Germany
- Informatics for Life, Heidelberg, DE, Germany
- Philipp Wiesenbach
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Informatics for Life, Heidelberg, DE, Germany
- Dominic M Schwab
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Christina Kiriakou
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Mingyang He
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Michael M Allers
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Anna S Tiefenbacher
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Nicola Kunz
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Anna Martynova
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Noemie Spiller
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Julian Mierisch
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Florian Borchert
- Digital Health Center, Hasso Plattner Institute, University of Potsdam, Potsdam, DE, Germany
- Charlotte Schwind
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Norbert Frey
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim, Heidelberg, DE, Germany
- Informatics for Life, Heidelberg, DE, Germany
- Christoph Dieterich
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim, Heidelberg, DE, Germany
- Informatics for Life, Heidelberg, DE, Germany
- Nicolas A Geis
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Informatics for Life, Heidelberg, DE, Germany
7
Kreuzthaler M, Brochhausen M, Zayas C, Blobel B, Schulz S. Linguistic and ontological challenges of multiple domains contributing to transformed health ecosystems. Front Med (Lausanne) 2023; 10:1073313. [PMID: 37007792 PMCID: PMC10050682 DOI: 10.3389/fmed.2023.1073313] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2022] [Accepted: 02/13/2023] [Indexed: 03/17/2023] [Open access]
Abstract
This paper provides an overview of current linguistic and ontological challenges which have to be met to fully support the transformation of health ecosystems so that they meet precision medicine (5 PM) standards. It highlights both standardization and interoperability aspects regarding formal, controlled representations of clinical and research data, as well as requirements for smart support to produce and encode content in a way that humans and machines can understand and process. Starting from the current text-centered communication practices in healthcare and biomedical research, it addresses the state of the art in information extraction using natural language processing (NLP). An important aspect of the language-centered perspective of managing health data is the integration of heterogeneous data sources, employing different natural languages and different terminologies. This is where biomedical ontologies, in the sense of formal, interchangeable representations of types of domain entities, come into play. The paper discusses the state of the art of biomedical ontologies, addresses their importance for standardization and interoperability, and sheds light on current misconceptions and shortcomings. Finally, the paper points out next steps and possible synergies of both the field of NLP and the area of Applied Ontology and Semantic Web to foster data interoperability for 5 PM.
Affiliation(s)
- Markus Kreuzthaler
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria
- Mathias Brochhausen
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Cilia Zayas
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
- Bernd Blobel
- Medical Faculty, University of Regensburg, Regensburg, Germany
- eHealth Competence Center Bavaria, Deggendorf Institute of Technology, Deggendorf, Germany
- First Medical Faculty, Charles University Prague, Prague, Czechia
- Stefan Schulz
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria
- Averbis GmbH, Freiburg, Germany
- *Correspondence: Stefan Schulz,
8
French E, McInnes BT. An overview of biomedical entity linking throughout the years. J Biomed Inform 2023; 137:104252. [PMID: 36464228 PMCID: PMC9845184 DOI: 10.1016/j.jbi.2022.104252] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 06/22/2022] [Revised: 09/19/2022] [Accepted: 11/15/2022] [Indexed: 12/04/2022]
Abstract
Biomedical Entity Linking (BEL) is the task of mapping spans of text within biomedical documents to normalized, unique identifiers within an ontology. This is an important task in natural language processing, both for translational information extraction applications and for providing context for downstream tasks like relationship extraction. In this paper, we survey the progression of BEL from its inception in the late 80s to present-day state-of-the-art systems, provide a comprehensive list of datasets available for training BEL systems, reference shared tasks focused on BEL, discuss the technical components that comprise BEL systems, and discuss possible directions for the future of the field.
Affiliation(s)
- Evan French
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Bridget T McInnes
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
9
Lentzen M, Madan S, Lage-Rupprecht V, Kühnel L, Fluck J, Jacobs M, Mittermaier M, Witzenrath M, Brunecker P, Hofmann-Apitius M, Weber J, Fröhlich H. Critical assessment of transformer-based AI models for German clinical notes. JAMIA Open 2022; 5:ooac087. [PMID: 36380848 PMCID: PMC9663939 DOI: 10.1093/jamiaopen/ooac087] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/13/2022] [Revised: 10/02/2022] [Accepted: 10/25/2022] [Indexed: 11/17/2022] [Open access]
Abstract
Objective: Healthcare data such as clinical notes are primarily recorded in an unstructured manner. If adequately translated into structured data, they can be utilized for health economics and set the groundwork for better individualized patient care. To structure clinical notes, deep-learning methods, particularly transformer-based models like Bidirectional Encoder Representations from Transformers (BERT), have recently received much attention. Currently, biomedical applications are primarily focused on the English language. While general-purpose German-language models such as GermanBERT and GottBERT have been published, adaptations for biomedical data are unavailable. This study evaluated the suitability of existing and novel transformer-based models for the German biomedical and clinical domain.
Materials and Methods: We used 8 transformer-based models and pre-trained 3 new models on a newly generated biomedical corpus, and systematically compared them with each other. We annotated a new dataset of clinical notes and used it with 4 other corpora (BRONCO150, CLEF eHealth 2019 Task 1, GGPONC, and JSynCC) to perform named entity recognition (NER) and document classification tasks.
Results: General-purpose language models can be used effectively for biomedical and clinical natural language processing (NLP) tasks; still, our newly trained BioGottBERT model outperformed GottBERT on both clinical NER tasks. However, training new biomedical models from scratch proved ineffective.
Discussion: The domain-adaptation strategy's potential is currently limited due to a lack of pre-training data. Since general-purpose language models are only marginally inferior to domain-specific models, both options are suitable for developing German-language biomedical applications.
Conclusion: General-purpose language models perform remarkably well on biomedical and clinical NLP tasks. If larger corpora become available in the future, domain-adapting these models may improve performance.
Affiliation(s)
- Manuel Lentzen
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, Germany; Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, Germany
- Sumit Madan
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, Germany; Institute of Computer Science, University of Bonn, Bonn, Germany
- Vanessa Lage-Rupprecht
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, Germany
- Lisa Kühnel
- Knowledge Management, ZB MED – Information Centre for Life Sciences, Cologne, Germany; Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Juliane Fluck
- Knowledge Management, ZB MED – Information Centre for Life Sciences, Cologne, Germany; The Agricultural Faculty, University of Bonn, Bonn, Germany
- Marc Jacobs
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, Germany
- Mirja Mittermaier
- Department of Infectious Diseases and Respiratory Medicine, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany; Berlin Institute of Health (BIH) at Charité – Universitätsmedizin Berlin, Berlin, Germany
- Martin Witzenrath
- Department of Infectious Diseases and Respiratory Medicine, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin, Humboldt-Universität zu Berlin, Berlin, Germany; German Center for Lung Research (DZL), Partner Site Charité, Berlin, Germany
- Peter Brunecker
- Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Core Facility Research IT, Berlin, Germany
- Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, Germany; Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, Germany
- Joachim Weber
- Berlin Institute of Health (BIH) at Charité – Universitätsmedizin Berlin, Berlin, Germany; Charité – Universitätsmedizin Berlin, Center for Stroke Research Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany; Department of Neurology, Charité – Universitätsmedizin Berlin, Corporate Member of Freie Universität Berlin and Humboldt-Universität zu Berlin, Berlin, Germany
- Holger Fröhlich
- Corresponding Author: Prof. Dr. Holger Fröhlich, Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53757 Sankt Augustin, Germany;
10
Richter-Pechanski P, Geis NA, Kiriakou C, Schwab DM, Dieterich C. Automatic extraction of 12 cardiovascular concepts from German discharge letters using pre-trained language models. Digit Health 2021; 7:20552076211057662. [PMID: 34868618 PMCID: PMC8637713 DOI: 10.1177/20552076211057662] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2021] [Accepted: 10/15/2021] [Indexed: 11/17/2022] [Open access]
Abstract
Objective: A vast amount of medical data is still stored in unstructured text documents. We present an automated method of information extraction from German unstructured clinical routine data from the cardiology domain enabling their usage in state-of-the-art data-driven deep learning projects.
Methods: We evaluated pre-trained language models to extract a set of 12 cardiovascular concepts in German discharge letters. We compared three bidirectional encoder representations from transformers pre-trained on different corpora and fine-tuned them on the task of cardiovascular concept extraction using 204 discharge letters manually annotated by cardiologists at the University Hospital Heidelberg. We compared our results with traditional machine learning methods based on a long short-term memory network and a conditional random field.
Results: Our best performing model, based on publicly available German pre-trained bidirectional encoder representations from the transformer model, achieved a token-wise micro-average F1-score of 86% and outperformed the baseline by at least 6%. Moreover, this approach achieved the best trade-off between precision (positive predictive value) and recall (sensitivity).
Conclusion: Our results show the applicability of state-of-the-art deep learning methods using pre-trained language models for the task of cardiovascular concept extraction using limited training data. This minimizes annotation efforts, which are currently the bottleneck of any application of data-driven deep learning projects in the clinical domain for German and many other European languages.
Affiliation(s)
- Phillip Richter-Pechanski
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, Germany; Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany; German Center for Cardiovascular Research (DZHK) - Partner Site Heidelberg/Mannheim, Mannheim, Germany; Informatics for Life, Heidelberg, Germany
- Nicolas A Geis
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany; Informatics for Life, Heidelberg, Germany
- Christina Kiriakou
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany
- Dominic M Schwab
- Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany
- Christoph Dieterich
- Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, Germany; Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, Germany; German Center for Cardiovascular Research (DZHK) - Partner Site Heidelberg/Mannheim, Mannheim, Germany; Informatics for Life, Heidelberg, Germany