1
Hahn U. Clinical document corpora-real ones, translated and synthetic substitutes, and assorted domain proxies: a survey of diversity in corpus design, with focus on German text data. JAMIA Open 2025; 8:ooaf024. [PMID: 40371384 PMCID: PMC12077144 DOI: 10.1093/jamiaopen/ooaf024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2024] [Revised: 03/04/2025] [Accepted: 03/24/2025] [Indexed: 05/16/2025] Open
Abstract
Objective We survey clinical document corpora, with a focus on German textual data. Due to rigid data privacy legislation in Germany, these resources, with only a few exceptions, are stored in protected clinical data spaces and locked against clinic-external researchers. This situation stands in stark contrast with established workflows in the field of natural language processing, where easy accessibility and reuse of (textual) data collections are common practice. Hence, alternative corpus designs have been examined to escape from data poverty. Besides machine translation of English clinical datasets and the generation of synthetic corpora with fictitious clinical contents, several types of domain proxies have come up as substitutes for real clinical documents. Common instances of close proxies are medical journal publications, therapy guidelines, drug labels, etc.; more distant proxies include medical contents from social media channels or online encyclopedic medical articles. Methods We follow the PRISMA (Preferred Reporting Items for Systematic reviews and Meta-analyses) guidelines for surveying the field of German-language clinical/medical corpora. Four bibliographic databases were searched: PubMed, ACL Anthology, Google Scholar, and the author's personal literature database. Results After PRISMA-conformant identification of 362 hits from the 4 bibliographic systems, the screening process yielded 78 relevant documents for inclusion in this review. They contained overall 92 different published versions of corpora, of which 71 were truly unique in terms of their underlying document sets. Out of these, the majority were clinical corpora: 46 real ones, of which 32 were unique, 5 translated ones (3 unique), and 6 synthetic ones (3 unique). As to domain proxies, we identified 18 close ones (16 unique) and 17 distant ones (all of them unique).
Discussion There is a clear divide between the large number of non-accessible real clinical German-language corpora and their publicly accessible substitutes: translated or synthetic datasets, close or more distant proxies. So, at first sight, the data bottleneck seems broken. Intuitively, however, differences in genre-specific writing style, lexical or terminological diction, and required medical background expertise in this typological space are also obvious. This raises the question of how valid alternative corpus designs really are. A systematic, empirically grounded yardstick for comparing real clinical corpora with those suggested substitutes and proxies is still missing. Conclusion The extreme sparsity of real clinical corpora in almost all non-Anglo-American countries worldwide, Germany in particular, has triggered an active search for alternative, publicly accessible data resources laid out in this survey. However, the utility of these substitutes compared with real clinical corpora and their semantic and genre-specific distance to real clinical corpora is still under-researched, so that their value remains to be assessed properly. Furthermore, corpus descriptions are often incomplete with respect to relevant descriptive attributes. This paper bundles these observations and proposes a template for a so-called corpus card to improve adequate corpus documentation.
Affiliation(s)
- Udo Hahn
- Institute for Medical Informatics, Statistics and Epidemiology (IMISE), University of Leipzig, D-04107 Leipzig, Saxony, Germany
2
Borchert F, Llorca I, Roller R, Arnrich B, Schapranow MP. xMEN: a modular toolkit for cross-lingual medical entity normalization. JAMIA Open 2025; 8:ooae147. [PMID: 39735785 PMCID: PMC11671143 DOI: 10.1093/jamiaopen/ooae147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2024] [Revised: 12/03/2024] [Accepted: 12/12/2024] [Indexed: 12/31/2024] Open
Abstract
Objective To improve performance of medical entity normalization across many languages, especially when fewer language resources are available compared to English. Materials and Methods We propose xMEN, a modular system for cross-lingual (x) medical entity normalization (MEN), accommodating both low- and high-resource scenarios. To account for the scarcity of aliases for many target languages and terminologies, we leverage multilingual aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder (CE) model if annotations for the target task are available. To balance the output of general-purpose candidate generators with subsequent trainable re-rankers, we introduce a novel rank regularization term in the loss function for training CEs. For re-ranking without gold-standard annotations, we introduce multiple new weakly labeled datasets using machine translation and projection of annotations from a high-resource language. Results xMEN improves the state-of-the-art performance across various benchmark datasets for several European languages. Weakly supervised CEs are effective when no training data is available for the target task. Discussion We perform an analysis of normalization errors, revealing that complex entities are still challenging to normalize. New modules and benchmark datasets can be easily integrated in the future. Conclusion xMEN exhibits strong performance for medical entity normalization in many languages, even when no labeled data and few terminology aliases for the target language are available. To enable reproducible benchmarks in the future, we make the system available as an open-source Python toolkit. The pre-trained models and source code are available online: https://github.com/hpi-dhc/xmen.
Affiliation(s)
- Florian Borchert
- Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany
- Ignacio Llorca
- Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany
- Roland Roller
- Speech and Language Technology Lab, German Research Center for Artificial Intelligence (DFKI), Berlin 10559, Germany
- Bert Arnrich
- Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany
- Matthieu-P Schapranow
- Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany
3
Specht L, Scheible R, Boeker M, Farin-Glattacker E, Kampel N, Schmölz M, Schöpf-Lazzarino A, Schulz S, Schlett C, Thomczyk F, Voigt-Radloff S, Wegner C, Wollmann K, Maun A. Evaluating the Acceptance and Usability of an Independent, Noncommercial Search Engine for Medical Information: Cross-Sectional Questionnaire Study and User Behavior Tracking Analysis. JMIR Hum Factors 2025; 12:e56941. [PMID: 39847765 PMCID: PMC11803324 DOI: 10.2196/56941] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 07/12/2024] [Accepted: 10/23/2024] [Indexed: 01/25/2025] Open
Abstract
BACKGROUND The internet is a key source of health information, but the quality of content from popular search engines varies, posing challenges for users-especially those with low health or digital health literacy. To address this, the "tala-med" search engine was developed in 2020 to provide access to high-quality, evidence-based content. It prioritizes German health websites based on trustworthiness, recency, user-friendliness, and comprehensibility, offering category-based filters while ensuring privacy by avoiding data collection and advertisements. OBJECTIVE This study aims to evaluate the acceptance and usability of this independent, noncommercial search engine from the users' perspectives and their actual use of the search engine. METHODS For the questionnaire study, a cross-sectional study design was used. In total, 802 participants were recruited through a web-based panel and were asked to interact with the new search engine before completing a web-based questionnaire. Descriptive statistics and multiple regression analyses were used to assess participants' acceptance and usability ratings, as well as predictors of acceptance. Furthermore, from October 2020 to June 2021, we used the open-source web analytics platform Matomo to collect behavior-tracking data from consenting users of the search engine. RESULTS The study indicated positive findings on the acceptance and usability of the search engine, with more than half of the participants willing to reuse (465/802, 58%) and recommend it (507/802, 63.2%). Of the 802 users, 747 (93.1%) valued the absence of advertising. Furthermore, 92.3% (518/561), 93.9% (553/589), 94.7% (567/599), and 96.5% (600/622) of those users who used the filters agreed at least partially that the filter functions were helpful in finding trustworthy, recent, user-friendly, or comprehensible results. 
Participants criticized some of the search results regarding the selection of domains and shared ideas for potential improvements (eg, for a clearer design). Regression analyses showed that the search engine was especially well accepted among older users, frequent internet users, and those with lower educational levels, indicating an effective targeting of segments of the population with lower health literacy and digital health literacy. Tracking data analysis revealed 1631 sessions, comprising 3090 searches across 1984 unique terms. Users performed 1.64 (SD 1.31) searches per visit on average. They prioritized the search terms "corona," "back pain," and "cough." Filter changes were common, especially for recency and trustworthiness, reflecting the importance that users placed on these criteria. CONCLUSIONS User questionnaires and behavior tracking showed the platform was well received, particularly by older and less educated users, especially for its advertisement-free design and filtering system. While feedback highlighted areas for improvement in design and filter functionality, the search engine's focus on transparency, evidence-based content, and user privacy shows promise in addressing health literacy and navigational needs. Future updates and research will further refine its effectiveness and impact on promoting access to quality health information.
Affiliation(s)
- Lisa Specht
- Institute of General Practice, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Raphael Scheible
- Institute of Artificial Intelligence and Informatics in Medicine, Chair of Medical Informatics, University Hospital rechts der Isar, School of Medicine and Health, Technical University of Munich, Munich, Germany
- Martin Boeker
- Institute of Artificial Intelligence and Informatics in Medicine, Chair of Medical Informatics, University Hospital rechts der Isar, School of Medicine and Health, Technical University of Munich, Munich, Germany
- Erik Farin-Glattacker
- Institute of Medical Biometry and Statistics, Section of Health Care Research and Rehabilitation Research, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Nikolas Kampel
- Institute of Neuroscience and Medicine, Jülich Aachen Research Alliance, Forschungszentrum Jülich GmbH, Jülich, Germany
- Marina Schmölz
- Institute of General Practice, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Andrea Schöpf-Lazzarino
- Institute of General Practice, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Careum School of Health, part of the Kalaidos University of Applied Sciences, Zurich, Switzerland
- Stefan Schulz
- Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria
- Christian Schlett
- Institute of Medical Biometry and Statistics, Section of Health Care Research and Rehabilitation Research, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Fabian Thomczyk
- Data Integration Center, University of Freiburg, Freiburg, Germany
- Sebastian Voigt-Radloff
- Institute of Medical Biometry and Statistics, Section of Health Care Research and Rehabilitation Research, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
- Constanze Wegner
- Cochrane Germany, Cochrane Germany Foundation, Freiburg, Germany
- Andy Maun
- Institute of General Practice, Faculty of Medicine and Medical Center, University of Freiburg, Freiburg, Germany
4
Diaz Ochoa JG, Mustafa FE, Weil F, Wang Y, Kama K, Knott M. The aluminum standard: using generative Artificial Intelligence tools to synthesize and annotate non-structured patient data. BMC Med Inform Decis Mak 2024; 24:409. [PMID: 39732668 DOI: 10.1186/s12911-024-02825-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/27/2023] [Accepted: 12/16/2024] [Indexed: 12/30/2024] Open
Abstract
BACKGROUND Medical narratives are fundamental to the correct identification of a patient's health condition. This is not only because they describe the patient's situation; they also contain relevant information about the patient's context and health state evolution. Narratives are usually vague and cannot be categorized easily. On the other hand, once the patient's situation is correctly identified based on a narrative, it is then possible to map the patient's situation into precise classification schemas and ontologies that are machine-readable. To this end, language models can be trained to read and extract elements from these narratives. However, the main problem is the lack of data for model identification and model training in languages other than English. First, gold standard annotations are usually not available due to the high level of data protection for patient data. Second, gold standard annotations (if available) are difficult to access. Alternative available data, like MIMIC (Sci Data 3:1, 2016), are written in English and cover specific patient conditions like intensive care. Thus, when model training is required for other types of patients, like oncology (and not intensive care), this could lead to bias. To facilitate clinical narrative model training, a method for creating high-quality synthetic narratives is needed. METHOD We devised workflows based on generative AI methods to synthesize narratives in the German language to avoid the disclosure of patients' health data. Since we required highly realistic narratives, we generated prompts, written with high-quality medical terminology, asking for clinical narratives containing both a main and co-disease. The frequency distribution of both the main and co-disease was extracted from the hospital's structured data, such that the synthetic narratives reflect the disease distribution among the patient cohort.
In order to validate the quality of the synthetic narratives, we annotated them to train a Named Entity Recognition (NER) algorithm. According to our assumptions, the validation of this system implies that the synthesized data used for its training are of acceptable quality. RESULT We report precision, recall and F1 score for the NER model while also considering metrics that take into account both exact and partial entity matches. Trained models are cautious, with a precision up to 0.8 for the Entity Type match metric and an F1 score of 0.3. CONCLUSION Despite its inherent limitations, this technology has the potential to allow data interoperability by using encoded diseases across languages and regions without compromising data safety. Additionally, it facilitates the synthesis of unstructured patient data. In this way, the identification and training of models can be accelerated. We believe that this method may be able to generate discharge letters for any combination of main and co-diseases, which will significantly reduce the amount of time spent writing these letters by healthcare professionals.
Grants
- BW1_1456 (AI4MedCode) Ministry for Economics, Labor and Tourism from Baden-Württemberg, Germany
Affiliation(s)
- Juan G Diaz Ochoa
- QuiBiQ GmbH, Heßbrühlstr. 11, Stuttgart, D-70565, Germany.
- PerMediQ GmbH, Salzbergweg 18, Wang, D-85368, Germany.
- Felix Weil
- QuiBiQ GmbH, Heßbrühlstr. 11, Stuttgart, D-70565, Germany
- Yi Wang
- Analytic Computing Department, University of Stuttgart, Institute for Artificial Intelligence, Universitätsstraße 32, Stuttgart, D-70569, Germany
- Kudret Kama
- Klinikum Stuttgart, Stuttgart Cancer Center - Tumorzentrum Eva Mayr-Stihl DE, Kriegsbergstraße 60, Stuttgart, D-70174, Germany
- Markus Knott
- Klinikum Stuttgart, Stuttgart Cancer Center - Tumorzentrum Eva Mayr-Stihl DE, Kriegsbergstraße 60, Stuttgart, D-70174, Germany
5
Cho HN, Jun TJ, Kim YH, Kang H, Ahn I, Gwon H, Kim Y, Seo J, Choi H, Kim M, Han J, Kee G, Park S, Ko S. Task-Specific Transformer-Based Language Models in Health Care: Scoping Review. JMIR Med Inform 2024; 12:e49724. [PMID: 39556827 PMCID: PMC11612605 DOI: 10.2196/49724] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2023] [Revised: 07/10/2023] [Accepted: 10/21/2024] [Indexed: 11/20/2024] Open
Abstract
BACKGROUND Transformer-based language models have shown great potential to revolutionize health care by advancing clinical decision support, patient interaction, and disease prediction. However, despite their rapid development, the implementation of transformer-based language models in health care settings remains limited. This is partly due to the lack of a comprehensive review, which hinders a systematic understanding of their applications and limitations. Without clear guidelines and consolidated information, both researchers and physicians face difficulties in using these models effectively, resulting in inefficient research efforts and slow integration into clinical workflows. OBJECTIVE This scoping review addresses this gap by examining studies on medical transformer-based language models and categorizing them into 6 tasks: dialogue generation, question answering, summarization, text classification, sentiment analysis, and named entity recognition. METHODS We conducted a scoping review following the Cochrane scoping review protocol. A comprehensive literature search was performed across databases, including Google Scholar and PubMed, covering publications from January 2017 to September 2024. Studies involving transformer-derived models in medical tasks were included. Data were categorized into 6 key tasks. RESULTS Our key findings revealed both advancements and critical challenges in applying transformer-based models to health care tasks. For example, models like MedPIR involving dialogue generation show promise but face privacy and ethical concerns, while question-answering models like BioBERT improve accuracy but struggle with the complexity of medical terminology. The BioBERTSum summarization model aids clinicians by condensing medical texts but needs better handling of long sequences. 
CONCLUSIONS This review attempted to provide a consolidated understanding of the role of transformer-based language models in health care and to guide future research directions. By addressing current challenges and exploring the potential for real-world applications, we envision significant improvements in health care informatics. Addressing the identified challenges and implementing proposed solutions can enable transformer-based language models to significantly improve health care delivery and patient outcomes. Our review provides valuable insights for future research and practical applications, setting the stage for transformative advancements in medical informatics.
Affiliation(s)
- Ha Na Cho
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Tae Joon Jun
- Big Data Research Center, Asan Institute for Life Sciences, Asan Medical Center, Seoul, Republic of Korea
- Young-Hak Kim
- Division of Cardiology, Department of Information Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Heejun Kang
- Division of Cardiology, Asan Medical Center, Seoul, Republic of Korea
- Imjin Ahn
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Hansle Gwon
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Yunha Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Jiahn Seo
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Heejung Choi
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Minkyoung Kim
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Jiye Han
- Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Gaeun Kee
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Seohyun Park
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
- Soyoung Ko
- Department of Information Medicine, Asan Medical Center, Seoul, Republic of Korea
6
Madan S, Kühnel L, Fröhlich H, Hofmann-Apitius M, Fluck J. Dataset of miRNA-disease relations extracted from textual data using transformer-based neural networks. Database (Oxford) 2024; 2024:baae066. [PMID: 39104284 PMCID: PMC11300841 DOI: 10.1093/database/baae066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2024] [Revised: 06/23/2024] [Accepted: 07/10/2024] [Indexed: 08/07/2024]
Abstract
MicroRNAs (miRNAs) play important roles in post-transcriptional processes and regulate major cellular functions. The abnormal regulation of expression of miRNAs has been linked to numerous human diseases such as respiratory diseases, cancer, and neurodegenerative diseases. Latest miRNA-disease associations are predominantly found in unstructured biomedical literature. Retrieving these associations manually can be cumbersome and time-consuming due to the continuously expanding number of publications. We propose a deep learning-based text mining approach that extracts normalized miRNA-disease associations from biomedical literature. To train the deep learning models, we build a new training corpus that is extended by distant supervision utilizing multiple external databases. A quantitative evaluation shows that the workflow achieves an area under receiver operator characteristic curve of 98% on a holdout test set for the detection of miRNA-disease associations. We demonstrate the applicability of the approach by extracting new miRNA-disease associations from biomedical literature (PubMed and PubMed Central). We have shown through quantitative analysis and evaluation on three different neurodegenerative diseases that our approach can effectively extract miRNA-disease associations not yet available in public databases. Database URL: https://zenodo.org/records/10523046.
Affiliation(s)
- Sumit Madan
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53757 Sankt Augustin, Germany
- Lisa Kühnel
- Knowledge Management, German National Library of Medicine (ZB MED)—Information Centre for Life Sciences, Friedrich-Hirzebruch-Allee 4, Bonn 53115, Germany
- Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Faculty of Technology, Bielefeld University, Postfach 10 01 31, Bielefeld, Nordrhein-Westfalen 33501, Germany
- Holger Fröhlich
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53757 Sankt Augustin, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Friedrich-Hirzebruch-Allee 6, Bonn 53113, Germany
- Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, 53757 Sankt Augustin, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Friedrich-Hirzebruch-Allee 6, Bonn 53113, Germany
- Juliane Fluck
- Knowledge Management, German National Library of Medicine (ZB MED)—Information Centre for Life Sciences, Friedrich-Hirzebruch-Allee 4, Bonn 53115, Germany
- Graduate School DILS, Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Faculty of Technology, Bielefeld University, Postfach 10 01 31, Bielefeld, Nordrhein-Westfalen 33501, Germany
- Information management, Institute of Geodesy and Geoinformation, University of Bonn, Katzenburgweg 1a, Bonn 53115, Germany
7
Madan S, Lentzen M, Brandt J, Rueckert D, Hofmann-Apitius M, Fröhlich H. Transformer models in biomedicine. BMC Med Inform Decis Mak 2024; 24:214. [PMID: 39075407 PMCID: PMC11287876 DOI: 10.1186/s12911-024-02600-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2023] [Accepted: 07/08/2024] [Indexed: 07/31/2024] Open
Abstract
Deep neural networks (DNN) have fundamentally revolutionized the artificial intelligence (AI) field. The transformer model is a type of DNN that was originally used for the natural language processing tasks and has since gained more and more attention for processing various kinds of sequential data, including biological sequences and structured electronic health records. Along with this development, transformer-based models such as BioBERT, MedBERT, and MassGenie have been trained and deployed by researchers to answer various scientific questions originating in the biomedical domain. In this paper, we review the development and application of transformer models for analyzing various biomedical-related datasets such as biomedical textual data, protein sequences, medical structured-longitudinal data, and biomedical images as well as graphs. Also, we look at explainable AI strategies that help to comprehend the predictions of transformer-based models. Finally, we discuss the limitations and challenges of current models, and point out emerging novel research directions.
Affiliation(s)
- Sumit Madan
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany.
- Institute of Computer Science, University of Bonn, Bonn, 53115, Germany.
- Manuel Lentzen
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
- Johannes Brandt
- School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany
- Daniel Rueckert
- School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany
- School of Computation, Information and Technology, Technical University Munich, Munich, Germany
- Department of Computing, Imperial College London, London, UK
- Martin Hofmann-Apitius
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
- Holger Fröhlich
- Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany.
- Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany.
8
Denecke K, May R, Rivera Romero O. Potential of Large Language Models in Health Care: Delphi Study. J Med Internet Res 2024; 26:e52399. [PMID: 38739445 PMCID: PMC11130776 DOI: 10.2196/52399] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Revised: 10/10/2023] [Accepted: 04/19/2024] [Indexed: 05/14/2024] Open
Abstract
BACKGROUND: A large language model (LLM) is a machine learning model inferred from text data that captures subtle patterns of language use in context. Modern LLMs are based on neural network architectures that incorporate transformer methods. They allow the model to relate words together through attention to multiple words in a text sequence. LLMs have been shown to be highly effective for a range of tasks in natural language processing (NLP), including classification and information extraction tasks and generative applications.

OBJECTIVE: The aim of this adapted Delphi study was to collect researchers' opinions on how LLMs might influence health care and on the strengths, weaknesses, opportunities, and threats of LLM use in health care.

METHODS: We invited researchers in the fields of health informatics, nursing informatics, and medical NLP to share their opinions on LLM use in health care. We started the first round with open questions based on our strengths, weaknesses, opportunities, and threats framework. In the second and third round, the participants scored these items.

RESULTS: The first, second, and third rounds had 28, 23, and 21 participants, respectively. Almost all participants (26/28, 93% in round 1 and 20/21, 95% in round 3) were affiliated with academic institutions. Agreement was reached on 103 items related to use cases, benefits, risks, reliability, adoption aspects, and the future of LLMs in health care. Participants offered several use cases, including supporting clinical tasks, documentation tasks, and medical research and education, and agreed that LLM-based systems will act as health assistants for patient education. The agreed-upon benefits included increased efficiency in data handling and extraction, improved automation of processes, improved quality of health care services and overall health outcomes, provision of personalized care, accelerated diagnosis and treatment processes, and improved interaction between patients and health care professionals. In total, 5 risks to health care in general were identified: cybersecurity breaches, the potential for patient misinformation, ethical concerns, the likelihood of biased decision-making, and the risk associated with inaccurate communication. Overconfidence in LLM-based systems was recognized as a risk to the medical profession. The 6 agreed-upon privacy risks included the use of unregulated cloud services that compromise data security, exposure of sensitive patient data, breaches of confidentiality, fraudulent use of information, vulnerabilities in data storage and communication, and inappropriate access or use of patient data.

CONCLUSIONS: Future research related to LLMs should not only focus on testing their possibilities for NLP-related tasks but also consider the workflows the models could contribute to and the requirements regarding quality, integration, and regulations needed for successful implementation in practice.
Affiliation(s)
- Richard May
  - Harz University of Applied Sciences, Wernigerode, Germany
- Octavio Rivera Romero
  - Instituto de Ingeniería Informática (I3US), Universidad de Sevilla, Sevilla, Spain
  - Department of Electronic Technology, Universidad de Sevilla, Sevilla, Spain
9
Grouin C, Grabar N, Section Editors for the IMIA Yearbook Section on Natural Language Processing. Year 2022 in Medical Natural Language Processing: Availability of Language Models as a Step in the Democratization of NLP in the Biomedical Area. Yearb Med Inform 2023; 32:244-252. [PMID: 38147866 PMCID: PMC10751107 DOI: 10.1055/s-0043-1768752] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 12/28/2023]
Abstract
OBJECTIVES To analyse the content of publications within the medical Natural Language Processing (NLP) domain in 2022. METHODS Automatic and manual preselection of publications to be reviewed, and selection of the best NLP papers of the year. Analysis of the important issues. RESULTS Three best papers were selected. We also propose an analysis of the content of the NLP publications in 2022, emphasizing some of the topics. CONCLUSION The main trend in 2022 is certainly related to the availability of large language models, especially those based on Transformers, and to their use by non-NLP researchers. This leads to the democratization of NLP methods. We also observe renewed interest in languages other than English, the continuation of research on information extraction and prediction, the massive use of data from social media, and the consideration of the needs and interests of patients.
Affiliation(s)
- Cyril Grouin
  - Université Paris Saclay, CNRS, Laboratoire Interdisciplinaire des Sciences du Numérique, 91400 Orsay, France
- Natalia Grabar
  - UMR8163 STL, CNRS, Université de Lille, Domaine du Pont-de-bois, 59653 Villeneuve-d'Ascq cedex, France
10
Richter-Pechanski P, Wiesenbach P, Schwab DM, Kiriakou C, He M, Allers MM, Tiefenbacher AS, Kunz N, Martynova A, Spiller N, Mierisch J, Borchert F, Schwind C, Frey N, Dieterich C, Geis NA. A distributable German clinical corpus containing cardiovascular clinical routine doctor's letters. Sci Data 2023; 10:207. [PMID: 37059736 PMCID: PMC10104831 DOI: 10.1038/s41597-023-02128-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/06/2022] [Accepted: 03/31/2023] [Indexed: 04/16/2023]
Abstract
We present CARDIO:DE, the first freely available and distributable large German clinical corpus from the cardiovascular domain. CARDIO:DE encompasses 500 clinical routine German doctor's letters from Heidelberg University Hospital, which were manually annotated. Our prospective study design complies well with current data protection regulations and allows us to keep the original structure of clinical documents consistent. In order to ease access to our corpus, we manually de-identified all letters. To enable various information extraction tasks the temporal information in the documents was preserved. We added two high-quality manual annotation layers to CARDIO:DE, (1) medication information and (2) CDA-compliant section classes. To the best of our knowledge, CARDIO:DE is the first freely available and distributable German clinical corpus in the cardiovascular domain. In summary, our corpus offers unique opportunities for collaborative and reproducible research on natural language processing models for German clinical texts.
Affiliation(s)
- Phillip Richter-Pechanski
  - Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
  - Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
  - German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim, Heidelberg, DE, Germany
  - Informatics for Life, Heidelberg, DE, Germany
- Philipp Wiesenbach
  - Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
  - Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
  - Informatics for Life, Heidelberg, DE, Germany
- Dominic M Schwab
  - Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Christina Kiriakou
  - Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Mingyang He
  - Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
  - Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Michael M Allers
  - Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Anna S Tiefenbacher
  - Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Nicola Kunz
  - Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Anna Martynova
  - Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Noemie Spiller
  - Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Julian Mierisch
  - Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
- Florian Borchert
  - Digital Health Center, Hasso Plattner Institute, University of Potsdam, Potsdam, DE, Germany
- Charlotte Schwind
  - Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
  - Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
- Norbert Frey
  - Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
  - German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim, Heidelberg, DE, Germany
  - Informatics for Life, Heidelberg, DE, Germany
- Christoph Dieterich
  - Section of Bioinformatics and Systems Cardiology, Klaus Tschira Institute for Integrative Computational Cardiology, Heidelberg, DE, Germany
  - Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
  - German Center for Cardiovascular Research (DZHK) - Partner site Heidelberg/Mannheim, Heidelberg, DE, Germany
  - Informatics for Life, Heidelberg, DE, Germany
- Nicolas A Geis
  - Department of Internal Medicine III, University Hospital Heidelberg, Heidelberg, DE, Germany
  - Informatics for Life, Heidelberg, DE, Germany