1
Yuan H, Zhu M, Yang R, Liu H, Li I, Hong C. Rethinking Domain-Specific Pretraining by Supervised or Self-Supervised Learning for Chest Radiograph Classification: A Comparative Study Against ImageNet Counterparts in Cold-Start Active Learning. Health Care Science 2025; 4:110-143. [PMID: 40241982] [PMCID: PMC11997468] [DOI: 10.1002/hcs2.70009]
Abstract
Objective Deep learning (DL) has become the prevailing method in chest radiograph analysis, yet its performance heavily depends on large quantities of annotated images. To mitigate the cost, cold-start active learning (AL), comprising an initialization followed by subsequent learning, selects a small subset of informative data points for labeling. Recent advancements in pretrained models by supervised or self-supervised learning tailored to chest radiographs have shown broad applicability to diverse downstream tasks. However, their potential in cold-start AL remains unexplored. Methods To validate the efficacy of domain-specific pretraining, we compared two foundation models, supervised TXRV and self-supervised REMEDIS, with their general-domain counterparts pretrained on ImageNet. Model performance was evaluated at both initialization and subsequent learning stages on two diagnostic tasks: pediatric pneumonia and COVID-19. For initialization, we assessed their integration with three strategies: diversity, uncertainty, and hybrid sampling. For subsequent learning, we focused on uncertainty sampling powered by different pretrained models. We also conducted statistical tests to compare the foundation models with their ImageNet counterparts, investigate the relationship between initialization and subsequent learning, examine the performance of one-shot initialization against the full AL process, and assess the influence of class balance in initialization samples on both initialization and subsequent learning. Results First, domain-specific foundation models failed to outperform their ImageNet counterparts in six out of eight experiments on informative sample selection. Both domain-specific and general pretrained models were unable to generate representations that could substitute for the original images as model inputs in seven of the eight scenarios. However, pretrained model-based initialization surpassed random sampling, the default approach in cold-start AL. Second, initialization performance was positively correlated with subsequent learning performance, highlighting the importance of initialization strategies. Third, one-shot initialization performed comparably to the full AL process, demonstrating the potential of reducing experts' repeated waiting during AL iterations. Last, a U-shaped correlation was observed between the class balance of initialization samples and model performance, suggesting that class balance is more strongly associated with performance at middle budget levels than at low or high budgets. Conclusions In this study, we highlighted the limitations of medical pretraining compared to general pretraining in the context of cold-start AL. We also identified promising outcomes related to cold-start AL, including initialization based on pretrained models, the positive influence of initialization on subsequent learning, the potential for one-shot initialization, and the influence of class balance on middle-budget AL. Researchers are encouraged to improve medical pretraining for versatile DL foundations and explore novel AL methods.
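For readers who want to see what the initialization strategies above amount to in code, here is a minimal sketch of diversity and uncertainty sampling, assuming embeddings and class probabilities from a pretrained backbone have already been computed; the function names and budget are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.cluster import KMeans

def diversity_sampling(embeddings: np.ndarray, budget: int, seed: int = 0) -> np.ndarray:
    """Pick the unlabeled sample closest to each k-means centroid (diversity-based init)."""
    km = KMeans(n_clusters=budget, random_state=seed, n_init=10).fit(embeddings)
    chosen = [int(np.argmin(np.linalg.norm(embeddings - c, axis=1)))
              for c in km.cluster_centers_]
    return np.unique(chosen)  # duplicates (rare ties) are dropped

def uncertainty_sampling(probs: np.ndarray, budget: int) -> np.ndarray:
    """Pick the samples with the highest predictive entropy (uncertainty-based init)."""
    entropy = -np.sum(probs * np.log(probs + 1e-12), axis=1)
    return np.argsort(entropy)[-budget:]

# Usage: `embeddings` (n_samples, dim) and `probs` (n_samples, n_classes) would come
# from the pretrained model; the selected indices are the images sent for expert labeling.
```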
Affiliation(s)
- Han Yuan
- Duke-NUS Medical School, Centre for Quantitative Medicine, Singapore, Singapore
- Mingcheng Zhu
- Duke-NUS Medical School, Centre for Quantitative Medicine, Singapore, Singapore
- Department of Engineering Science, University of Oxford, Oxford, UK
- Rui Yang
- Duke-NUS Medical School, Centre for Quantitative Medicine, Singapore, Singapore
- Han Liu
- Department of Computer Science, Vanderbilt University, Nashville, Tennessee, USA
- Irene Li
- Information Technology Center, University of Tokyo, Bunkyo-ku, Japan
- Chuan Hong
- Department of Biostatistics and Bioinformatics, Duke University, Durham, North Carolina, USA
2
Farrow L, Raja A, Zhong M, Anderson L. A systematic review of natural language processing applications in Trauma & Orthopaedics. Bone Jt Open 2025; 6:264-274. [PMID: 40037398] [PMCID: PMC11879473] [DOI: 10.1302/2633-1462.63.bjo-2024-0081.r1]
Abstract
Aims Prevalence of artificial intelligence (AI) algorithms within the Trauma & Orthopaedics (T&O) literature has greatly increased over the last ten years. One increasingly explored aspect of AI is the automated interpretation of free-text data often prevalent in electronic medical records (known as natural language processing (NLP)). We set out to review the current evidence for applications of NLP methodology in T&O, including assessment of study design and reporting. Methods MEDLINE, Allied and Complementary Medicine (AMED), Excerpta Medica Database (EMBASE), and Cochrane Central Register of Controlled Trials (CENTRAL) were screened for studies pertaining to NLP in T&O from database inception to 31 December 2023. An additional grey literature search was performed. NLP quality assessment followed the criteria outlined by Farrow et al in 2021 with two independent reviewers (classification as absent, incomplete, or complete). Reporting was performed according to the Synthesis-Without Meta-Analysis (SWiM) guidelines. The review protocol was registered on the Prospective Register of Systematic Reviews (PROSPERO; registration no. CRD42022291714). Results The final review included 31 articles (published between 2012 and 2021). The most common subspeciality areas included trauma, arthroplasty, and spine; 13% (4/31) related to online reviews/social media, 42% (13/31) to clinical notes/operation notes, 42% (13/31) to radiology reports, and 3% (1/31) to systematic review. According to the reporting criteria, 16% (5/31) were considered good quality, 74% (23/31) average quality, and 6% (2/31) poor quality. The most commonly absent reporting criteria were evaluation of missing data (26/31), sample size calculation (31/31), and external validation of the study results (29/31 papers). Code and data availability were also poorly documented in most studies. Conclusion Application of NLP is becoming increasingly common in T&O; however, published article quality is mixed, with few high-quality studies. There are key consistent deficiencies in published work relating to NLP which ultimately influence the potential for clinical application. Open science is an important part of research transparency that should be encouraged in NLP algorithm development and reporting.
Affiliation(s)
- Luke Farrow
- Institute of Applied Health Sciences, University of Aberdeen, Aberdeen, UK
- Grampian Orthopaedics, Aberdeen Royal Infirmary, Aberdeen, UK
- Arslan Raja
- School of Medicine, University of Edinburgh, Edinburgh, UK
- Mingjun Zhong
- Institute of Applied Health Sciences, University of Aberdeen, Aberdeen, UK
- Lesley Anderson
- Institute of Applied Health Sciences, University of Aberdeen, Aberdeen, UK
3
Hurtado LF, Marco-Ruiz L, Segarra E, Castro-Bleda MJ, Bustos-Moreno A, Iglesia-Vayá MDL, Vallalta-Rueda JF. Leveraging Transformers-based models and linked data for deep phenotyping in radiology. Computer Methods and Programs in Biomedicine 2025; 260:108567. [PMID: 39787917] [DOI: 10.1016/j.cmpb.2024.108567]
Abstract
BACKGROUND AND OBJECTIVE Despite significant investments in the normalization and standardization of Electronic Health Records (EHRs), free text is still the rule rather than the exception in clinical notes. The use of free text has implications for the data reuse methods that support clinical research, since the query mechanisms used in cohort definition and patient matching are mainly based on structured data and clinical terminologies. This study aims to develop a method for the secondary use of clinical text by: (a) using Natural Language Processing (NLP) to tag clinical notes with biomedical terminology; and (b) designing an ontology that maps and classifies all the identified tags to various terminologies and allows phenotyping queries to be run. METHODS AND RESULTS Transformer-based NLP models, specifically pre-trained RoBERTa language models, were used to process radiology reports and annotate them, identifying elements matching UMLS Concept Unique Identifier (CUI) definitions. CUIs were mapped into several biomedical ontologies useful for phenotyping (e.g., SNOMED-CT, HPO, ICD-10, FMA, LOINC, and ICPC2, among others) and represented as a lightweight ontology using OWL (Web Ontology Language) constructs. This process resulted in a Linked Knowledge Base (LKB), which allows expressive queries to be run that retrieve reports complying with specific criteria using automatic reasoning. CONCLUSION Although phenotyping tools mostly rely on relational databases, the combination of NLP and Linked Data technologies allows us to build scalable knowledge bases using standard ontologies from the Web of Data. Our approach enables a pipeline whose input is free text and which automatically maps identified entities to an LKB that can answer phenotyping queries. In this work, we used only Spanish radiology reports, although the approach is extensible to other languages for which suitable corpora are available. This is particularly valuable for regional and national systems dealing with large research databases from different registries and cohorts, and it plays an essential role in the scalability of large data reuse infrastructures that require indexing and governing distributed data sources.
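To make the Linked Knowledge Base idea concrete, here is a toy rdflib sketch of annotating a report with a UMLS CUI, mapping it to other terminologies, and running a phenotyping query; the namespaces, property names, and mappings are placeholders, not the study's actual LKB schema.

```python
from rdflib import Graph, Namespace, Literal, RDFS

EX = Namespace("http://example.org/lkb/")     # placeholder namespace
UMLS = Namespace("http://example.org/umls/")  # placeholder namespace

g = Graph()
# A report annotation pointing to a UMLS CUI, mapped to SNOMED CT and ICD-10 (illustrative codes).
g.add((EX.report42, EX.mentions, UMLS.C0032285))
g.add((UMLS.C0032285, RDFS.label, Literal("Pneumonia")))
g.add((UMLS.C0032285, EX.mappedTo, Literal("SNOMED-CT:233604007")))
g.add((UMLS.C0032285, EX.mappedTo, Literal("ICD-10:J18.9")))

# Phenotyping query: which reports mention a concept mapped to a given ICD-10 code?
q = """
SELECT ?report WHERE {
  ?report <http://example.org/lkb/mentions> ?cui .
  ?cui <http://example.org/lkb/mappedTo> "ICD-10:J18.9" .
}
"""
for row in g.query(q):
    print(row.report)
```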
Affiliation(s)
- Lluís-F Hurtado
- VRAIN: Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València, Camí de Vera s/n, València, 46020, Spain; ValgrAI: Valencian Graduate School and Research Network of Artificial Intelligence, Camí de Vera s/n, València, 46020, Spain
- Luis Marco-Ruiz
- Norwegian Centre for E-health Research, University Hospital of North Norway, P.O. Box 35, Tromsø, N-9038, Norway.
- Encarna Segarra
- VRAIN: Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València, Camí de Vera s/n, València, 46020, Spain; ValgrAI: Valencian Graduate School and Research Network of Artificial Intelligence, Camí de Vera s/n, València, 46020, Spain
- Maria Jose Castro-Bleda
- VRAIN: Valencian Research Institute for Artificial Intelligence, Universitat Politècnica de València, Camí de Vera s/n, València, 46020, Spain; ValgrAI: Valencian Graduate School and Research Network of Artificial Intelligence, Camí de Vera s/n, València, 46020, Spain
- Maria de la Iglesia-Vayá
- Foundation for the Promotion of the Research in Healthcare and Biomedicine (FISABIO), Avda. de Catalunya, 21, València, 46020, Spain
4
Musbahi O, Nurek M, Pouris K, Vella-Baldacchino M, Bottle A, Hing C, Kostopoulou O, Cobb JP, Jones GG. Can ChatGPT make surgical decisions with confidence similar to experienced knee surgeons? Knee 2024; 51:120-129. [PMID: 39255525] [DOI: 10.1016/j.knee.2024.08.015]
Abstract
BACKGROUND Unicompartmental knee replacements (UKRs) have become an increasingly attractive option for end-stage single-compartment knee osteoarthritis (OA). However, patient selection remains controversial. Natural language processing (NLP) is a form of artificial intelligence (AI). We aimed to determine whether general-purpose open-source natural language programs can make decisions regarding a patient's suitability for a total knee replacement (TKR) or a UKR, and how confident AI NLP programs are in surgical decision making. METHODS We conducted a case-based cohort study using data from a separate study, where participants (73 surgeons and AI NLP programs) were presented with 32 fictitious clinical case scenarios that simulated patients with predominantly medial knee OA who would require surgery. Using the overall UKR/TKR judgments of the 73 experienced knee surgeons as the gold standard reference, we calculated the sensitivity, specificity, and positive predictive value of AI NLP programs to identify whether a patient should undergo UKR. RESULTS There was disagreement between the surgeons and ChatGPT in only five scenarios (15.6%). With the 73 surgeons' decision as the gold standard, the sensitivity of ChatGPT in determining whether a patient should undergo UKR was 0.91 (95% confidence interval (CI): 0.71 to 0.98). The positive predictive value for ChatGPT was 0.87 (95% CI: 0.72 to 0.94). ChatGPT was more confident in its UKR decision making than the surgeons (surgeon mean confidence = 1.7, ChatGPT mean confidence = 2.4). CONCLUSIONS This study demonstrated that ChatGPT can make surgical decisions and that it exceeded the confidence of experienced knee surgeons, with substantial inter-rater agreement, when deciding whether a patient was most appropriate for a UKR.
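The agreement statistics quoted above come from a standard 2x2 table; the following sketch reproduces the calculation on toy labels (not the study's data), with the surgeons' decisions as the reference standard.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# 1 = UKR recommended, 0 = TKR recommended (illustrative toy labels, not study data)
surgeon_ref = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])
chatgpt_dec = np.array([1, 1, 0, 1, 1, 1, 0, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(surgeon_ref, chatgpt_dec).ravel()
sensitivity = tp / (tp + fn)   # proportion of reference UKR cases the model also called UKR
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)           # positive predictive value for the UKR call
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}, PPV={ppv:.2f}")
```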
Affiliation(s)
- Omar Musbahi
- MSk Lab, Sir Michael Uren Hub, Imperial College London, London, UK.
- Martine Nurek
- Department of Surgery and Cancer, Imperial College London, London, UK
- Kyriacos Pouris
- MSk Lab, Sir Michael Uren Hub, Imperial College London, London, UK
- Alex Bottle
- School of Public Health, Imperial College London, London, UK
- Caroline Hing
- St George's University Hospitals NHS Foundation Trust, London, UK
- Olga Kostopoulou
- Department of Surgery and Cancer, Imperial College London, London, UK; Institute of Global Health Innovation, Imperial College London, London, UK
- Justin P Cobb
- MSk Lab, Sir Michael Uren Hub, Imperial College London, London, UK
- Gareth G Jones
- MSk Lab, Sir Michael Uren Hub, Imperial College London, London, UK
5
Valente AS, Trunfio TA, Aiello M, Baldi D, Baldi M, Imbò S, Russo MA, Cavaliere C, Franzese M. Text mining approach for feature extraction and cartilage disease grade classification using knee MRI radiology reports. Comput Struct Biotechnol J 2024; 24:622-629. [PMID: 39963548] [PMCID: PMC11832019] [DOI: 10.1016/j.csbj.2024.10.003]
Abstract
MRI radiology reporting processes can be improved by exploiting structured and semantically labelled data that can be fed to artificial intelligence (AI) tools. AI-based tools assisting radiology reporting can help to automatically individuate cartilage grading in textual magnetic resonance imaging (MRI) reports, thus supporting clinicians' decisions regarding medical imaging utilisation, diagnosis and treatment. In this study, we extracted information (clinical findings, observations, anatomical regions, etc.) and classified knee cartilage degradation from medical reports utilising transfer-learning techniques applied to the Bidirectional Encoder Representations from Transformers (BERT) model and its variants, pre-trained on an Italian-language corpus. To realise this objective, we used a dataset of 750 MRI knee reports written by three radiologists who contributed to a manual annotation process to perform text classification (TC) and named entity recognition (NER) tasks. The dataset was obtained from an internal database of the IRCCS SYNLAB SDN. Seventy percent of the dataset was used for training, 10% was used for validation and 20% was used for testing. The best-performing configurations for NER and TC tasks were based on the pre-trained BERT model. The macro F1-scores obtained with the NER and TC models are 0.89 and 0.81, respectively. The accuracies calculated on the test set for both tasks are 0.96 and 0.99, respectively.
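A condensed sketch of the report-classification setup described above, using the Hugging Face Trainer with an Italian BERT checkpoint; the checkpoint name, label scheme, and hyperparameters are illustrative rather than the study's configuration.

```python
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          TrainingArguments, Trainer)

texts = ["Referto RM ginocchio ...", "Condropatia di grado III ..."]  # placeholder reports
labels = [0, 2]                                                       # placeholder cartilage grades

checkpoint = "dbmdz/bert-base-italian-xxl-cased"   # illustrative Italian BERT checkpoint
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=5)

ds = Dataset.from_dict({"text": texts, "label": labels})
ds = ds.map(lambda b: tok(b["text"], truncation=True, padding="max_length", max_length=256),
            batched=True)

args = TrainingArguments(output_dir="knee-tc", num_train_epochs=3,
                         per_device_train_batch_size=8, learning_rate=2e-5)
Trainer(model=model, args=args, train_dataset=ds).train()
```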
Affiliation(s)
- Teresa Angela Trunfio
- University of Naples Federico II, Department of Advanced Biomedical Sciences, Via Pansini, 5, 80131, Naples, Italy
- Marco Aiello
- IRCCS SYNLAB SDN, Via E. Gianturco, 113, 80143, Naples, Italy
- Dario Baldi
- IRCCS SYNLAB SDN, Via E. Gianturco, 113, 80143, Naples, Italy
- Marilena Baldi
- GESAN SRL, R&D Department, Via Torino, 14, 81020, San Nicola La Strada, Caserta, Italy
- Silvio Imbò
- GESAN SRL, R&D Department, Via Torino, 14, 81020, San Nicola La Strada, Caserta, Italy
- Carlo Cavaliere
- IRCCS SYNLAB SDN, Via E. Gianturco, 113, 80143, Naples, Italy
- Monica Franzese
- IRCCS SYNLAB SDN, Via E. Gianturco, 113, 80143, Naples, Italy
6
Ehrett C, Hegde S, Andre K, Liu D, Wilson T. Leveraging Open-Source Large Language Models for Data Augmentation in Hospital Staff Surveys: Mixed Methods Study. JMIR Medical Education 2024; 10:e51433. [PMID: 39560937] [PMCID: PMC11590755] [DOI: 10.2196/51433]
Abstract
Background Generative large language models (LLMs) have the potential to revolutionize medical education by generating tailored learning materials, enhancing teaching efficiency, and improving learner engagement. However, the application of LLMs in health care settings, particularly for augmenting small datasets in text classification tasks, remains underexplored, particularly for cost- and privacy-conscious applications that do not permit the use of third-party services such as OpenAI's ChatGPT. Objective This study aims to explore the use of open-source LLMs, such as Large Language Model Meta AI (LLaMA) and Alpaca models, for data augmentation in a specific text classification task related to hospital staff surveys. Methods The surveys were designed to elicit narratives of everyday adaptation by frontline radiology staff during the initial phase of the COVID-19 pandemic. A 2-step process of data augmentation and text classification was conducted. The study generated synthetic data similar to the survey reports using 4 generative LLMs for data augmentation. A different set of 3 classifier LLMs was then used to classify the augmented text for thematic categories. The study evaluated performance on the classification task. Results The overall best-performing combination of LLMs, temperature, classifier, and number of synthetic data cases is via augmentation with LLaMA 7B at temperature 0.7 with 100 augments, using Robustly Optimized BERT Pretraining Approach (RoBERTa) for the classification task, achieving an average area under the receiver operating characteristic (AUC) curve of 0.87 (SD 0.02; ie, 1 SD). The results demonstrate that open-source LLMs can enhance text classifiers' performance for small datasets in health care contexts, providing promising pathways for improving medical education processes and patient care practices. Conclusions The study demonstrates the value of data augmentation with open-source LLMs, highlights the importance of privacy and ethical considerations when using LLMs, and suggests future directions for research in this field.
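A compressed sketch of the augment-then-classify workflow described above, assuming a locally hosted open-source causal language model; the model identifier, prompt, and label handling are placeholders, not the study's exact setup.

```python
from transformers import pipeline

# Step 1: generate synthetic survey narratives with an open-source LLM
# (illustrative model id; any locally available causal LM would work).
generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")
seed = "During the first COVID-19 surge, our radiology team adapted by"
synthetic = [out["generated_text"]
             for out in generator(seed, num_return_sequences=5, do_sample=True,
                                  temperature=0.7, max_new_tokens=120)]

# Step 2: append the synthetic narratives, with the theme label inherited from the seed,
# to the small labeled set before fine-tuning a RoBERTa classifier (e.g., with the same
# Trainer recipe sketched for entry 5 above).
augmented_texts = synthetic                 # would be concatenated with the original labeled texts
augmented_labels = [1] * len(synthetic)     # placeholder theme label for the seed prompt
```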
Affiliation(s)
- Carl Ehrett
- Watt Family Innovation Center, Clemson University, Clemson, SC, United States
- Sudeep Hegde
- Department of Industrial Engineering, Clemson University, Clemson, SC, United States
- Kwame Andre
- Department of Computer Science, Clemson University, Clemson, SC, United States
- Dixizi Liu
- Department of Industrial Engineering, Clemson University, Clemson, SC, United States
- Timothy Wilson
- Department of Industrial Engineering, Clemson University, Clemson, SC, United States
7
Burke HB, Hoang A, Lopreiato JO, King H, Hemmer P, Montgomery M, Gagarin V. Assessing the Ability of a Large Language Model to Score Free-Text Medical Student Clinical Notes: Quantitative Study. JMIR Medical Education 2024; 10:e56342. [PMID: 39118469] [PMCID: PMC11327632] [DOI: 10.2196/56342]
Abstract
Background Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. Objective The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students' free-text history and physical notes. Methods This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. The students' notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. Results The study population consisted of 168 first-year medical students. There was a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was 86% lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002). Conclusions ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students' standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice.
Affiliation(s)
- Harry B Burke
- Uniformed Services University of the Health Sciences, Bethesda, MD, 20814, United States, 1 301-938-2212
- Albert Hoang
- Uniformed Services University of the Health Sciences, Bethesda, MD, 20814, United States, 1 301-938-2212
- Joseph O Lopreiato
- Uniformed Services University of the Health Sciences, Bethesda, MD, 20814, United States, 1 301-938-2212
- Heidi King
- Defense Health Agency, Falls Church, VA, United States
- Paul Hemmer
- Uniformed Services University of the Health Sciences, Bethesda, MD, 20814, United States, 1 301-938-2212
- Michael Montgomery
- Uniformed Services University of the Health Sciences, Bethesda, MD, 20814, United States, 1 301-938-2212
- Viktoria Gagarin
- Uniformed Services University of the Health Sciences, Bethesda, MD, 20814, United States, 1 301-938-2212
8
Calvo-Lorenzo I, Uriarte-Llano I. [Massive generation of synthetic medical records with ChatGPT: An example in hip fractures]. Med Clin (Barc) 2024; 162:549-554. [PMID: 38290872] [DOI: 10.1016/j.medcli.2023.11.027]
Affiliation(s)
- Isidoro Calvo-Lorenzo
- Servicio de Cirugía Ortopédica y Traumatología, Hospital Universitario Galdakao Usansolo, Galdakao, Vizcaya, España.
- Iker Uriarte-Llano
- Servicio de Cirugía Ortopédica y Traumatología, Hospital Universitario Galdakao Usansolo, Galdakao, Vizcaya, España
9
Gorenstein L, Konen E, Green M, Klang E. Bidirectional Encoder Representations from Transformers in Radiology: A Systematic Review of Natural Language Processing Applications. J Am Coll Radiol 2024; 21:914-941. [PMID: 38302036] [DOI: 10.1016/j.jacr.2024.01.012]
Abstract
INTRODUCTION Bidirectional Encoder Representations from Transformers (BERT), introduced in 2018, has revolutionized natural language processing. Its bidirectional understanding of word context has enabled innovative applications, notably in radiology. This study aimed to assess BERT's influence and applications within the radiologic domain. METHODS Adhering to Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, we conducted a systematic review, searching PubMed for literature on BERT-based models and natural language processing in radiology from January 1, 2018, to February 12, 2023. The search encompassed keywords related to generative models, transformer architecture, and various imaging techniques. RESULTS Of 597 results, 30 met our inclusion criteria. The remaining were unrelated to radiology or did not use BERT-based models. The included studies were retrospective, with 14 published in 2022. The primary focus was on classification and information extraction from radiology reports, with x-rays as the prevalent imaging modality. Specific investigations included automatic CT protocol assignment and deep learning applications in chest x-ray interpretation. CONCLUSION This review underscores the primary application of BERT in radiology for report classification. It also reveals emerging BERT applications for protocol assignment and report generation. As BERT technology advances, we foresee further innovative applications. Its implementation in radiology holds potential for enhancing diagnostic precision, expediting report generation, and optimizing patient care.
Affiliation(s)
- Larisa Gorenstein
- Department of Diagnostic Imaging, Sheba Medical Center, Ramat-Gan, Israel; Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel.
- Eli Konen
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel; Chair, Department of Diagnostic Imaging, Sheba Medical Center, Ramat-Gan, Israel
- Michael Green
- Department of Diagnostic Imaging, Sheba Medical Center, Ramat-Gan, Israel; Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel
- Eyal Klang
- Icahn School of Medicine at Mount Sinai, New York, New York; Innovation Center, Sheba Medical Center, Affiliated with Tel Aviv University, Tel Aviv, Israel
10
Hasani AM, Singh S, Zahergivar A, Ryan B, Nethala D, Bravomontenegro G, Mendhiratta N, Ball M, Farhadi F, Malayeri A. Evaluating the performance of Generative Pre-trained Transformer-4 (GPT-4) in standardizing radiology reports. Eur Radiol 2024; 34:3566-3574. [PMID: 37938381] [DOI: 10.1007/s00330-023-10384-x]
Abstract
OBJECTIVE Radiology reporting is an essential component of clinical diagnosis and decision-making. With the advent of advanced artificial intelligence (AI) models like GPT-4 (Generative Pre-trained Transformer 4), there is growing interest in evaluating their potential for optimizing or generating radiology reports. This study aimed to compare the quality and content of radiologist-generated and GPT-4 AI-generated radiology reports. METHODS A comparative study design was employed in the study, where a total of 100 anonymized radiology reports were randomly selected and analyzed. Each report was processed by GPT-4, resulting in the generation of a corresponding AI-generated report. Quantitative and qualitative analysis techniques were utilized to assess similarities and differences between the two sets of reports. RESULTS The AI-generated reports showed comparable quality to radiologist-generated reports in most categories. Significant differences were observed in clarity (p = 0.027), ease of understanding (p = 0.023), and structure (p = 0.050), favoring the AI-generated reports. AI-generated reports were more concise, with 34.53 fewer words and 174.22 fewer characters on average, but had greater variability in sentence length. Content similarity was high, with an average Cosine Similarity of 0.85, Sequence Matcher Similarity of 0.52, BLEU Score of 0.5008, and BERTScore F1 of 0.8775. CONCLUSION The results of this proof-of-concept study suggest that GPT-4 can be a reliable tool for generating standardized radiology reports, offering potential benefits such as improved efficiency, better communication, and simplified data extraction and analysis. However, limitations and ethical implications must be addressed to ensure the safe and effective implementation of this technology in clinical practice. CLINICAL RELEVANCE STATEMENT The findings of this study suggest that GPT-4 (Generative Pre-trained Transformer 4), an advanced AI model, has the potential to significantly contribute to the standardization and optimization of radiology reporting, offering improved efficiency and communication in clinical practice. KEY POINTS • Large language model-generated radiology reports exhibited high content similarity and moderate structural resemblance to radiologist-generated reports. • Performance metrics highlighted the strong matching of word selection and order, as well as high semantic similarity between AI and radiologist-generated reports. • Large language model demonstrated potential for generating standardized radiology reports, improving efficiency and communication in clinical settings.
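The similarity metrics reported above can all be computed with standard libraries; the sketch below runs them on a pair of toy report strings and does not reproduce the study's preprocessing.

```python
from difflib import SequenceMatcher
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

radiologist = "Mild cardiomegaly. No focal consolidation, effusion, or pneumothorax."
generated   = "Heart size mildly enlarged. No consolidation, pleural effusion, or pneumothorax."

# Cosine similarity on TF-IDF vectors
tfidf = TfidfVectorizer().fit_transform([radiologist, generated])
cos = cosine_similarity(tfidf[0], tfidf[1])[0, 0]

# Character-level sequence matching
seq = SequenceMatcher(None, radiologist, generated).ratio()

# BLEU on whitespace tokens (smoothed because the texts are short)
bleu = sentence_bleu([radiologist.split()], generated.split(),
                     smoothing_function=SmoothingFunction().method1)

print(f"cosine={cos:.2f}  sequence-matcher={seq:.2f}  BLEU={bleu:.2f}")
# BERTScore F1 could be added analogously with the bert-score package (bert_score.score).
```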
Affiliation(s)
- Amir M Hasani
- Laboratory of Translation Research, National Heart Blood Lung Institute, NIH, Bethesda, MD, USA
- Shiva Singh
- Radiology & Imaging Sciences Department, Clinical Center, NIH, Bethesda, MD, USA
- Aryan Zahergivar
- Radiology & Imaging Sciences Department, Clinical Center, NIH, Bethesda, MD, USA
- Beth Ryan
- Urology Oncology Branch, National Cancer Institute, NIH, Bethesda, MD, USA
- Daniel Nethala
- Urology Oncology Branch, National Cancer Institute, NIH, Bethesda, MD, USA
- Neil Mendhiratta
- Urology Oncology Branch, National Cancer Institute, NIH, Bethesda, MD, USA
- Mark Ball
- Urology Oncology Branch, National Cancer Institute, NIH, Bethesda, MD, USA
- Faraz Farhadi
- Radiology & Imaging Sciences Department, Clinical Center, NIH, Bethesda, MD, USA
- Ashkan Malayeri
- Radiology & Imaging Sciences Department, Clinical Center, NIH, Bethesda, MD, USA.
11
Flory MN, Napel S, Tsai EB. Artificial Intelligence in Radiology: Opportunities and Challenges. Semin Ultrasound CT MR 2024; 45:152-160. [PMID: 38403128] [DOI: 10.1053/j.sult.2024.02.004]
Abstract
Artificial intelligence's (AI) emergence in radiology elicits both excitement and uncertainty. AI holds promise for improving radiology with regards to clinical practice, education, and research opportunities. Yet, AI systems are trained on select datasets that can contain bias and inaccuracies. Radiologists must understand these limitations and engage with AI developers at every step of the process - from algorithm initiation and design to development and implementation - to maximize benefit and minimize harm that can be enabled by this technology.
Affiliation(s)
- Marta N Flory
- Department of Radiology, Stanford University School of Medicine, Center for Academic Medicine, Palo Alto, CA
- Sandy Napel
- Department of Radiology, Stanford University School of Medicine, Center for Academic Medicine, Palo Alto, CA
- Emily B Tsai
- Department of Radiology, Stanford University School of Medicine, Center for Academic Medicine, Palo Alto, CA.
12
Alzubaidi L, Salhi A, A.Fadhel M, Bai J, Hollman F, Italia K, Pareyon R, Albahri AS, Ouyang C, Santamaría J, Cutbush K, Gupta A, Abbosh A, Gu Y. Trustworthy deep learning framework for the detection of abnormalities in X-ray shoulder images. PLoS One 2024; 19:e0299545. [PMID: 38466693] [PMCID: PMC10927121] [DOI: 10.1371/journal.pone.0299545]
Abstract
Musculoskeletal conditions affect an estimated 1.7 billion people worldwide, causing intense pain and disability. These conditions lead to 30 million emergency room visits yearly, and the numbers are only increasing. However, diagnosing musculoskeletal issues can be challenging, especially in emergencies where quick decisions are necessary. Deep learning (DL) has shown promise in various medical applications. However, previous methods for detecting shoulder abnormalities on X-ray images have shown poor performance and a lack of transparency, owing to limited training data and inadequate feature representation. This often resulted in overfitting, poor generalisation, and potential bias in decision-making. To address these issues, a new trustworthy DL framework has been proposed to detect shoulder abnormalities (such as fractures, deformities, and arthritis) using X-ray images. The framework consists of two parts: same-domain transfer learning (TL) to mitigate the ImageNet domain mismatch, and feature fusion to reduce error rates and improve trust in the final result. Same-domain TL involves training pre-trained models on a large number of labelled X-ray images from various body parts and fine-tuning them on the target dataset of shoulder X-ray images. Feature fusion combines the features extracted by seven DL models to train several ML classifiers. The proposed framework achieved an excellent accuracy rate of 99.2%, an F1-score of 99.2%, and a Cohen's kappa of 98.5%. Furthermore, the results were validated using three visualisation tools: gradient-weighted class activation mapping (Grad-CAM), activation visualisation, and local interpretable model-agnostic explanations (LIME). The proposed framework outperformed previous DL methods and three orthopaedic surgeons invited to classify the test set, who obtained an average accuracy of 79.1%. The proposed framework has proven effective and robust, improving generalisation and increasing trust in the final results.
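A minimal sketch of the feature-fusion step described above: deep features from several backbones are concatenated and passed to a classical machine learning classifier. Two untrained ResNets stand in for the paper's seven fine-tuned models, so the shapes and flow, not the reported accuracy, are what is illustrated.

```python
import numpy as np
import torch
from torch import nn
from torchvision import models
from sklearn.svm import SVC

def pooled_features(backbone: nn.Module, images: torch.Tensor) -> np.ndarray:
    """Features from a CNN with its final classification layer removed."""
    trunk = nn.Sequential(*list(backbone.children())[:-1]).eval()
    with torch.no_grad():
        return trunk(images).flatten(1).numpy()

images = torch.randn(8, 3, 224, 224)        # placeholder batch of shoulder X-ray crops
labels = np.random.randint(0, 2, size=8)    # placeholder abnormal/normal labels

# Fuse features from two backbones (the paper fuses seven fine-tuned models).
fused = np.concatenate([pooled_features(models.resnet50(weights=None), images),
                        pooled_features(models.resnet18(weights=None), images)], axis=1)

clf = SVC(kernel="rbf").fit(fused, labels)  # a classical ML classifier on the fused features
```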
Affiliation(s)
- Laith Alzubaidi
- School of Mechanical, Medical, and Process Engineering, Queensland University of Technology, Brisbane, QLD, Australia
- Queensland Unit for Advanced Shoulder Research (QUASR)/ARC Industrial Transformation Training Centre—Joint Biomechanics, Queensland University of Technology, Brisbane, QLD, Australia
- Centre for Data Science, Queensland University of Technology, Brisbane, QLD, Australia
- Akunah Medical Technology Pty Ltd Company, Brisbane, QLD, Australia
- Asma Salhi
- Queensland Unit for Advanced Shoulder Research (QUASR)/ARC Industrial Transformation Training Centre—Joint Biomechanics, Queensland University of Technology, Brisbane, QLD, Australia
- Akunah Medical Technology Pty Ltd Company, Brisbane, QLD, Australia
- Jinshuai Bai
- School of Mechanical, Medical, and Process Engineering, Queensland University of Technology, Brisbane, QLD, Australia
- Queensland Unit for Advanced Shoulder Research (QUASR)/ARC Industrial Transformation Training Centre—Joint Biomechanics, Queensland University of Technology, Brisbane, QLD, Australia
- Freek Hollman
- Queensland Unit for Advanced Shoulder Research (QUASR)/ARC Industrial Transformation Training Centre—Joint Biomechanics, Queensland University of Technology, Brisbane, QLD, Australia
- Kristine Italia
- Akunah Medical Technology Pty Ltd Company, Brisbane, QLD, Australia
- Roberto Pareyon
- Queensland Unit for Advanced Shoulder Research (QUASR)/ARC Industrial Transformation Training Centre—Joint Biomechanics, Queensland University of Technology, Brisbane, QLD, Australia
- A. S. Albahri
- Technical College, Imam Ja'afar Al-Sadiq University, Baghdad, Iraq
- Chun Ouyang
- School of Information Systems, Queensland University of Technology, Brisbane, QLD, Australia
- Jose Santamaría
- Department of Computer Science, University of Jaén, Jaén, Spain
- Kenneth Cutbush
- Queensland Unit for Advanced Shoulder Research (QUASR)/ARC Industrial Transformation Training Centre—Joint Biomechanics, Queensland University of Technology, Brisbane, QLD, Australia
- School of Medicine, The University of Queensland, Brisbane, QLD, Australia
- Ashish Gupta
- Queensland Unit for Advanced Shoulder Research (QUASR)/ARC Industrial Transformation Training Centre—Joint Biomechanics, Queensland University of Technology, Brisbane, QLD, Australia
- Akunah Medical Technology Pty Ltd Company, Brisbane, QLD, Australia
- Greenslopes Private Hospital, Brisbane, QLD, Australia
- Amin Abbosh
- School of Information Technology and Electrical Engineering, Brisbane, QLD, Australia
- Yuantong Gu
- School of Mechanical, Medical, and Process Engineering, Queensland University of Technology, Brisbane, QLD, Australia
- Queensland Unit for Advanced Shoulder Research (QUASR)/ARC Industrial Transformation Training Centre—Joint Biomechanics, Queensland University of Technology, Brisbane, QLD, Australia
13
Kim M, Ong KTI, Choi S, Yeo J, Kim S, Han K, Park JE, Kim HS, Choi YS, Ahn SS, Kim J, Lee SK, Sohn B. Natural language processing to predict isocitrate dehydrogenase genotype in diffuse glioma using MR radiology reports. Eur Radiol 2023; 33:8017-8025. [PMID: 37566271] [DOI: 10.1007/s00330-023-10061-z]
Abstract
OBJECTIVES To evaluate the performance of natural language processing (NLP) models to predict isocitrate dehydrogenase (IDH) mutation status in diffuse glioma using routine MR radiology reports. MATERIALS AND METHODS This retrospective, multi-center study included consecutive patients with diffuse glioma with known IDH mutation status from May 2009 to November 2021 whose initial MR radiology report was available prior to pathologic diagnosis. Five NLP models (long short-term memory [LSTM], bidirectional LSTM, bidirectional encoder representations from transformers [BERT], BERT graph convolutional network [GCN], BioBERT) were trained, and area under the receiver operating characteristic curve (AUC) was assessed to validate prediction of IDH mutation status in the internal and external validation sets. The performance of the best performing NLP model was compared with that of the human readers. RESULTS A total of 1427 patients (mean age ± standard deviation, 54 ± 15; 779 men, 54.6%) with 720 patients in the training set, 180 patients in the internal validation set, and 527 patients in the external validation set were included. In the external validation set, BERT GCN showed the highest performance (AUC 0.85, 95% CI 0.81-0.89) in predicting IDH mutation status, which was higher than LSTM (AUC 0.77, 95% CI 0.72-0.81; p = .003) and BioBERT (AUC 0.81, 95% CI 0.76-0.85; p = .03). This was higher than that of a neuroradiologist (AUC 0.80, 95% CI 0.76-0.84; p = .005) and a neurosurgeon (AUC 0.79, 95% CI 0.76-0.84; p = .04). CONCLUSION BERT GCN was externally validated to predict IDH mutation status in patients with diffuse glioma using routine MR radiology reports with superior or at least comparable performance to human reader. CLINICAL RELEVANCE STATEMENT Natural language processing may be used to extract relevant information from routine radiology reports to predict cancer genotype and provide prognostic information that may aid in guiding treatment strategy and enabling personalized medicine. KEY POINTS • A transformer-based natural language processing (NLP) model predicted isocitrate dehydrogenase mutation status in diffuse glioma with an AUC of 0.85 in the external validation set. • The best NLP models were superior or at least comparable to human readers in both internal and external validation sets. • Transformer-based models showed higher performance than conventional NLP model such as long short-term memory.
Affiliation(s)
- Minjae Kim
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Korea
- Kai Tzu-Iunn Ong
- Department of Artificial Intelligence, College of Computing, Yonsei University, Seoul, Korea
- Seonah Choi
- Department of Neurosurgery, Brain Tumor Center, Severance Hospital, Yonsei University College of Medicine, Seoul, Korea
- Jinyoung Yeo
- Department of Artificial Intelligence, College of Computing, Yonsei University, Seoul, Korea
- Sooyon Kim
- Department of Statistics and Data Science, Yonsei University, Seoul, Korea
- Kyunghwa Han
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Ji Eun Park
- Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Korea
- Ho Sung Kim
- Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Korea
- Yoon Seong Choi
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Sung Soo Ahn
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Jinna Kim
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Seung-Koo Lee
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Beomseok Sohn
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea.
- Department of Radiology and Center for Imaging Sciences, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Korea.
14
Kita K, Uemura K, Takao M, Fujimori T, Tamura K, Nakamura N, Wakabayashi G, Kurakami H, Suzuki Y, Wataya T, Nishigaki D, Okada S, Tomiyama N, Kido S. Use of artificial intelligence to identify data elements for The Japanese Orthopaedic Association National Registry from operative records. J Orthop Sci 2023; 28:1392-1399. [PMID: 36163118] [DOI: 10.1016/j.jos.2022.09.003]
Abstract
BACKGROUND The Japanese Orthopaedic Association National Registry (JOANR) was recently launched in Japan and is expected to improve the quality of medical care. However, surgeons must register ten detailed features for total hip arthroplasty, which is labor intensive. One possible solution is to use a system that automatically extracts information about the surgeries. Although it is not easy to extract features from an operative record consisting of free-text data, natural language processing has been used to extract features from operative records. This study aimed to evaluate the best natural language processing method for building a system that automatically detects some elements in the JOANR from the operative records of total hip arthroplasty. METHODS We obtained operative records of total hip arthroplasty (n = 2574) in three hospitals and targeted two items: surgical approach and fixation technique. We compared the accuracy of three natural language processing methods: rule-based algorithms, machine learning, and bidirectional encoder representations from transformers (BERT). RESULTS In the surgical approach task, the accuracy of BERT was superior to that of the rule-based algorithm (99.6% vs. 93.6%, p < 0.001), comparable to machine learning. In the fixation technique task, the accuracy of BERT was superior to the rule-based algorithm and machine learning (96% vs. 74%, p < 0.0001 and 94%, p = 0.0004). CONCLUSIONS BERT is the most appropriate method for building a system that automatically detects the surgical approach and fixation technique.
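A rule-based baseline of the kind compared above can be a handful of keyword rules; the sketch below is an illustrative English-language stand-in, since the study's operative records and rules are in Japanese and are not reproduced here.

```python
import re

# Illustrative keyword rules for the surgical approach; not the study's actual rule set.
APPROACH_RULES = {
    "posterior": re.compile(r"\bposterior approach\b", re.IGNORECASE),
    "anterior": re.compile(r"\b(direct )?anterior approach\b", re.IGNORECASE),
    "lateral": re.compile(r"\b(antero)?lateral approach\b", re.IGNORECASE),
}

def classify_approach(operative_note: str) -> str:
    """Return the first approach whose rule fires, or 'unknown' if none match."""
    for label, pattern in APPROACH_RULES.items():
        if pattern.search(operative_note):
            return label
    return "unknown"

print(classify_approach("THA performed via a direct anterior approach; cementless cup."))
```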
Affiliation(s)
- Kosuke Kita
- Department of Artificial Intelligence Diagnostic Radiology, Graduate School of Medicine, Osaka University, Osaka, Japan.
- Keisuke Uemura
- Department of Orthopaedic Surgery, Graduate School of Medicine, Osaka University, Osaka, Japan
- Masaki Takao
- Department of Orthopaedic Surgery, Graduate School of Medicine, Osaka University, Osaka, Japan
- Takahito Fujimori
- Department of Orthopaedic Surgery, Graduate School of Medicine, Osaka University, Osaka, Japan
- Kazunori Tamura
- Department of Orthopaedic Surgery, Kyowakai Hospital, Osaka, Japan
- Nobuo Nakamura
- Department of Orthopaedic Surgery, Kyowakai Hospital, Osaka, Japan
- Gen Wakabayashi
- Department of Orthopaedic Surgery, Ikeda City Hospital, Osaka, Japan
- Hiroyuki Kurakami
- Department of Medical Innovation, Osaka University Hospital, Osaka, Japan
- Yuki Suzuki
- Department of Artificial Intelligence Diagnostic Radiology, Graduate School of Medicine, Osaka University, Osaka, Japan
- Tomohiro Wataya
- Department of Artificial Intelligence Diagnostic Radiology, Graduate School of Medicine, Osaka University, Osaka, Japan
- Daiki Nishigaki
- Department of Artificial Intelligence Diagnostic Radiology, Graduate School of Medicine, Osaka University, Osaka, Japan
- Seiji Okada
- Department of Orthopaedic Surgery, Graduate School of Medicine, Osaka University, Osaka, Japan
- Shoji Kido
- Department of Artificial Intelligence Diagnostic Radiology, Graduate School of Medicine, Osaka University, Osaka, Japan
15
Simoulin A, Thiebaut N, Neuberger K, Ibnouhsein I, Brunel N, Viné R, Bousquet N, Latapy J, Reix N, Molière S, Lodi M, Mathelin C. From free-text electronic health records to structured cohorts: Onconum, an innovative methodology for real-world data mining in breast cancer. Computer Methods and Programs in Biomedicine 2023; 240:107693. [PMID: 37453367] [DOI: 10.1016/j.cmpb.2023.107693]
Abstract
PURPOSE A considerable amount of valuable information is present in electronic health records (EHRs); however, it remains inaccessible because it is embedded in unstructured narrative documents that cannot be easily analyzed. We wanted to develop and evaluate a methodology able to extract and structure information from electronic health records in breast cancer. METHODS We developed a software platform called Onconum (ClinicalTrials.gov Identifier: NCT02810093) which uses a hybrid method relying on machine learning approaches and rule-based lexical methods. It is based on natural language processing techniques that allow a targeted analysis of free-text medical data related to breast cancer, independently of any pre-existing dictionary, in a French context (available in N files). We then evaluated it on a validation cohort called Senometry. FINDINGS The Senometry cohort included 9,599 patients with breast cancer (both invasive and in situ), treated between 2000 and 2017 in the breast cancer unit of Strasbourg University Hospitals. Extraction rates ranged from 45% to 100%, depending on the type of each parameter. Precision of extracted information was 68%-94% compared to a structured cohort and 89%-98% compared to manually structured databases, and the method retrieved more rare occurrences than another database search engine (+17%). INTERPRETATION This innovative method can accurately structure relevant medical information embedded in EHRs in the context of breast cancer. Missing data handling is the main limitation of this method; however, multiple sources can be incorporated to mitigate it. Nevertheless, this methodology requires neither pre-existing dictionaries nor manually annotated corpora. It can therefore be easily implemented in non-English-speaking countries and for diseases other than breast cancer, and it allows prospective inclusion of new patients.
Affiliation(s)
- Nicolas Bousquet
- Quantmetry, 52 rue d'Anjou, 75008 Paris, France; Sorbonne University, 4 place Jussieu, 75005 Paris, France
- Nathalie Reix
- ICube UMR 7537, Strasbourg University / CNRS, Fédération de Médecine Translationnelle de Strasbourg, 67200 Strasbourg, France; Biochemistry and Molecular Biology Laboratory, Strasbourg University Hospitals, 1 place de l'Hôpital, 67091 Strasbourg, France
- Sébastien Molière
- Radiology Department, Strasbourg University Hospitals, 1 avenue Molière, 67098 Strasbourg, France
- Massimo Lodi
- Institut de cancérologie Strasbourg Europe (ICANS), 17 avenue Albert Calmette, 67033 Strasbourg Cedex, France; Department of Functional Genomics and Cancer, Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS UMR 7104, INSERM U964, Strasbourg University, Illkirch, France; Strasbourg University Hospitals, 1 place de l'Hôpital, 67091 Strasbourg, France.
- Carole Mathelin
- Institut de cancérologie Strasbourg Europe (ICANS), 17 avenue Albert Calmette, 67033 Strasbourg Cedex, France; Department of Functional Genomics and Cancer, Institut de Génétique et de Biologie Moléculaire et Cellulaire, CNRS UMR 7104, INSERM U964, Strasbourg University, Illkirch, France; Strasbourg University Hospitals, 1 place de l'Hôpital, 67091 Strasbourg, France.
16
Elmarakeby HA, Trukhanov PS, Arroyo VM, Riaz IB, Schrag D, Van Allen EM, Kehl KL. Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports. BMC Bioinformatics 2023; 24:328. [PMID: 37658330] [PMCID: PMC10474750] [DOI: 10.1186/s12859-023-05439-1]
Abstract
BACKGROUND Longitudinal data on key cancer outcomes for clinical research, such as response to treatment and disease progression, are not captured in standard cancer registry reporting. Manual extraction of such outcomes from unstructured electronic health records is a slow, resource-intensive process. Natural language processing (NLP) methods can accelerate outcome annotation, but they require substantial labeled data. Transfer learning based on language modeling, particularly using the Transformer architecture, has achieved improvements in NLP performance. However, there has been no systematic evaluation of NLP model training strategies on the extraction of cancer outcomes from unstructured text. RESULTS We evaluated the performance of nine NLP models at the two tasks of identifying cancer response and cancer progression within imaging reports at a single academic center among patients with non-small cell lung cancer. We trained the classification models under different conditions, including training sample size, classification architecture, and language model pre-training. The training involved a labeled dataset of 14,218 imaging reports for 1112 patients with lung cancer. A subset of models was based on a pre-trained language model, DFCI-ImagingBERT, created by further pre-training a BERT-based model using an unlabeled dataset of 662,579 reports from 27,483 patients with cancer from our center. A classifier based on our DFCI-ImagingBERT, trained on more than 200 patients, achieved the best results in most experiments; however, these results were marginally better than simpler "bag of words" or convolutional neural network models. CONCLUSION When developing AI models to extract outcomes from imaging reports for clinical cancer research, if computational resources are plentiful but labeled training data are limited, large language models can be used for zero- or few-shot learning to achieve reasonable performance. When computational resources are more limited but labeled training data are readily available, even simple machine learning architectures can achieve good performance for such tasks.
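The "bag of words" baseline mentioned in the conclusion takes only a few lines with scikit-learn; the sketch below assumes a list of imaging-report strings with binary progression labels, all of which are illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

reports = ["Interval decrease in the dominant right upper lobe mass.",
           "New hepatic lesions concerning for disease progression.",
           "Stable appearance of the known pulmonary nodules.",
           "Enlarging mediastinal adenopathy compared with prior."]   # placeholder reports
progression = [0, 1, 0, 1]                                            # placeholder labels

X_tr, X_te, y_tr, y_te = train_test_split(reports, progression, test_size=0.5,
                                           random_state=0, stratify=progression)
bow = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)),
                    LogisticRegression(max_iter=1000))
bow.fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, bow.predict_proba(X_te)[:, 1]))
```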
Affiliation(s)
- Haitham A Elmarakeby
- Dana-Farber Cancer Institute, Boston, MA, USA.
- Al-Azhar University, Cairo, Egypt.
- Harvard Medical School, Boston, MA, USA.
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Irbaz Bin Riaz
- Dana-Farber Cancer Institute, Boston, MA, USA
- Harvard Medical School, Boston, MA, USA
- Mayo Clinic, Rochester, MN, USA
- Deborah Schrag
- Memorial-Sloan Kettering Cancer Center, New York, NY, USA
- Eliezer M Van Allen
- Dana-Farber Cancer Institute, Boston, MA, USA
- Harvard Medical School, Boston, MA, USA
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Kenneth L Kehl
- Dana-Farber Cancer Institute, Boston, MA, USA
- Harvard Medical School, Boston, MA, USA
17
Hsu E, Bako AT, Potter T, Pan AP, Britz GW, Tannous J, Vahidy FS. Extraction of Radiological Characteristics From Free-Text Imaging Reports Using Natural Language Processing Among Patients With Ischemic and Hemorrhagic Stroke: Algorithm Development and Validation. JMIR AI 2023; 2:e42884. [PMID: 38875556] [PMCID: PMC11041442] [DOI: 10.2196/42884]
Abstract
BACKGROUND Neuroimaging is the gold-standard diagnostic modality for all patients suspected of stroke. However, the unstructured nature of imaging reports remains a major challenge to extracting useful information from electronic health records systems. Despite the increasing adoption of natural language processing (NLP) for radiology reports, information extraction for many stroke imaging features has not been systematically evaluated. OBJECTIVE In this study, we propose an NLP pipeline, which adopts the state-of-the-art ClinicalBERT model with domain-specific pretraining and task-oriented fine-tuning to extract 13 stroke features from head computed tomography imaging notes. METHODS We used the model to generate structured data sets with information on the presence or absence of common stroke features for 24,924 patients with strokes. We compared the survival characteristics of patients with and without features of severe stroke (eg, midline shift, perihematomal edema, or mass effect) using the Kaplan-Meier curve and log-rank tests. RESULTS Pretrained on 82,073 head computed tomography notes with 13.7 million words and fine-tuned on 200 annotated notes, our HeadCT_BERT model achieved an average area under receiver operating characteristic curve of 0.9831, F1-score of 0.8683, and accuracy of 97%. Among patients with acute ischemic stroke, admissions with any severe stroke feature in initial imaging notes were associated with a lower probability of survival (P<.001). CONCLUSIONS Our proposed NLP pipeline achieved high performance and has the potential to improve medical research and patient safety.
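The survival comparison described above maps directly onto the lifelines API; the sketch assumes a table with follow-up time, a death indicator, and the NLP-derived severe-feature flag, with column names and values chosen for illustration.

```python
import pandas as pd
from lifelines import KaplanMeierFitter
from lifelines.statistics import logrank_test

# Placeholder cohort: time-to-event (days), event indicator, NLP-extracted severe-feature flag
df = pd.DataFrame({"days": [30, 365, 90, 400, 12, 250],
                   "died": [1, 0, 1, 0, 1, 0],
                   "severe_feature": [1, 0, 1, 0, 1, 0]})

severe, non_severe = df[df.severe_feature == 1], df[df.severe_feature == 0]

kmf = KaplanMeierFitter()
kmf.fit(severe["days"], severe["died"], label="severe feature present")
# kmf.plot_survival_function() would draw the Kaplan-Meier curve for this group

result = logrank_test(severe["days"], non_severe["days"],
                      event_observed_A=severe["died"], event_observed_B=non_severe["died"])
print("log-rank p-value:", result.p_value)
```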
Affiliation(s)
- Enshuo Hsu
- Center for Health Data Science and Analytics, Houston Methodist Research Institute, Houston, TX, United States
- School of Biomedical Informatics, University of Texas Health Science Center at Houston, Houston, TX, United States
- Abdulaziz T Bako
- Center for Health Data Science and Analytics, Houston Methodist Research Institute, Houston, TX, United States
- Thomas Potter
- Center for Health Data Science and Analytics, Houston Methodist Research Institute, Houston, TX, United States
- Alan P Pan
- Center for Health Data Science and Analytics, Houston Methodist Research Institute, Houston, TX, United States
- Gavin W Britz
- Department of Neurosurgery, Houston Methodist Neurological Institute, Houston, TX, United States
- Department of Neurology, Weill Cornell Medical College, New York, NY, United States
- Jonika Tannous
- Center for Health Data Science and Analytics, Houston Methodist Research Institute, Houston, TX, United States
- Farhaan S Vahidy
- Center for Health Data Science and Analytics, Houston Methodist Research Institute, Houston, TX, United States
- Department of Neurosurgery, Houston Methodist Neurological Institute, Houston, TX, United States
- Department of Population Health Sciences, Weill Cornell Medical College, New York, NY, United States
18
Srinivas S, Young AJ. Machine Learning and Artificial Intelligence in Surgical Research. Surg Clin North Am 2023; 103:299-316. [PMID: 36948720 DOI: 10.1016/j.suc.2022.11.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/24/2023]
Abstract
Machine learning, a subtype of artificial intelligence, is an emerging field of surgical research dedicated to predictive modeling. From its inception, machine learning has been of interest in medical and surgical research. Built on traditional research metrics for optimal success, avenues of research include diagnostics, prognosis, operative timing, and surgical education across a variety of surgical subspecialties. Machine learning represents an exciting and developing future for surgical research that will allow for more personalized and comprehensive medical care.
Affiliation(s)
- Shruthi Srinivas
- Department of Surgery, The Ohio State University, 370 West 9th Avenue, Columbus, OH 43210, USA
- Andrew J Young
- Division of Trauma, Critical Care, and Burn, The Ohio State University, 181 Taylor Avenue, Suite 1102K, Columbus, OH 43203, USA.
19
Tejani AS, Ng YS, Xi Y, Fielding JR, Browning TG, Rayan JC. Performance of Multiple Pretrained BERT Models to Automate and Accelerate Data Annotation for Large Datasets. Radiol Artif Intell 2022; 4:e220007. [PMID: 35923377 PMCID: PMC9344209 DOI: 10.1148/ryai.220007] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2022] [Revised: 06/08/2022] [Accepted: 06/14/2022] [Indexed: 06/15/2023]
Abstract
PURPOSE To develop and evaluate domain-specific and pretrained bidirectional encoder representations from transformers (BERT) models in a transfer learning task on varying training dataset sizes to annotate a larger overall dataset. MATERIALS AND METHODS The authors retrospectively reviewed 69 095 anonymized adult chest radiograph reports (reports dated April 2020-March 2021). From the overall cohort, 1004 reports were randomly selected and labeled for the presence or absence of each of the following devices: endotracheal tube (ETT), enterogastric tube (NGT, or Dobhoff tube), central venous catheter (CVC), and Swan-Ganz catheter (SGC). Pretrained transformer models (BERT, PubMedBERT, DistilBERT, RoBERTa, and DeBERTa) were trained, validated, and tested on 60%, 20%, and 20%, respectively, of these reports through fivefold cross-validation. Additional training involved varying dataset sizes with 5%, 10%, 15%, 20%, and 40% of the 1004 reports. The best-performing epochs were used to assess area under the receiver operating characteristic curve (AUC) and determine run time on the overall dataset. RESULTS The highest average AUCs from fivefold cross-validation were 0.996 for ETT (RoBERTa), 0.994 for NGT (RoBERTa), 0.991 for CVC (PubMedBERT), and 0.98 for SGC (PubMedBERT). DeBERTa demonstrated the highest AUC for each support device trained on 5% of the training set. PubMedBERT showed a higher AUC with a decreasing training set size compared with BERT. Training and validation time was shortest for DistilBERT at 3 minutes 39 seconds on the annotated cohort. CONCLUSION Pretrained and domain-specific transformer models required small training datasets and short training times to create a highly accurate final model that expedites autonomous annotation of large datasets. Keywords: Informatics, Named Entity Recognition, Transfer Learning. Supplemental material is available for this article. ©RSNA, 2022. See also the commentary by Zech in this issue.
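A minimal sketch of the transfer-learning setup evaluated here: fine-tuning a pretrained BERT-family checkpoint on a small set of labeled reports for one device label and scoring AUC on held-out reports. The checkpoint name, toy report texts, and labels are placeholders; this is not the authors' pipeline.

```python
import numpy as np
from datasets import Dataset
from sklearn.metrics import roc_auc_score
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "distilbert-base-uncased"  # stand-in for any BERT-family checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

train = Dataset.from_dict({"text": ["ETT tip 4 cm above the carina.",
                                    "Lines and tubes: none."],
                           "label": [1, 0]})  # 1 = endotracheal tube present
test = Dataset.from_dict({"text": ["Endotracheal tube in standard position.",
                                   "No support devices identified."],
                          "label": [1, 0]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=128)

train, test = train.map(tokenize, batched=True), test.map(tokenize, batched=True)

args = TrainingArguments(output_dir="out", num_train_epochs=1,
                         per_device_train_batch_size=8, report_to="none")
trainer = Trainer(model=model, args=args, train_dataset=train)
trainer.train()

logits = trainer.predict(test).predictions
probs = np.exp(logits)[:, 1] / np.exp(logits).sum(axis=1)  # softmax over the 2 classes
print(roc_auc_score(test["label"], probs))  # AUC on the held-out toy reports
```

Varying the size of `train`, as the study does with 5%-40% subsets, lets one trace how quickly each pretrained checkpoint reaches a usable AUC.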
20
Dipnall JF, Lu J, Gabbe BJ, Cosic F, Edwards E, Page R, Du L. Comparison of state-of-the-art machine and deep learning algorithms to classify proximal humeral fractures using radiology text. Eur J Radiol 2022; 153:110366. [DOI: 10.1016/j.ejrad.2022.110366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2022] [Revised: 04/08/2022] [Accepted: 05/16/2022] [Indexed: 12/01/2022]
21
Bhatnagar R, Sardar S, Beheshti M, Podichetty JT. How can natural language processing help model informed drug development?: a review. JAMIA Open 2022; 5:ooac043. [PMID: 35702625 PMCID: PMC9188322 DOI: 10.1093/jamiaopen/ooac043] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2022] [Revised: 04/28/2022] [Accepted: 05/26/2022] [Indexed: 01/20/2023] Open
Abstract
Objective To summarize applications of natural language processing (NLP) in model informed drug development (MIDD) and identify potential areas of improvement. Materials and Methods Publications were identified through PubMed and Google Scholar, as well as websites and GitHub repositories for NLP libraries and models. Publications describing applications of NLP in MIDD were reviewed. The applications were stratified into 3 stages: drug discovery, clinical trials, and pharmacovigilance. Key NLP functionalities used for these applications were assessed. Programming libraries and open-source resources for the implementation of NLP functionalities in MIDD were identified. Results NLP has been utilized to aid various processes in the drug development lifecycle, such as gene-disease mapping, biomarker discovery, patient-trial matching, and adverse drug event detection. These applications commonly use NLP functionalities of named entity recognition, word embeddings, entity resolution, assertion status detection, relation extraction, and topic modeling. The current state of the art for implementing these functionalities in MIDD applications is transformer models that utilize transfer learning for enhanced performance. Various libraries in Python, R, and Java, such as huggingface, sparkNLP, and KoRpus, as well as open-source platforms such as DisGeNet, DeepEnroll, and Transmol, have enabled convenient implementation of NLP models in MIDD applications. Discussion Challenges such as reproducibility, explainability, fairness, limited data, limited language support, and security need to be overcome to ensure wider adoption of NLP in the MIDD landscape. There are opportunities to improve the performance of existing models and expand the use of NLP in newer areas of MIDD. Conclusions This review provides an overview of the potential and pitfalls of current NLP approaches in MIDD.
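Several of the NLP functionalities surveyed here, named entity recognition in particular, can be exercised through the huggingface transformers library mentioned in the abstract. Below is a minimal sketch; the checkpoint dslim/bert-base-NER is a general-purpose placeholder, and a biomedical NER model would normally be substituted to recognize drug, gene, or disease entities.

```python
from transformers import pipeline

# Off-the-shelf token-classification (NER) pipeline; aggregation merges
# word pieces into whole entity spans.
ner = pipeline("token-classification",
               model="dslim/bert-base-NER",
               aggregation_strategy="simple")

text = ("Pembrolizumab, developed by Merck, was evaluated at Memorial Sloan "
        "Kettering Cancer Center in patients with non-small cell lung cancer.")

for entity in ner(text):
    print(entity["entity_group"], entity["word"], round(float(entity["score"]), 3))
```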
Affiliation(s)
- Roopal Bhatnagar
- Data Science, Data Collaboration Center, Critical Path Institute, Tucson, Arizona, USA
- Sakshi Sardar
- Quantitative Medicine, Critical Path Institute, Tucson, Arizona, USA
- Maedeh Beheshti
- Quantitative Medicine, Critical Path Institute, Tucson, Arizona, USA
22
Li YH, Lee IT, Chen YW, Lin YK, Liu YH, Lai FP. Using Text Content From Coronary Catheterization Reports to Predict 5-Year Mortality Among Patients Undergoing Coronary Angiography: A Deep Learning Approach. Front Cardiovasc Med 2022; 9:800864. [PMID: 35295250 PMCID: PMC8918537 DOI: 10.3389/fcvm.2022.800864] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2021] [Accepted: 01/24/2022] [Indexed: 11/13/2022] Open
Abstract
Background Current predictive models for patients undergoing coronary angiography have complex parameters which limit their clinical application. Coronary catheterization reports that describe coronary lesions and the corresponding interventions provide information on the severity of the coronary artery disease and the completeness of the revascularization. This information is relevant for predicting patient prognosis. However, no predictive model has been constructed using the text content from coronary catheterization reports before. Objective To develop a deep learning model using text content from coronary catheterization reports to predict 5-year all-cause mortality and 5-year cardiovascular mortality for patients undergoing coronary angiography and to compare the performance of the model to the established clinical scores. Method This retrospective cohort study was conducted between January 1, 2006, and December 31, 2015. Patients admitted for coronary angiography were enrolled and followed up until August 2019. The main outcomes were 5-year all-cause mortality and 5-year cardiovascular mortality. In total, 11,576 coronary catheterization reports were collected. BioBERT (bidirectional encoder representations from transformers for biomedical text mining), which is a BERT-based model in the biomedical domain, was utilized to construct the model. The area under the receiver operating characteristic curve (AUC) was used to assess model performance. We also compared our results to the residual SYNTAX (SYNergy between PCI with TAXUS and Cardiac Surgery) score. Results The dataset was divided into the training (60%), validation (20%), and test (20%) sets. The mean age of the patients in each dataset was 65.5 ± 12.1, 65.4 ± 11.2, and 65.6 ± 11.2 years, respectively. A total of 1,411 (12.2%) patients died, and 664 (5.8%) patients died of cardiovascular causes within 5 years after coronary angiography. The best of our models had an AUC of 0.822 (95% CI, 0.790–0.855) for 5-year all-cause mortality, and an AUC of 0.858 (95% CI, 0.816–0.900) for 5-year cardiovascular mortality. We randomly selected 300 patients who underwent percutaneous coronary intervention (PCI), and our model outperformed the residual SYNTAX score in predicting 5-year all-cause mortality (AUC, 0.867 [95% CI, 0.813–0.921] vs. 0.590 [95% CI, 0.503–0.684]) and 5-year cardiovascular mortality (AUC, 0.880 [95% CI, 0.873–0.925] vs. 0.649 [95% CI, 0.535–0.764]), respectively, after PCI among these patients. Conclusions We developed a predictive model using text content from coronary catheterization reports to predict the 5-year mortality in patients undergoing coronary angiography. Since interventional cardiologists routinely write reports after procedures, our model can be easily implemented into the clinical setting.
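The AUCs above are reported with 95% confidence intervals; one common, simple way to obtain such intervals is a nonparametric bootstrap over patients. The sketch below assumes toy outcome labels and predicted risks and is not the study's code.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=500)                         # toy 5-year mortality labels
y_prob = np.clip(0.3 * y_true + 0.7 * rng.random(500), 0, 1)  # toy predicted risks

aucs = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), size=len(y_true))      # resample with replacement
    if len(np.unique(y_true[idx])) < 2:                       # AUC needs both classes
        continue
    aucs.append(roc_auc_score(y_true[idx], y_prob[idx]))

lo, hi = np.percentile(aucs, [2.5, 97.5])
print(f"AUC {roc_auc_score(y_true, y_prob):.3f} (95% CI, {lo:.3f}-{hi:.3f})")
```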
Affiliation(s)
- Yu-Hsuan Li
- Department of Computer Science & Information Engineering, National Taiwan University, Taipei, Taiwan
- Division of Endocrinology and Metabolism, Department of Internal Medicine, Taichung Veterans General Hospital, Taichung, Taiwan
- I-Te Lee
- Division of Endocrinology and Metabolism, Department of Internal Medicine, Taichung Veterans General Hospital, Taichung, Taiwan
- School of Medicine, National Yang-Ming University, Taipei, Taiwan
- School of Medicine, Chung Shan Medical University, Taichung, Taiwan
- Yu-Wei Chen
- Cardiovascular Center, Taichung Veterans General Hospital, Taichung, Taiwan
- Yow-Kuan Lin
- Department of Computer Science, Columbia University, New York, NY, United States
- Yu-Hsin Liu
- Department of Computer Science, Columbia University, New York, NY, United States
- Fei-Pei Lai
- Department of Computer Science & Information Engineering, National Taiwan University, Taipei, Taiwan
- Graduate Institute of Biomedical Electronics and Bioinformatics, National Taiwan University, Taipei, Taiwan
- Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan
- Correspondence: Fei-Pei Lai
23
Zheng Y, Dickson VV, Blecker S, Ng JM, Rice BC, Melkus GD, Shenkar L, Mortejo MCR, Johnson SB. Identifying Patients with Hypoglycemia Using Natural Language Processing: A Systematic Literature Review (Preprint). JMIR Diabetes 2021; 7:e34681. [PMID: 35576579 PMCID: PMC9152713 DOI: 10.2196/34681] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2021] [Revised: 04/03/2022] [Accepted: 04/08/2022] [Indexed: 01/22/2023] Open
Abstract
Background Accurately identifying patients with hypoglycemia is key to preventing adverse events and mortality. Natural language processing (NLP), a form of artificial intelligence, uses computational algorithms to extract information from text data. NLP is a scalable, efficient, and quick method to extract hypoglycemia-related information when using electronic health record data sources from a large population. Objective The objective of this systematic review was to synthesize the literature on the application of NLP to extract hypoglycemia from electronic health record clinical notes. Methods Literature searches were conducted electronically in PubMed, Web of Science Core Collection, CINAHL (EBSCO), PsycINFO (Ovid), IEEE Xplore, Google Scholar, and ACL Anthology. Keywords included hypoglycemia, low blood glucose, NLP, and machine learning. Inclusion criteria were studies that applied NLP to identify hypoglycemia, reported outcomes related to hypoglycemia, and were published in English as full papers. Results This review (n=8 studies) revealed heterogeneity in the reported results related to hypoglycemia. Of the 8 included studies, 4 (50%) reported that the prevalence rate of any level of hypoglycemia was 3.4% to 46.2%. The use of NLP to analyze clinical notes improved the capture of hypoglycemic events that were undocumented or missed by International Classification of Diseases, Ninth Revision (ICD-9) or Tenth Revision (ICD-10) codes and laboratory testing. The combination of NLP and ICD-9 or ICD-10 codes significantly increased the identification of hypoglycemic events compared with individual methods; for example, the prevalence rates of hypoglycemia were 12.4% for International Classification of Diseases codes, 25.1% for an NLP algorithm, and 32.2% for combined algorithms. All the reviewed studies applied rule-based NLP algorithms to identify hypoglycemia. Conclusions The findings provided evidence that the application of NLP to analyze clinical notes improved the capture of hypoglycemic events, particularly when combined with ICD-9 or ICD-10 codes and laboratory testing.
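The reviewed studies used rule-based NLP, often combined with ICD codes. Below is a minimal sketch of that idea with an invented keyword pattern, example ICD codes, and toy notes; real systems would add negation handling and validated lexicons.

```python
import re

# Hypothetical keyword/value pattern for hypoglycemia mentions; negation
# handling ("no hypoglycemia") is deliberately omitted for brevity.
HYPO_PATTERN = re.compile(
    r"\b(hypoglyc\w*|low blood (sugar|glucose)|glucose\s*(of\s*)?[1-6]?\d\b)",
    re.IGNORECASE)

ICD_HYPO_CODES = {"E16.2", "251.2"}  # example ICD-10 / ICD-9 hypoglycemia codes

patients = [
    {"note": "Found diaphoretic, glucose 48, improved after juice.", "icd": {"I10"}},
    {"note": "Routine diabetes follow-up; glucose well controlled.", "icd": set()},
    {"note": "Seen for medication review, A1c 7.2%.",                "icd": {"E16.2"}},
]

def flagged(patient):
    nlp_hit = bool(HYPO_PATTERN.search(patient["note"]))  # rule-based NLP flag
    icd_hit = bool(patient["icd"] & ICD_HYPO_CODES)       # structured-code flag
    return nlp_hit or icd_hit

prevalence = sum(flagged(p) for p in patients) / len(patients)
print(f"combined prevalence: {prevalence:.1%}")  # combined algorithm, as in the review
```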
Affiliation(s)
- Yaguang Zheng
- Rory Meyers College of Nursing, New York University, New York, NY, United States
- Saul Blecker
- Department of Population Health, Grossman School of Medicine, New York University, New York, NY, United States
- Jason M Ng
- Division of Endocrinology and Metabolism, Department of Medicine, University of Pittsburgh, Pittsburgh, PA, United States
- Gail D'Eramo Melkus
- Rory Meyers College of Nursing, New York University, New York, NY, United States
- Liat Shenkar
- Lehigh Valley Health Network, Lehigh Valley Reilly Children's Hospital, Allentown, PA, United States
- Stephen B Johnson
- Department of Population Health, Grossman School of Medicine, New York University, New York, NY, United States