1
Bhayana R, Biswas S, Cook TS, Kim W, Kitamura FC, Gichoya J, Yi PH. From Bench to Bedside With Large Language Models: AJR Expert Panel Narrative Review. AJR Am J Roentgenol 2024:1-10. [PMID: 38598354] [DOI: 10.2214/ajr.24.30928]
Abstract
Large language models (LLMs) hold immense potential to revolutionize radiology. However, their integration into practice requires careful consideration. Artificial intelligence (AI) chatbots and general-purpose LLMs have potential pitfalls related to privacy, transparency, and accuracy, limiting their current clinical readiness. Thus, LLM-based tools must be optimized for radiology practice to overcome these limitations. Although research and validation for radiology applications remain in their infancy, commercial products incorporating LLMs are becoming available alongside promises of transforming practice. To help radiologists navigate this landscape, this AJR Expert Panel Narrative Review provides a multidimensional perspective on LLMs, encompassing considerations from bench (development and optimization) to bedside (use in practice). At present, LLMs are not autonomous entities that can replace expert decision-making, and radiologists remain responsible for the content of their reports. Patient-facing tools, particularly medical AI chatbots, require additional guardrails to ensure safety and prevent misuse. Still, if responsibly implemented, LLMs are well-positioned to transform efficiency and quality in radiology. Radiologists must be well-informed and proactively involved in guiding the implementation of LLMs in practice to mitigate risks and maximize benefits to patient care.
Affiliation(s)
- Rajesh Bhayana
- University Medical Imaging Toronto, Joint Department of Medical Imaging, University Health Network, Department of Medical Imaging, University of Toronto, Toronto, ON, Canada
- Joint Department of Medical Imaging, Toronto General Hospital, 200 Elizabeth St, Peter Munk Bldg, 1st Fl, Toronto, ON M5G 24C, Canada

- Som Biswas
- Department of Radiology, Le Bonheur Children's Hospital, University of Tennessee Health Science Center, Memphis, TN

- Tessa S Cook
- Department of Radiology, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, PA

- Woojin Kim
- Department of Radiology, Palo Alto VA Medical Center, Palo Alto, CA

- Felipe C Kitamura
- Department of Diagnostic Imaging, Universidade Federal de São Paulo, São Paulo, Brazil
- Dasa, São Paulo, Brazil

- Judy Gichoya
- Department of Radiology, Emory University School of Medicine, Atlanta, GA

- Paul H Yi
- Department of Diagnostic Radiology and Nuclear Medicine, University of Maryland School of Medicine, Baltimore, MD
2
Ding JE, Thao PNM, Peng WC, Wang JZ, Chug CC, Hsieh MC, Tseng YC, Chen L, Luo D, Wu C, Wang CT, Hsu CH, Chen YT, Chen PF, Liu F, Hung FM. Large language multimodal models for new-onset type 2 diabetes prediction using five-year cohort electronic health records. Sci Rep 2024; 14:20774. [PMID: 39237580] [PMCID: PMC11377777] [DOI: 10.1038/s41598-024-71020-2]
Abstract
Type 2 diabetes mellitus (T2DM) is a prevalent health challenge faced by countries worldwide. In this study, we propose a novel framework of large language multimodal models (LLMMs) that incorporates multimodal data from clinical notes and laboratory results for diabetes risk prediction. We collected five years of electronic health records (EHRs), dating from 2017 to 2021, from a Taiwan hospital database. This dataset included 1,420,596 clinical notes, 387,392 laboratory results, and more than 1505 laboratory test items. Our method combined a text embedding encoder and a multi-head attention layer to learn laboratory values, and utilized a deep neural network (DNN) module to merge blood features with chronic disease semantics into a latent space. In our experiments, we observed that integrating clinical notes with predictions based on textual laboratory values significantly enhanced the predictive capability of the unimodal model in the early detection of T2DM. Moreover, we achieved an area under the receiver operating characteristic curve (AUC) greater than 0.70 for new-onset T2DM prediction, demonstrating the effectiveness of leveraging textual laboratory data for training and inference in LLMs and improving the accuracy of new-onset diabetes prediction.
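Illustrative aside (not from the study): the AUC reported above can be read as a rank statistic, namely the probability that a randomly chosen patient who develops T2DM receives a higher predicted risk than a randomly chosen patient who does not. A minimal sketch with invented labels and scores:

```python
def auc_score(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the fraction of positive/negative pairs in which the positive
    case is ranked higher (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical risk scores (1 = developed new-onset T2DM).
labels = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.90, 0.75, 0.40, 0.65, 0.30, 0.20, 0.55, 0.10]
print(round(auc_score(labels, scores), 3))  # → 0.867
```

For cohort-scale data an O(n log n) rank-sort implementation or a library routine would be used instead; the quadratic pairwise form above is just the definition made executable.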
Affiliation(s)
- Jun-En Ding
- School of Systems and Enterprises, Stevens Institute of Technology, Hoboken, USA

- Phan Nguyen Minh Thao
- Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu City, Taiwan

- Wen-Chih Peng
- Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu City, Taiwan

- Jian-Zhe Wang
- Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu City, Taiwan

- Chun-Cheng Chug
- Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu City, Taiwan

- Min-Chen Hsieh
- Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu City, Taiwan

- Yun-Chien Tseng
- Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu City, Taiwan

- Ling Chen
- Institute of Hospital and Health Care Administration, National Yang Ming Chiao Tung University, Taipei City, Taiwan

- Dongsheng Luo
- School of Computing and Information Science, Florida International University, Miami, USA

- Chenwei Wu
- Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA

- Chi-Te Wang
- Center of Artificial Intelligence, Far Eastern Memorial Hospital, New Taipei City, Taiwan

- Chih-Ho Hsu
- Department of Surgery, Far Eastern Memorial Hospital, New Taipei City, Taiwan

- Yi-Tui Chen
- Smart Healthcare Interdisciplinary College, National Taipei University of Nursing and Health Sciences, Taipei City, Taiwan

- Pei-Fu Chen
- Department of Anesthesiology, Far Eastern Memorial Hospital, New Taipei City, Taiwan
- Department of Electrical Engineering, Yuan Ze University, Taoyuan, Taiwan

- Feng Liu
- School of Systems and Enterprises, Stevens Institute of Technology, Hoboken, USA

- Fang-Ming Hung
- Surgical Trauma Intensive Care Unit, Far Eastern Memorial Hospital, New Taipei City, Taiwan
- Smart Healthcare Interdisciplinary College, National Taipei University of Nursing and Health Sciences, Taipei City, Taiwan
3
Huemann Z, Tie X, Hu J, Bradshaw TJ. ConTEXTual Net: A Multimodal Vision-Language Model for Segmentation of Pneumothorax. Journal of Imaging Informatics in Medicine 2024; 37:1652-1663. [PMID: 38485899] [PMCID: PMC11300752] [DOI: 10.1007/s10278-024-01051-8]
Abstract
Radiology narrative reports often describe characteristics of a patient's disease, including its location, size, and shape. Motivated by the recent success of multimodal learning, we hypothesized that this descriptive text could guide medical image analysis algorithms. We proposed a novel vision-language model, ConTEXTual Net, for the task of pneumothorax segmentation on chest radiographs. ConTEXTual Net extracts language features from physician-generated free-form radiology reports using a pre-trained language model. We then introduced cross-attention between the language features and the intermediate embeddings of an encoder-decoder convolutional neural network to enable language guidance for image analysis. ConTEXTual Net was trained on the CANDID-PTX dataset consisting of 3196 positive cases of pneumothorax with segmentation annotations from 6 different physicians as well as clinical radiology reports. Using cross-validation, ConTEXTual Net achieved a Dice score of 0.716±0.016, which was similar to the degree of inter-reader variability (0.712±0.044) computed on a subset of the data. It outperformed vision-only models (Swin UNETR: 0.670±0.015, ResNet50 U-Net: 0.677±0.015, GLoRIA: 0.686±0.014, and nnUNet 0.694±0.016) and a competing vision-language model (LAVT: 0.706±0.009). Ablation studies confirmed that it was the text information that led to the performance gains. Additionally, we show that certain augmentation methods degraded ConTEXTual Net's segmentation performance by breaking the image-text concordance. We also evaluated the effects of using different language models and activation functions in the cross-attention module, highlighting the efficacy of our chosen architectural design.
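Illustrative aside (not from the study): the Dice score used to compare ConTEXTual Net against vision-only baselines measures the overlap between a predicted and a reference segmentation mask. A minimal sketch over invented binary masks (flattened to lists for brevity):

```python
def dice(mask_a, mask_b):
    """Dice similarity coefficient between two binary masks
    (flattened 0/1 sequences): 2|A ∩ B| / (|A| + |B|)."""
    inter = sum(a * b for a, b in zip(mask_a, mask_b))
    total = sum(mask_a) + sum(mask_b)
    # Convention: two empty masks count as perfect agreement.
    return 2 * inter / total if total else 1.0

pred  = [0, 1, 1, 1, 0, 0, 1, 0]  # hypothetical model output
truth = [0, 1, 1, 0, 0, 0, 1, 1]  # hypothetical physician annotation
print(round(dice(pred, truth), 3))  # → 0.75
```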
Affiliation(s)
- Zachary Huemann
- Department of Radiology, University of Wisconsin-Madison, Madison, WI, 53705, USA

- Xin Tie
- Department of Radiology, University of Wisconsin-Madison, Madison, WI, 53705, USA

- Junjie Hu
- Departments of Biostatistics and Computer Science, University of Wisconsin-Madison, Madison, WI, 53705, USA

- Tyler J Bradshaw
- Department of Radiology, University of Wisconsin-Madison, Madison, WI, 53705, USA
4
Kim S, Kim SS, Kim E, Cecchini M, Park MS, Choi JA, Kim SH, Hwang HK, Kang CM, Choi HJ, Shin SJ, Kang J, Lee CK. Deep-Transfer-Learning-Based Natural Language Processing of Serial Free-Text Computed Tomography Reports for Predicting Survival of Patients With Pancreatic Cancer. JCO Clin Cancer Inform 2024; 8:e2400021. [PMID: 39151114] [DOI: 10.1200/cci.24.00021]
Abstract
PURPOSE To explore the predictive potential of serial computed tomography (CT) radiology reports for pancreatic cancer survival using natural language processing (NLP). METHODS Deep-transfer-learning-based NLP models were retrospectively trained and tested with serial, free-text CT reports and survival information of consecutive patients diagnosed with pancreatic cancer in a Korean tertiary hospital. Randomly selected patients with pancreatic cancer and their serial CT reports from an independent tertiary hospital in the United States were included in the external testing data set. The concordance index (c-index) between predicted and actual survival and the area under the receiver operating characteristic curve (AUROC) for predicting 1-year survival were calculated. RESULTS Between January 2004 and June 2021, 2,677 patients with 12,255 CT reports and 670 patients with 3,058 CT reports were allocated to the training and internal testing data sets, respectively. A ClinicalBERT (Bidirectional Encoder Representations from Transformers) model trained on only the first CT report of each patient showed a c-index of 0.653 and an AUROC of 0.722 in predicting the overall survival of patients with pancreatic cancer. ClinicalBERT trained on up to 15 consecutive reports from the initial report showed an improved c-index of 0.811 and an AUROC of 0.911. On the external testing set of 273 patients with 1,947 CT reports, the AUROC was 0.888, indicating the generalizability of our model. Further analyses showed our model's contextual interpretation beyond specific phrases. CONCLUSION A deep-transfer-learning-based NLP model of serial CT reports can predict the survival of patients with pancreatic cancer. Clinical decisions can be supported by the developed model, with survival information extracted solely from serial radiology reports.
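Illustrative aside (not from the study): the concordance index quoted above measures how often the model ranks patients' risks consistently with their observed survival. A minimal Harrell-style sketch with invented data; censored patients (event flag 0) contribute to a pair only as the longer-surviving member:

```python
from itertools import combinations

def concordance_index(times, events, risks):
    """Harrell's c-index: among comparable pairs (the patient with the
    shorter follow-up had an observed event), the fraction in which the
    model assigned that patient the higher risk (ties count 0.5)."""
    num = den = 0.0
    for i, j in combinations(range(len(times)), 2):
        if times[j] < times[i]:
            i, j = j, i          # order so that i has the shorter time
        if times[i] == times[j] or not events[i]:
            continue             # pair not comparable
        den += 1
        if risks[i] > risks[j]:
            num += 1
        elif risks[i] == risks[j]:
            num += 0.5
    return num / den

# Hypothetical data: follow-up in months, event flag (1 = death), model risk.
times  = [5, 12, 20, 30, 8]
events = [1, 1, 0, 0, 1]
risks  = [0.9, 0.6, 0.2, 0.3, 0.5]
print(round(concordance_index(times, events, risks), 2))  # → 0.89
```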
Affiliation(s)
- Sunkyu Kim
- Department of Computer Science and Engineering, Korea University, Seoul, Korea

- Seung-Seob Kim
- Department of Radiology and Research Institute of Radiological Science, Severance Hospital, Yonsei University College of Medicine, Seoul, Korea
- Pancreaticobiliary Cancer Clinic, Yonsei Cancer Center, Severance Hospital, Seoul, Korea

- Eejung Kim
- Department of Internal Medicine (Medical Oncology), Yale University School of Medicine, New Haven, CT
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA

- Michael Cecchini
- Department of Internal Medicine (Medical Oncology), Yale University School of Medicine, New Haven, CT

- Mi-Suk Park
- Department of Radiology and Research Institute of Radiological Science, Severance Hospital, Yonsei University College of Medicine, Seoul, Korea
- Pancreaticobiliary Cancer Clinic, Yonsei Cancer Center, Severance Hospital, Seoul, Korea

- Ji A Choi
- Song-dang Institute for Cancer Research, Yonsei University College of Medicine, Seoul, Korea

- Sung Hyun Kim
- Pancreaticobiliary Cancer Clinic, Yonsei Cancer Center, Severance Hospital, Seoul, Korea
- Department of Surgery, Yonsei University College of Medicine, Seoul, Korea

- Ho Kyoung Hwang
- Pancreaticobiliary Cancer Clinic, Yonsei Cancer Center, Severance Hospital, Seoul, Korea
- Department of Surgery, Yonsei University College of Medicine, Seoul, Korea

- Chang Moo Kang
- Pancreaticobiliary Cancer Clinic, Yonsei Cancer Center, Severance Hospital, Seoul, Korea
- Department of Surgery, Yonsei University College of Medicine, Seoul, Korea

- Hye Jin Choi
- Pancreaticobiliary Cancer Clinic, Yonsei Cancer Center, Severance Hospital, Seoul, Korea
- Division of Medical Oncology, Department of Internal Medicine, Yonsei University College of Medicine, Seoul, Korea

- Sang Joon Shin
- Song-dang Institute for Cancer Research, Yonsei University College of Medicine, Seoul, Korea
- Division of Medical Oncology, Department of Internal Medicine, Yonsei University College of Medicine, Seoul, Korea

- Jaewoo Kang
- Department of Computer Science and Engineering, Korea University, Seoul, Korea
- AIGEN Sciences Inc, Seoul, Korea

- Choong-Kun Lee
- Pancreaticobiliary Cancer Clinic, Yonsei Cancer Center, Severance Hospital, Seoul, Korea
- Song-dang Institute for Cancer Research, Yonsei University College of Medicine, Seoul, Korea
- Division of Medical Oncology, Department of Internal Medicine, Yonsei University College of Medicine, Seoul, Korea
5
Kanzawa J, Yasaka K, Fujita N, Fujiwara S, Abe O. Automated classification of brain MRI reports using fine-tuned large language models. Neuroradiology 2024. [PMID: 38995393] [DOI: 10.1007/s00234-024-03427-7]
Abstract
PURPOSE This study aimed to investigate the efficacy of fine-tuned large language models (LLMs) in classifying brain MRI reports into pretreatment, posttreatment, and nontumor cases. METHODS This retrospective study included 759, 284, and 164 brain MRI reports for the training, validation, and test datasets, respectively. Radiologists stratified the reports into three groups: nontumor (group 1), posttreatment tumor (group 2), and pretreatment tumor (group 3) cases. A pretrained Bidirectional Encoder Representations from Transformers Japanese model was fine-tuned using the training dataset and evaluated on the validation dataset. The model that demonstrated the highest accuracy on the validation dataset was selected as the final model. Two additional radiologists were involved in classifying the reports in the test dataset into the three groups. The model's performance on the test dataset was compared to that of the two radiologists. RESULTS The fine-tuned LLM attained an overall accuracy of 0.970 (95% CI: 0.930-0.990). The model's sensitivity for group 1/2/3 was 1.000/0.864/0.978. The model's specificity for group 1/2/3 was 0.991/0.993/0.958. No statistically significant differences were found in terms of accuracy, sensitivity, and specificity between the LLM and the human readers (p ≥ 0.371). The LLM completed the classification task approximately 20-26-fold faster than the radiologists. The area under the receiver operating characteristic curve for discriminating groups 2 and 3 from group 1 was 0.994 (95% CI: 0.982-1.000), and for discriminating group 3 from groups 1 and 2 it was 0.992 (95% CI: 0.982-1.000). CONCLUSION The fine-tuned LLM demonstrated performance comparable to radiologists in classifying brain MRI reports, while requiring substantially less time.
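Illustrative aside (not from the study): the per-group sensitivity and specificity figures above are one-vs-rest statistics over a three-class confusion matrix. A minimal sketch with invented labels (1 = nontumor, 2 = posttreatment, 3 = pretreatment):

```python
def sens_spec(y_true, y_pred, cls):
    """One-vs-rest sensitivity and specificity for class `cls`."""
    tp = sum(t == cls and p == cls for t, p in zip(y_true, y_pred))
    fn = sum(t == cls and p != cls for t, p in zip(y_true, y_pred))
    tn = sum(t != cls and p != cls for t, p in zip(y_true, y_pred))
    fp = sum(t != cls and p == cls for t, p in zip(y_true, y_pred))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical report-level labels for a tiny test set.
y_true = [1, 1, 1, 2, 2, 2, 3, 3, 3, 3]
y_pred = [1, 1, 1, 2, 2, 3, 3, 3, 3, 2]
for cls in (1, 2, 3):
    print(cls, sens_spec(y_true, y_pred, cls))
```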
Affiliation(s)
- Jun Kanzawa
- Department of Radiology, The University of Tokyo Hospital, Tokyo, Japan

- Koichiro Yasaka
- Department of Radiology, The University of Tokyo Hospital, Tokyo, Japan

- Nana Fujita
- Department of Radiology, The University of Tokyo Hospital, Tokyo, Japan

- Shin Fujiwara
- Department of Radiology, The University of Tokyo Hospital, Tokyo, Japan

- Osamu Abe
- Department of Radiology, The University of Tokyo Hospital, Tokyo, Japan
6
Nakai H, Suman G, Adamo DA, Navin PJ, Bookwalter CA, LeGout JD, Chen FK, Wellnitz CV, Silva AC, Thomas JV, Kawashima A, Fan JW, Froemming AT, Lomas DJ, Humphreys MR, Dora C, Korfiatis P, Takahashi N. Natural language processing pipeline to extract prostate cancer-related information from clinical notes. Eur Radiol 2024. [PMID: 38842692] [DOI: 10.1007/s00330-024-10812-6]
Abstract
OBJECTIVES To develop an automated pipeline for extracting prostate cancer-related information from clinical notes. MATERIALS AND METHODS This retrospective study included 23,225 patients who underwent prostate MRI between 2017 and 2022. Cancer risk factors (family history of cancer and digital rectal exam findings), pre-MRI prostate pathology, and treatment history of prostate cancer were extracted from free-text clinical notes in English as binary or multi-class classification tasks. Any sentence containing pre-defined keywords was extracted from clinical notes within one year before the MRI. After manually creating sentence-level datasets with ground truth, Bidirectional Encoder Representations from Transformers (BERT)-based sentence-level models were fine-tuned using the extracted sentence as input and the category as output. The patient-level output was determined by compilation of multiple sentence-level outputs using tree-based models. Sentence-level classification performance was evaluated using the area under the receiver operating characteristic curve (AUC) on 15% of the sentence-level dataset (sentence-level test set). Patient-level classification performance was evaluated on a patient-level test set created by radiologists through review of the clinical notes of 603 patients. Accuracy and sensitivity were compared between the pipeline and radiologists. RESULTS Sentence-level AUCs were ≥ 0.94. The pipeline showed higher patient-level sensitivity than radiologists for extracting cancer risk factors (e.g., family history of prostate cancer, 96.5% vs. 77.9%, p < 0.001), but lower accuracy in classifying pre-MRI prostate pathology (92.5% vs. 95.9%, p = 0.002) and treatment history of prostate cancer (95.5% vs. 97.7%, p = 0.03). CONCLUSION The proposed pipeline showed promising performance, especially for extracting cancer risk factors from patients' clinical notes.
CLINICAL RELEVANCE STATEMENT The natural language processing pipeline showed a higher sensitivity for extracting prostate cancer risk factors than radiologists and may help efficiently gather relevant text information when interpreting prostate MRI. KEY POINTS When interpreting prostate MRI, it is necessary to extract prostate cancer-related information from clinical notes. This pipeline extracted the presence of prostate cancer risk factors with higher sensitivity than radiologists. Natural language processing may help radiologists efficiently gather relevant prostate cancer-related text information.
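Illustrative aside (not from the study): the two stages described above, keyword-based sentence extraction followed by compilation of sentence-level outputs into a patient-level decision, can be sketched as below. The keyword list, sentence regex, threshold, and the any-sentence-above-threshold rule are invented stand-ins; the paper itself aggregates with tree-based models:

```python
import re

KEYWORDS = {"biopsy", "gleason", "prostatectomy"}  # hypothetical keyword list

def extract_sentences(note):
    """Pull out any sentence containing a pre-defined keyword."""
    sentences = re.split(r"(?<=[.!?])\s+", note)
    return [s for s in sentences if any(k in s.lower() for k in KEYWORDS)]

def patient_level(sentence_probs, threshold=0.5):
    """Toy stand-in for the tree-based compiler: flag the patient
    if any sentence-level probability clears the threshold."""
    return max(sentence_probs, default=0.0) >= threshold

note = ("Patient presents for follow-up. Prior biopsy showed Gleason 3+4. "
        "No family history reported.")
hits = extract_sentences(note)
print(len(hits), patient_level([0.2, 0.91]))  # → 1 True
```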
Affiliation(s)
- Garima Suman
- Department of Radiology, Mayo Clinic, Rochester, MN, USA

- Daniel A Adamo
- Department of Radiology, Mayo Clinic, Rochester, MN, USA

- Frank K Chen
- Department of Radiology, Mayo Clinic, Jacksonville, FL, USA

- Alvin C Silva
- Department of Radiology, Mayo Clinic, Scottsdale, AZ, USA

- John V Thomas
- Department of Radiology, Mayo Clinic, Rochester, MN, USA

- Jungwei W Fan
- Department of Artificial Intelligence and Informatics, Mayo Clinic, Rochester, MN, USA

- Derek J Lomas
- Department of Urology, Mayo Clinic, Rochester, MN, USA

- Chandler Dora
- Department of Urology, Mayo Clinic, Jacksonville, FL, USA
7
Yao J, Alabousi A, Mironov O. Evaluation of a BERT Natural Language Processing Model for Automating CT and MRI Triage and Protocol Selection. Can Assoc Radiol J 2024:8465371241255895. [PMID: 38832645] [DOI: 10.1177/08465371241255895]
Abstract
Purpose: To evaluate the accuracy of a Bidirectional Encoder Representations from Transformers (BERT) Natural Language Processing (NLP) model for automating triage and protocol selection of cross-sectional image requisitions. Methods: A retrospective study was completed using 222,392 CT and MRI studies from a single Canadian university hospital database (January 2018-September 2022). Three hundred unique protocols (116 CT and 184 MRI) were included. A BERT model was trained, validated, and tested using an 80%-10%-10% stratified split. Naive Bayes (NB) and Support Vector Machine (SVM) machine learning models were used as comparators. Models were assessed using F1 score, precision, recall, and area under the receiver operating characteristic curve (AUROC). The BERT model was also assessed for multi-class protocol suggestion and for subgroups based on referral location, modality, and imaging section. Results: BERT was superior to SVM for protocol selection (F1 score: BERT-0.901 vs SVM-0.881). However, it was not significantly different from SVM for triage prediction (F1 score: BERT-0.844 vs SVM-0.845). Both outperformed NB for protocol and triage prediction. BERT had superior performance on minority classes compared to SVM and NB. For multi-class prediction, BERT accuracy was up to 0.991 for top-5 protocol suggestion and 0.981 for top-2 triage suggestion. Emergency department patients had the highest F1 scores for both protocol (0.957) and triage (0.986), compared to inpatients and outpatients. Conclusion: The BERT NLP model demonstrated strong performance in automating the triage and protocol selection of radiology studies, showing potential to enhance radiologist workflows. These findings suggest the feasibility of using advanced NLP models to streamline radiology operations.
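Illustrative aside (not from the study): the top-5 and top-2 accuracies above count a case as correct when the true protocol appears anywhere in the model's k highest-ranked suggestions. A minimal sketch; the protocol codes below are invented:

```python
def top_k_accuracy(true_labels, ranked_predictions, k):
    """Fraction of cases whose true label appears in the model's
    top-k ranked suggestions."""
    hits = sum(t in preds[:k] for t, preds in zip(true_labels, ranked_predictions))
    return hits / len(true_labels)

# Hypothetical protocol codes, ranked by model confidence per requisition.
true_labels = ["ct_head", "mri_brain", "ct_abd"]
ranked = [
    ["ct_head", "ct_cspine", "mri_brain"],
    ["mri_spine", "mri_brain", "ct_head"],
    ["ct_chest", "ct_pelvis", "mri_abd"],
]
print(round(top_k_accuracy(true_labels, ranked, k=1), 2),
      round(top_k_accuracy(true_labels, ranked, k=2), 2))  # → 0.33 0.67
```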
Affiliation(s)
- Jason Yao
- Department of Radiology, McMaster University, Hamilton, ON, Canada

- Abdullah Alabousi
- Department of Radiology, McMaster University, Hamilton, ON, Canada
- St Joseph's Healthcare Hamilton, Hamilton, ON, Canada

- Oleg Mironov
- Department of Radiology, McMaster University, Hamilton, ON, Canada
- St Joseph's Healthcare Hamilton, Hamilton, ON, Canada
8
Gorenstein L, Konen E, Green M, Klang E. Bidirectional Encoder Representations from Transformers in Radiology: A Systematic Review of Natural Language Processing Applications. J Am Coll Radiol 2024; 21:914-941. [PMID: 38302036] [DOI: 10.1016/j.jacr.2024.01.012]
Abstract
INTRODUCTION Bidirectional Encoder Representations from Transformers (BERT), introduced in 2018, has revolutionized natural language processing. Its bidirectional understanding of word context has enabled innovative applications, notably in radiology. This study aimed to assess BERT's influence and applications within the radiologic domain. METHODS Adhering to Preferred Reporting Items for Systematic Reviews and Meta-Analyses guidelines, we conducted a systematic review, searching PubMed for literature on BERT-based models and natural language processing in radiology from January 1, 2018, to February 12, 2023. The search encompassed keywords related to generative models, transformer architecture, and various imaging techniques. RESULTS Of 597 results, 30 met our inclusion criteria; the remainder were unrelated to radiology or did not use BERT-based models. The included studies were retrospective, with 14 published in 2022. The primary focus was on classification and information extraction from radiology reports, with x-rays as the prevalent imaging modality. Specific investigations included automatic CT protocol assignment and deep learning applications in chest x-ray interpretation. CONCLUSION This review underscores the primary application of BERT in radiology for report classification. It also reveals emerging BERT applications for protocol assignment and report generation. As BERT technology advances, we foresee further innovative applications. Its implementation in radiology holds potential for enhancing diagnostic precision, expediting report generation, and optimizing patient care.
Affiliation(s)
- Larisa Gorenstein
- Department of Diagnostic Imaging, Sheba Medical Center, Ramat-Gan, Israel; Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel

- Eli Konen
- Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel; Chair, Department of Diagnostic Imaging, Sheba Medical Center, Ramat-Gan, Israel

- Michael Green
- Department of Diagnostic Imaging, Sheba Medical Center, Ramat-Gan, Israel; Faculty of Medicine, Tel Aviv University, Tel Aviv, Israel

- Eyal Klang
- Icahn School of Medicine at Mount Sinai, New York, New York; and Associate Professor of Radiology, Innovation Center, Sheba Medical Center, Affiliated with Tel Aviv University, Tel Aviv, Israel
9
Hasani AM, Singh S, Zahergivar A, Ryan B, Nethala D, Bravomontenegro G, Mendhiratta N, Ball M, Farhadi F, Malayeri A. Evaluating the performance of Generative Pre-trained Transformer-4 (GPT-4) in standardizing radiology reports. Eur Radiol 2024; 34:3566-3574. [PMID: 37938381] [DOI: 10.1007/s00330-023-10384-x]
Abstract
OBJECTIVE Radiology reporting is an essential component of clinical diagnosis and decision-making. With the advent of advanced artificial intelligence (AI) models like GPT-4 (Generative Pre-trained Transformer 4), there is growing interest in evaluating their potential for optimizing or generating radiology reports. This study aimed to compare the quality and content of radiologist-generated and GPT-4-generated radiology reports. METHODS A comparative study design was employed: a total of 100 anonymized radiology reports were randomly selected and analyzed. Each report was processed by GPT-4, resulting in a corresponding AI-generated report. Quantitative and qualitative analysis techniques were utilized to assess similarities and differences between the two sets of reports. RESULTS The AI-generated reports showed comparable quality to radiologist-generated reports in most categories. Significant differences were observed in clarity (p = 0.027), ease of understanding (p = 0.023), and structure (p = 0.050), favoring the AI-generated reports. AI-generated reports were more concise, with 34.53 fewer words and 174.22 fewer characters on average, but had greater variability in sentence length. Content similarity was high, with an average Cosine Similarity of 0.85, Sequence Matcher Similarity of 0.52, BLEU Score of 0.5008, and BERTScore F1 of 0.8775. CONCLUSION The results of this proof-of-concept study suggest that GPT-4 can be a reliable tool for generating standardized radiology reports, offering potential benefits such as improved efficiency, better communication, and simplified data extraction and analysis. However, limitations and ethical implications must be addressed to ensure the safe and effective implementation of this technology in clinical practice.
CLINICAL RELEVANCE STATEMENT The findings of this study suggest that GPT-4 (Generative Pre-trained Transformer 4), an advanced AI model, has the potential to significantly contribute to the standardization and optimization of radiology reporting, offering improved efficiency and communication in clinical practice. KEY POINTS • Large language model-generated radiology reports exhibited high content similarity and moderate structural resemblance to radiologist-generated reports. • Performance metrics highlighted the strong matching of word selection and order, as well as high semantic similarity between AI and radiologist-generated reports. • Large language model demonstrated potential for generating standardized radiology reports, improving efficiency and communication in clinical settings.
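Illustrative aside (not from the study): the cosine similarity reported above (0.85 on average) compares term vectors built from the two report texts. A minimal bag-of-words sketch over invented report snippets; real evaluations typically use TF-IDF weights or embedding vectors rather than raw counts:

```python
import math
from collections import Counter

def cosine_similarity(text_a, text_b):
    """Cosine similarity between bag-of-words term-count vectors."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

r1 = "no acute intracranial hemorrhage or mass effect"
r2 = "no intracranial hemorrhage mass effect or acute infarct"
print(round(cosine_similarity(r1, r2), 2))  # → 0.94
```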
Affiliation(s)
- Amir M Hasani
- Laboratory of Translation Research, National Heart Blood Lung Institute, NIH, Bethesda, MD, USA

- Shiva Singh
- Radiology & Imaging Sciences Department, Clinical Center, NIH, Bethesda, MD, USA

- Aryan Zahergivar
- Radiology & Imaging Sciences Department, Clinical Center, NIH, Bethesda, MD, USA

- Beth Ryan
- Urology Oncology Branch, National Cancer Institute, NIH, Bethesda, MD, USA

- Daniel Nethala
- Urology Oncology Branch, National Cancer Institute, NIH, Bethesda, MD, USA

- Neil Mendhiratta
- Urology Oncology Branch, National Cancer Institute, NIH, Bethesda, MD, USA

- Mark Ball
- Urology Oncology Branch, National Cancer Institute, NIH, Bethesda, MD, USA

- Faraz Farhadi
- Radiology & Imaging Sciences Department, Clinical Center, NIH, Bethesda, MD, USA

- Ashkan Malayeri
- Radiology & Imaging Sciences Department, Clinical Center, NIH, Bethesda, MD, USA
10
Tay SB, Low GH, Wong GJE, Tey HJ, Leong FL, Li C, Chua MLK, Tan DSW, Thng CH, Tan IBH, Tan RSYC. Use of Natural Language Processing to Infer Sites of Metastatic Disease From Radiology Reports at Scale. JCO Clin Cancer Inform 2024; 8:e2300122. [PMID: 38788166] [PMCID: PMC11371090] [DOI: 10.1200/cci.23.00122]
Abstract
PURPOSE To evaluate natural language processing (NLP) methods to infer metastatic sites from radiology reports. METHODS A set of 4,522 computed tomography (CT) reports of 550 patients with 14 types of cancer was used to fine-tune four clinical large language models (LLMs) for multilabel classification of metastatic sites. We also developed an NLP information extraction (IE) system (on the basis of named entity recognition, assertion status detection, and relation extraction) for comparison. Model performances were measured by F1 scores on a test set and three external validation sets. The best model was used to facilitate analysis of metastatic frequencies in a cohort study of 6,555 patients with 53,838 CT reports. RESULTS The RadBERT, BioBERT, GatorTron-base, and GatorTron-medium LLMs achieved F1 scores of 0.84, 0.87, 0.89, and 0.91, respectively, on the test set. The IE system performed best, achieving an F1 score of 0.93. F1 scores of the IE system by individual cancer type ranged from 0.89 to 0.96. The IE system attained F1 scores of 0.89, 0.83, and 0.81 on external validation sets comprising additional cancer types, positron emission tomography-CT scans, and magnetic resonance imaging scans, respectively. In our cohort study, we found that for colorectal cancer, liver-only metastases were more frequent in de novo stage IV than in recurrent patients (29.7% v 12.2%; P < .001). Conversely, lung-only metastases were more frequent in recurrent than in de novo stage IV patients (17.2% v 7.3%; P < .001). CONCLUSION We developed an IE system that accurately infers metastatic sites in multiple primary cancers from radiology reports. It has explainable methods and performs better than some clinical LLMs. The inferred metastatic phenotypes could enhance cancer research databases and clinical trial matching, and identify potential patients for oligometastatic interventions.
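Illustrative aside (not from the study): multilabel F1 for metastatic-site inference is commonly micro-averaged by pooling true-positive, false-positive, and false-negative counts across all reports. A minimal sketch with invented site labels:

```python
def micro_f1(true_sets, pred_sets):
    """Micro-averaged F1 over multilabel predictions: pool TP/FP/FN
    counts across all cases, then compute 2TP / (2TP + FP + FN)."""
    tp = sum(len(t & p) for t, p in zip(true_sets, pred_sets))
    fp = sum(len(p - t) for t, p in zip(true_sets, pred_sets))
    fn = sum(len(t - p) for t, p in zip(true_sets, pred_sets))
    denom = 2 * tp + fp + fn
    # Convention: no labels anywhere counts as perfect agreement.
    return 2 * tp / denom if denom else 1.0

# Hypothetical metastatic-site labels per CT report.
truth = [{"liver", "lung"}, {"bone"}, set()]
preds = [{"liver"}, {"bone", "brain"}, set()]
print(round(micro_f1(truth, preds), 2))  # → 0.67
```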
Collapse
Affiliation(s)
- See Boon Tay
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore, Singapore
- NUS Yong Loo Lin School of Medicine, Singapore, Singapore
| | - Guat Hwa Low
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore, Singapore
- Data and Computational Science Core, National Cancer Centre Singapore, Singapore, Singapore
| | | | - Han Jieh Tey
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore, Singapore
- Data and Computational Science Core, National Cancer Centre Singapore, Singapore, Singapore
| | - Fun Loon Leong
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore, Singapore
- Data and Computational Science Core, National Cancer Centre Singapore, Singapore, Singapore
| | - Constance Li
- Data and Computational Science Core, National Cancer Centre Singapore, Singapore, Singapore
| | - Melvin Lee Kiang Chua
- Data and Computational Science Core, National Cancer Centre Singapore, Singapore, Singapore
- Singapore Duke-NUS Medical School, Singapore, Singapore
- Division of Radiation Oncology, National Cancer Centre Singapore, Singapore, Singapore
| | - Daniel Shao Weng Tan
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore, Singapore
- Singapore Duke-NUS Medical School, Singapore, Singapore
- Division of Clinical Trials and Epidemiological Sciences, National Cancer Centre Singapore, Singapore, Singapore
| | - Choon Hua Thng
- Singapore Duke-NUS Medical School, Singapore, Singapore
- Division of Oncologic Imaging, National Cancer Centre Singapore, Singapore, Singapore
| | - Iain Bee Huat Tan
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore, Singapore
- Data and Computational Science Core, National Cancer Centre Singapore, Singapore, Singapore
- Singapore Duke-NUS Medical School, Singapore, Singapore
| | - Ryan Shea Ying Cong Tan
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore, Singapore
- Data and Computational Science Core, National Cancer Centre Singapore, Singapore, Singapore
- Singapore Duke-NUS Medical School, Singapore, Singapore
- Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY
| |
Collapse
|
11
|
Lyu D, Wang X, Chen Y, Wang F. Language model and its interpretability in biomedicine: A scoping review. iScience 2024; 27:109334. [PMID: 38495823 PMCID: PMC10940999 DOI: 10.1016/j.isci.2024.109334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/19/2024] Open
Abstract
With advancements in large language models, artificial intelligence (AI) is undergoing a paradigm shift where AI models can be repurposed with minimal effort across various downstream tasks. This provides great promise in learning generally useful representations from biomedical corpora, at scale, which would empower AI solutions in healthcare and biomedical research. Nonetheless, our understanding of how they work, when they fail, and what they are capable of remains underexplored due to their emergent properties. Consequently, there is a need to comprehensively examine the use of language models in biomedicine. This review aims to summarize existing studies of language models in biomedicine and identify topics ripe for future research, along with the technical and analytical challenges with respect to interpretability. We expect this review to help researchers and practitioners better understand the landscape of language models in biomedicine and what methods are available to enhance the interpretability of their models.
Collapse
Affiliation(s)
- Daoming Lyu
- Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
| | - Xingbo Wang
- Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
| | - Yong Chen
- Department of Biostatistics, Epidemiology & Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
| | - Fei Wang
- Institute of Artificial Intelligence for Digital Health, Weill Cornell Medicine, New York, NY, USA
- Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
| |
Collapse
|
12
|
Oeding JF, Yang L, Sanchez-Sotelo J, Camp CL, Karlsson J, Samuelsson K, Pearle AD, Ranawat AS, Kelly BT, Pareek A. A practical guide to the development and deployment of deep learning models for the orthopaedic surgeon: Part III, focus on registry creation, diagnosis, and data privacy. Knee Surg Sports Traumatol Arthrosc 2024; 32:518-528. [PMID: 38426614 DOI: 10.1002/ksa.12085] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 12/13/2023] [Revised: 01/22/2024] [Accepted: 01/23/2024] [Indexed: 03/02/2024]
Abstract
Deep learning is a subset of artificial intelligence (AI) with enormous potential to transform orthopaedic surgery. As has already become evident with the deployment of Large Language Models (LLMs) like ChatGPT (OpenAI Inc.), deep learning can rapidly enter clinical and surgical practices. As such, it is imperative that orthopaedic surgeons acquire a deeper understanding of the technical terminology, capabilities and limitations associated with deep learning models. The focus of this series thus far has been providing surgeons with an overview of the steps needed to implement a deep learning-based pipeline, emphasizing some of the important technical details for surgeons to understand as they encounter, evaluate or lead deep learning projects. However, this series would be remiss without providing practical examples of how deep learning models have begun to be deployed and highlighting the areas where the authors feel deep learning may have the most profound potential. While computer vision applications of deep learning were the focus of Parts I and II, due to the enormous impact that natural language processing (NLP) has had in recent months, NLP-based deep learning models are also discussed in this final part of the series. In this review, three applications that the authors believe can be impacted the most by deep learning but with which many surgeons may not be familiar are discussed: (1) registry construction, (2) diagnostic AI and (3) data privacy. Deep learning-based registry construction will be essential for the development of more impactful clinical applications, with diagnostic AI being one of those applications likely to augment clinical decision-making in the near future. As the applications of deep learning continue to grow, the protection of patient information will become increasingly essential; as such, applications of deep learning to enhance data privacy are likely to become more important than ever before. Level of Evidence: Level IV.
Collapse
Affiliation(s)
- Jacob F Oeding
- School of Medicine, Mayo Clinic Alix School of Medicine, Rochester, Minnesota, USA
- Department of Orthopaedics, Institute of Clinical Sciences, The Sahlgrenska Academy, University of Gothenburg, Gothenburg, Sweden
| | - Linjun Yang
- Orthopedic Surgery Artificial Intelligence Laboratory (OSAIL), Department of Orthopedic Surgery, Mayo Clinic, Rochester, Minnesota, USA
| | | | - Christopher L Camp
- Department of Orthopedic Surgery, Mayo Clinic, Rochester, Minnesota, USA
| | - Jón Karlsson
- Department of Orthopaedics, Sahlgrenska University Hospital, Sahlgrenska Academy, Gothenburg University, Gothenburg, Sweden
| | - Kristian Samuelsson
- Department of Orthopaedics, Sahlgrenska University Hospital, Sahlgrenska Academy, Gothenburg University, Gothenburg, Sweden
| | - Andrew D Pearle
- Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, USA
| | - Anil S Ranawat
- Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, USA
| | - Bryan T Kelly
- Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, USA
| | - Ayoosh Pareek
- Sports Medicine and Shoulder Service, Hospital for Special Surgery, New York, New York, USA
| |
Collapse
|
13
|
Martín-Noguerol T, López-Úbeda P, Pons-Escoda A, Luna A. Natural language processing deep learning models for the differential between high-grade gliomas and metastasis: what if the key is how we report them? Eur Radiol 2024; 34:2113-2120. [PMID: 37665389 DOI: 10.1007/s00330-023-10202-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2023] [Revised: 07/10/2023] [Accepted: 07/20/2023] [Indexed: 09/05/2023]
Abstract
OBJECTIVES The differential between high-grade glioma (HGG) and metastasis remains challenging in common radiological practice. We compare different natural language processing (NLP)-based deep learning models to assist radiologists based on data contained in radiology reports. METHODS This retrospective study included 185 MRI reports between 2010 and 2022 from two different institutions. A total of 117 reports were used for training and 21 were reserved for the validation set, while the rest were used as a test set. We compared the performance of different deep learning models for HGG and metastasis classification. Specifically, a Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), a hybrid version of BiLSTM and CNN, and a radiology-specific Bidirectional Encoder Representations from Transformers (RadBERT) model were used. RESULTS For the classification of MRI reports, the CNN provided the best results among all models tested, showing a macro-averaged precision of 87.32%, a sensitivity of 87.45%, and an F1 score of 87.23%. In addition, our NLP algorithm detected keywords such as tumor, temporal, and lobe to positively classify a radiological report into the HGG or metastasis group. CONCLUSIONS A deep learning model based on a CNN enables radiologists to discriminate between HGG and metastasis based on MRI reports with high precision. This approach should be considered an additional tool in diagnosing these central nervous system lesions. CLINICAL RELEVANCE STATEMENT The use of our NLP model enables radiologists to differentiate between patients with high-grade glioma and metastasis based on their MRI reports and can be used as an additional tool to the conventional image-based approach for this challenging task. KEY POINTS • The differential between high-grade glioma and metastasis is still challenging in common radiological practice.
• Natural language processing (NLP)-based deep learning models can assist radiologists based on data contained in radiology reports. • We have developed and tested a natural language processing model for discriminating between high-grade glioma and metastasis based on MRI reports that shows high precision for this task.
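As a much simpler stand-in for the paper's CNN, a bag-of-words keyword scorer illustrates the text-classification setup the study evaluates: a report's words vote for one class or the other, echoing the keyword signals (tumor, temporal, lobe) the authors observed. The toy reports and keywords below are invented, and the real model learns features rather than counting words.

```python
# Toy bag-of-words classifier for the HGG-vs-metastasis report task.
# This is NOT the paper's CNN; it only illustrates the input/output shape.
from collections import Counter

def train_keyword_scores(reports, labels):
    """Count word frequencies per class from labeled report text."""
    counts = {"HGG": Counter(), "metastasis": Counter()}
    for text, label in zip(reports, labels):
        counts[label].update(text.lower().split())
    return counts

def classify(text, counts):
    """Each word votes by its class-frequency difference."""
    score = sum(counts["HGG"][w] - counts["metastasis"][w]
                for w in text.lower().split())
    return "HGG" if score >= 0 else "metastasis"

reports = ["single temporal lobe mass with necrosis",
           "multiple lesions at grey-white junction"]
labels = ["HGG", "metastasis"]
counts = train_keyword_scores(reports, labels)
print(classify("temporal mass with necrosis", counts))
```

A CNN over word embeddings, as used in the study, replaces these raw counts with learned local n-gram filters, which is why it can generalize beyond exact keyword matches.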
Collapse
Affiliation(s)
| | | | - Albert Pons-Escoda
- Radiology Department, Hospital Universitari de Bellvitge, Barcelona, Spain
| | - Antonio Luna
- Radiology Department, MRI Unit, HT Medica, Carmelo Torres 2, 23007, Jaén, Spain
| |
Collapse
|
14
|
Chien A, Tang H, Jagessar B, Chang KW, Peng N, Nael K, Salamon N. AI-Assisted Summarization of Radiologic Reports: Evaluating GPT3davinci, BARTcnn, LongT5booksum, LEDbooksum, LEDlegal, and LEDclinical. AJNR Am J Neuroradiol 2024; 45:244-248. [PMID: 38238092 DOI: 10.3174/ajnr.a8102] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/28/2023] [Accepted: 11/09/2023] [Indexed: 02/09/2024]
Abstract
BACKGROUND AND PURPOSE The review of clinical reports is an essential part of monitoring disease progression. Synthesizing multiple imaging reports is also important for clinical decisions. It is critical to aggregate information quickly and accurately. Machine learning natural language processing (NLP) models hold promise to address an unmet need for report summarization. MATERIALS AND METHODS We evaluated NLP methods to summarize longitudinal aneurysm reports. A total of 137 clinical reports and 100 PubMed case reports were used in this study. Models were 1) compared against expert-generated summaries using longitudinal imaging notes collected at our institute and 2) compared using publicly accessible PubMed case reports. Five AI models were used to summarize the clinical reports, and a sixth model, the online GPT3davinci NLP large language model (LLM), was added for the summarization of PubMed case reports. We assessed summary quality through comparison with expert summaries using quantitative metrics and quality reviews by experts. RESULTS In clinical summarization, BARTcnn had the best performance (BERTScore = 0.8371), followed by LongT5booksum and LEDlegal. In the analysis using PubMed case reports, GPT3davinci demonstrated the best performance, followed by BARTcnn and then LEDbooksum (BERTScore = 0.894, 0.872, and 0.867, respectively). CONCLUSIONS AI NLP summarization models demonstrated great potential in summarizing longitudinal aneurysm reports, though none yet reached the level of quality required for clinical usage. We found that the online GPT LLM outperformed the others; however, the BARTcnn model is potentially more useful because it can be implemented on-site. Future work to improve summarization, address other types of neuroimaging reports, and develop structured reports may allow NLP models to ease clinical workflow.
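The candidate-versus-reference comparison behind metrics such as BERTScore can be illustrated with a plain token-overlap F1. BERTScore itself matches contextual embeddings rather than surface tokens, so this stdlib proxy is only a sketch of the evaluation setup; the example summaries are invented.

```python
# Token-overlap F1 between a model summary and an expert reference.
# BERTScore (used in the study) matches contextual embeddings instead;
# this proxy only illustrates the candidate-vs-reference comparison.
from collections import Counter

def token_f1(candidate: str, reference: str) -> float:
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())      # precision over candidate tokens
    r = overlap / sum(ref.values())       # recall over reference tokens
    return 2 * p * r / (p + r)

expert = "stable 4 mm left ICA aneurysm no interval growth"
model = "left ICA aneurysm stable at 4 mm"
print(round(token_f1(model, expert), 2))
```

Embedding-based metrics exist precisely because token overlap penalizes valid paraphrases ("no interval growth" vs. "unchanged"), which matters when comparing abstractive summarizers like BARTcnn and GPT3davinci.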
Collapse
Affiliation(s)
- Aichi Chien
- From the Department of Radiological Science (A.C., H.T., B.J., K.N., N.S.), David Geffen School of Medicine at UCLA, Los Angeles, California
| | - Hubert Tang
| | - Bhavita Jagessar
| | - Kai-Wei Chang
- Department of Computer Science (K.C., N.P.), University of California, Los Angeles, Los Angeles, California
| | - Nanyun Peng
| | - Kambiz Nael
| | - Noriko Salamon
| |
Collapse
|
15
|
Chae A, Yao MS, Sagreiya H, Goldberg AD, Chatterjee N, MacLean MT, Duda J, Elahi A, Borthakur A, Ritchie MD, Rader D, Kahn CE, Witschey WR, Gee JC. Strategies for Implementing Machine Learning Algorithms in the Clinical Practice of Radiology. Radiology 2024; 310:e223170. [PMID: 38259208 PMCID: PMC10831483 DOI: 10.1148/radiol.223170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/09/2022] [Revised: 08/24/2023] [Accepted: 08/29/2023] [Indexed: 01/24/2024]
Abstract
Despite recent advancements in machine learning (ML) applications in health care, there have been few benefits and improvements to clinical medicine in the hospital setting. To facilitate clinical adaptation of methods in ML, this review proposes a standardized framework for the step-by-step implementation of artificial intelligence into the clinical practice of radiology that focuses on three key components: problem identification, stakeholder alignment, and pipeline integration. A review of the recent literature and empirical evidence in radiologic imaging applications justifies this approach and offers a discussion on structuring implementation efforts to help other hospital practices leverage ML to improve patient care. Clinical trial registration no. 04242667 © RSNA, 2024 Supplemental material is available for this article.
Collapse
Affiliation(s)
| | | | - Hersh Sagreiya
- From the Departments of Bioengineering (M.S.Y.), Radiology (H.S.,
N.C., M.T.M., J.D., A.B., C.E.K., W.R.W., J.C.G.), Genetics (M.D.R.), and
Medicine (D.R.), Perelman School of Medicine (A.C., M.S.Y., H.S., A.B., C.E.K.,
W.R.W., J.C.G.), University of Pennsylvania, 3400 Civic Center Blvd,
Philadelphia, PA 19104; Department of Radiology, Loyola University Medical
Center, Maywood, Ill (A.D.G.); Department of Information Services, University of
Pennsylvania, Philadelphia, Pa (A.E.); and Leonard Davis Institute of Health
Economics, University of Pennsylvania, Philadelphia, Pa (A.B.)
| | - Ari D. Goldberg
| | - Neil Chatterjee
| | - Matthew T. MacLean
| | - Jeffrey Duda
| | - Ameena Elahi
| | - Arijitt Borthakur
| | - Marylyn D. Ritchie
| | - Daniel Rader
| | - Charles E. Kahn
| | | | | |
Collapse
|
16
|
Bhayana R. Chatbots and Large Language Models in Radiology: A Practical Primer for Clinical and Research Applications. Radiology 2024; 310:e232756. [PMID: 38226883 DOI: 10.1148/radiol.232756] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2024]
Abstract
Although chatbots have existed for decades, the emergence of transformer-based large language models (LLMs) has captivated the world through the most recent wave of artificial intelligence chatbots, including ChatGPT. Transformers are a type of neural network architecture that enables better contextual understanding of language and efficient training on massive amounts of unlabeled data, such as unstructured text from the internet. As LLMs have increased in size, their improved performance and emergent abilities have revolutionized natural language processing. Since language is integral to human thought, applications based on LLMs have transformative potential in many industries. In fact, LLM-based chatbots have demonstrated human-level performance on many professional benchmarks, including in radiology. LLMs offer numerous clinical and research applications in radiology, several of which have been explored in the literature with encouraging results. Multimodal LLMs can simultaneously interpret text and images to generate reports, closely mimicking current diagnostic pathways in radiology. Thus, from requisition to report, LLMs have the opportunity to positively impact nearly every step of the radiology journey. Yet, these impressive models are not without limitations. This article reviews the limitations of LLMs and mitigation strategies, as well as potential uses of LLMs, including multimodal models. Also reviewed are existing LLM-based applications that can enhance efficiency in supervised settings.
Collapse
Affiliation(s)
- Rajesh Bhayana
- From University Medical Imaging Toronto, Joint Department of Medical Imaging, University Health Network, Mount Sinai Hospital, and Women's College Hospital, University of Toronto, Toronto General Hospital, 200 Elizabeth St, Peter Munk Bldg, 1st Fl, Toronto, ON, Canada M5G 24C
| |
Collapse
|
17
|
Bell LC, Shimron E. Sharing Data Is Essential for the Future of AI in Medical Imaging. Radiol Artif Intell 2024; 6:e230337. [PMID: 38231036 PMCID: PMC10831510 DOI: 10.1148/ryai.230337] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2023] [Revised: 11/16/2023] [Accepted: 11/20/2023] [Indexed: 01/18/2024]
Abstract
If we want artificial intelligence to succeed in radiology, we must share data and learn how to share data.
Collapse
Affiliation(s)
- Laura C. Bell
- From the Clinical Imaging Group, Genentech, 1 DNA Way, South San
Francisco, CA 94080 (L.C.B.); and Department of Electrical and Computer
Engineering and Department of Biomedical Engineering, Technion-Israel Institute
of Technology, Haifa, Israel (E.S.)
| | - Efrat Shimron
| |
Collapse
|
18
|
dos Santos DP, Kotter E, Mildenberger P, Martí-Bonmatí L. ESR paper on structured reporting in radiology-update 2023. Insights Imaging 2023; 14:199. [PMID: 37995019 PMCID: PMC10667169 DOI: 10.1186/s13244-023-01560-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2023] [Accepted: 10/03/2023] [Indexed: 11/24/2023] Open
Abstract
Structured reporting in radiology continues to hold substantial potential to improve the quality of service provided to patients and referring physicians. Despite many physicians' preference for structured reports and various efforts by radiological societies and some vendors, structured reporting has still not been widely adopted in clinical routine. While in many countries national radiological societies have launched initiatives to further promote structured reporting, cross-institutional applications of report templates and incentives for usage of structured reporting are lacking. Various legislative measures have been taken in the USA and the European Union to promote interoperable data formats such as Fast Healthcare Interoperability Resources (FHIR) in the context of the EU Health Data Space (EHDS), which will certainly be relevant for the future of structured reporting. Lastly, recent advances in artificial intelligence and large language models may provide innovative and efficient approaches to integrate structured reporting more seamlessly into the radiologists' workflow. The ESR will remain committed to advancing structured reporting as a key component towards more value-based radiology. Practical solutions for structured reporting need to be provided by vendors. Policy makers should incentivize the usage of structured radiological reporting, especially in cross-institutional settings. Critical relevance statement: The benefits of structured reporting in radiology have been widely discussed and agreed upon over the past years; however, implementation in clinical routine is lacking. Policy makers should incentivize the usage of structured radiological reporting, especially in cross-institutional settings. Key points: 1. Various national societies have established initiatives for structured reporting in radiology. 2. Almost no monetary or structural incentives exist that favor structured reporting. 3. A consensus on technical standards for structured reporting is still missing. 4. The application of large language models may help structure radiological reports. 5. Policy makers should incentivize the usage of structured radiological reporting.
Collapse
|
19
|
Lu X, Chang EY, Du J, Yan A, McAuley J, Gentili A, Hsu CN. Robust Multi-View Fracture Detection in the Presence of Other Abnormalities Using HAMIL-Net. Mil Med 2023; 188:590-597. [PMID: 37948284 DOI: 10.1093/milmed/usad252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2022] [Revised: 03/31/2023] [Accepted: 06/26/2023] [Indexed: 11/12/2023] Open
Abstract
INTRODUCTION Foot and ankle fractures are the most common military health problem. Automated diagnosis can save time and personnel. It is crucial that fracture detection not only distinguishes fractures from normal healthy cases but is also robust to the presence of other orthopedic pathologies. Artificial intelligence (AI) deep learning has been shown to be promising. Previously, we developed HAMIL-Net to automatically detect orthopedic injuries in upper extremity injuries. In this research, we investigated the performance of HAMIL-Net for detecting foot and ankle fractures in the presence of other abnormalities. MATERIALS AND METHODS HAMIL-Net is a novel deep neural network consisting of a hierarchical attention layer followed by a multiple-instance learning layer. This design allows it to handle imaging studies with multiple views. We used 148K musculoskeletal imaging studies of 51K Veterans at VA San Diego over the past 20 years to create datasets for this research. We annotated each study with a semi-automated pipeline that leveraged radiology reports written by board-certified radiologists, extracted findings with a natural language processing tool, and manually validated the annotations. RESULTS HAMIL-Net can be trained with study-level, multiple-view examples and detects foot and ankle fractures with a 0.87 area under the receiver operating characteristic curve, but performance dropped when tested on cases including other abnormalities. By integrating a fracture-specialized model with one that detects a broad range of abnormalities, HAMIL-Net's accuracy in detecting any abnormality improved from 0.53 to 0.77 and its F-score from 0.46 to 0.86. We also report HAMIL-Net's performance for different study types, including for young (age 18-35) patients. CONCLUSIONS Automated fracture detection is promising, but to deliver its full benefit in clinical use, the presence of other abnormalities must be considered.
Our results with HAMIL-Net showed that considering other abnormalities improved fracture detection and allowed for incidental findings of other musculoskeletal abnormalities pertinent to or superimposed on fractures.
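The multiple-instance aggregation step that lets HAMIL-Net make one study-level call from several radiographic views can be sketched as softmax-attention pooling over per-view scores. The scores and attention logits below are hand-picked for illustration; the real model learns both end to end from images.

```python
# Sketch of attention-based multiple-instance pooling over per-view
# scores, the kind of aggregation a MIL layer performs. All numbers
# are invented for illustration.
import math

def attention_pool(view_scores, attn_logits):
    """Combine per-view fracture scores into one study-level score."""
    exps = [math.exp(a) for a in attn_logits]
    total = sum(exps)
    weights = [e / total for e in exps]          # softmax attention
    return sum(w * s for w, s in zip(weights, view_scores))

# Three views of one study: only the second view shows the fracture,
# and the attention logits favor that informative view.
scores = [0.1, 0.9, 0.2]   # per-view fracture probabilities
logits = [0.0, 2.0, 0.0]   # learned attention, here hand-set
print(round(attention_pool(scores, logits), 3))
```

The attention weights are what make such models partially inspectable: a high weight indicates which view drove the study-level decision.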
Affiliation(s)
- Xing Lu
- University of California, San Diego, La Jolla, CA 92093, USA
- Eric Y Chang
- University of California, San Diego, La Jolla, CA 92093, USA
- VA San Diego Healthcare System, San Diego, CA 92161, USA
- Jiang Du
- University of California, San Diego, La Jolla, CA 92093, USA
- An Yan
- University of California, San Diego, La Jolla, CA 92093, USA
- Julian McAuley
- University of California, San Diego, La Jolla, CA 92093, USA
- Amilcare Gentili
- University of California, San Diego, La Jolla, CA 92093, USA
- VA San Diego Healthcare System, San Diego, CA 92161, USA
- Chun-Nan Hsu
- University of California, San Diego, La Jolla, CA 92093, USA
- VA San Diego Healthcare System, San Diego, CA 92161, USA
- VA National Artificial Intelligence Institute, Washington, DC 20422, USA
20
Kim M, Ong KTI, Choi S, Yeo J, Kim S, Han K, Park JE, Kim HS, Choi YS, Ahn SS, Kim J, Lee SK, Sohn B. Natural language processing to predict isocitrate dehydrogenase genotype in diffuse glioma using MR radiology reports. Eur Radiol 2023; 33:8017-8025. [PMID: 37566271 DOI: 10.1007/s00330-023-10061-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2023] [Revised: 05/18/2023] [Accepted: 06/22/2023] [Indexed: 08/12/2023]
Abstract
OBJECTIVES To evaluate the performance of natural language processing (NLP) models to predict isocitrate dehydrogenase (IDH) mutation status in diffuse glioma using routine MR radiology reports. MATERIALS AND METHODS This retrospective, multi-center study included consecutive patients with diffuse glioma with known IDH mutation status from May 2009 to November 2021 whose initial MR radiology report was available prior to pathologic diagnosis. Five NLP models (long short-term memory [LSTM], bidirectional LSTM, bidirectional encoder representations from transformers [BERT], BERT graph convolutional network [GCN], BioBERT) were trained, and area under the receiver operating characteristic curve (AUC) was assessed to validate prediction of IDH mutation status in the internal and external validation sets. The performance of the best-performing NLP model was compared with that of human readers. RESULTS A total of 1427 patients (mean age ± standard deviation, 54 ± 15 years; 779 men, 54.6%) were included: 720 patients in the training set, 180 in the internal validation set, and 527 in the external validation set. In the external validation set, BERT GCN showed the highest performance (AUC 0.85, 95% CI 0.81-0.89) in predicting IDH mutation status, higher than LSTM (AUC 0.77, 95% CI 0.72-0.81; p = .003) and BioBERT (AUC 0.81, 95% CI 0.76-0.85; p = .03). It was also higher than that of a neuroradiologist (AUC 0.80, 95% CI 0.76-0.84; p = .005) and a neurosurgeon (AUC 0.79, 95% CI 0.76-0.84; p = .04). CONCLUSION BERT GCN was externally validated to predict IDH mutation status in patients with diffuse glioma using routine MR radiology reports, with performance superior or at least comparable to that of human readers.
CLINICAL RELEVANCE STATEMENT Natural language processing may be used to extract relevant information from routine radiology reports to predict cancer genotype and provide prognostic information that may aid in guiding treatment strategy and enabling personalized medicine. KEY POINTS • A transformer-based natural language processing (NLP) model predicted isocitrate dehydrogenase mutation status in diffuse glioma with an AUC of 0.85 in the external validation set. • The best NLP models were superior or at least comparable to human readers in both internal and external validation sets. • Transformer-based models showed higher performance than conventional NLP models such as long short-term memory.
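The AUC values compared throughout this abstract measure the probability that a model ranks a randomly chosen IDH-mutant case above a randomly chosen wild-type case. A minimal, self-contained illustration of the metric (toy labels and scores, not study data):

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney U statistic:
    the probability that a randomly chosen positive case is scored
    higher than a randomly chosen negative case (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Perfect separation of positives from negatives yields 1.0;
# chance-level ranking yields about 0.5.
print(auc([1, 1, 0, 0], [0.9, 0.8, 0.3, 0.1]))  # 1.0
```

This pairwise-ranking view is also why AUC is insensitive to class imbalance, a useful property when mutation prevalence differs between cohorts.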
Affiliation(s)
- Minjae Kim
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Korea
- Kai Tzu-Iunn Ong
- Department of Artificial Intelligence, College of Computing, Yonsei University, Seoul, Korea
- Seonah Choi
- Department of Neurosurgery, Brain Tumor Center, Severance Hospital, Yonsei University College of Medicine, Seoul, Korea
- Jinyoung Yeo
- Department of Artificial Intelligence, College of Computing, Yonsei University, Seoul, Korea
- Sooyon Kim
- Department of Statistics and Data Science, Yonsei University, Seoul, Korea
- Kyunghwa Han
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Ji Eun Park
- Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Korea
- Ho Sung Kim
- Department of Radiology and Research Institute of Radiology, University of Ulsan College of Medicine, Asan Medical Center, Seoul, Korea
- Yoon Seong Choi
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Sung Soo Ahn
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Jinna Kim
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Seung-Koo Lee
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Beomseok Sohn
- Department of Radiology and Research Institute of Radiological Science and Center for Clinical Imaging Data Science, Yonsei University College of Medicine, Seoul, Korea
- Department of Radiology and Center for Imaging Sciences, Samsung Medical Center, Sungkyunkwan University School of Medicine, Seoul, Korea
21
Tejani AS. To BERT or not to BERT: advancing non-invasive prediction of tumor biomarkers using transformer-based natural language processing (NLP). Eur Radiol 2023; 33:8014-8016. [PMID: 37740083 DOI: 10.1007/s00330-023-10224-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2023] [Revised: 08/27/2023] [Accepted: 08/29/2023] [Indexed: 09/24/2023]
Affiliation(s)
- Ali S Tejani
- Department of Radiology, The University of Texas Southwestern Medical Center, 5323 Harry Hines Blvd, Dallas, TX, 75390, USA
22
Tan RSYC, Lin Q, Low GH, Lin R, Goh TC, Chang CCE, Lee FF, Chan WY, Tan WC, Tey HJ, Leong FL, Tan HQ, Nei WL, Chay WY, Tai DWM, Lai GGY, Cheng LTE, Wong FY, Chua MCH, Chua MLK, Tan DSW, Thng CH, Tan IBH, Ng HT. Inferring cancer disease response from radiology reports using large language models with data augmentation and prompting. J Am Med Inform Assoc 2023; 30:1657-1664. [PMID: 37451682 PMCID: PMC10531105 DOI: 10.1093/jamia/ocad133] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2023] [Revised: 06/27/2023] [Accepted: 07/04/2023] [Indexed: 07/18/2023] Open
Abstract
OBJECTIVE To assess large language models on their ability to accurately infer cancer disease response from free-text radiology reports. MATERIALS AND METHODS We assembled 10 602 computed tomography reports from cancer patients seen at a single institution. All reports were classified into: no evidence of disease, partial response, stable disease, or progressive disease. We applied transformer models, a bidirectional long short-term memory model, a convolutional neural network model, and conventional machine learning methods to this task. Data augmentation using sentence permutation with consistency loss, as well as prompt-based fine-tuning, was used on the best-performing models. Models were validated on a hold-out test set and an external validation set based on Response Evaluation Criteria in Solid Tumors (RECIST) classifications. RESULTS The best-performing model was the GatorTron transformer, which achieved an accuracy of 0.8916 on the test set and 0.8919 on the RECIST validation set. Data augmentation further improved the accuracy to 0.8976. Prompt-based fine-tuning did not further improve accuracy but was able to reduce the number of training reports to 500 while still achieving good performance. DISCUSSION These models could be used by researchers to derive progression-free survival in large datasets. They may also serve as a decision support tool by providing clinicians with an automated second opinion of disease response. CONCLUSIONS Large clinical language models demonstrate potential to infer cancer disease response from radiology reports at scale. Data augmentation techniques are useful to further improve performance. Prompt-based fine-tuning can significantly reduce the size of the training dataset.
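The sentence-permutation augmentation described in this abstract can be sketched as follows. This is a simplified illustration with a made-up report; it omits the consistency-loss component used during training.

```python
import random

def permute_sentences(report, n_augment, seed=0):
    """Create augmented copies of a report by shuffling its sentences.
    A report's disease-response label should not depend on sentence
    order, so each shuffled copy keeps the original label."""
    rng = random.Random(seed)  # fixed seed for reproducible augmentation
    sentences = [s.strip() for s in report.split(".") if s.strip()]
    out = []
    for _ in range(n_augment):
        shuffled = sentences[:]
        rng.shuffle(shuffled)
        out.append(". ".join(shuffled) + ".")
    return out

# Hypothetical report text, not from the study's dataset.
report = "Liver lesion is stable. No new lesions. Mild ascites."
for aug in permute_sentences(report, 2):
    print(aug)
```

Real reports need a proper sentence splitter (abbreviations like "Dr." break naive period-splitting), but the order-invariance idea is the same.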
Affiliation(s)
- Ryan Shea Ying Cong Tan
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore
- Duke-NUS Medical School, Singapore
- Qian Lin
- Department of Computer Science, National University of Singapore, Singapore
- Guat Hwa Low
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore
- Ruixi Lin
- Department of Computer Science, National University of Singapore, Singapore
- Tzer Chew Goh
- Institute of Systems Science, National University of Singapore, Singapore
- Fung Fung Lee
- Institute of Systems Science, National University of Singapore, Singapore
- Wei Yin Chan
- Institute of Systems Science, National University of Singapore, Singapore
- Wei Chong Tan
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore
- Duke-NUS Medical School, Singapore
- Han Jieh Tey
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore
- Fun Loon Leong
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore
- Hong Qi Tan
- Division of Radiation Oncology, National Cancer Centre Singapore, Singapore
- Wen Long Nei
- Division of Radiation Oncology, National Cancer Centre Singapore, Singapore
- Wen Yee Chay
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore
- Duke-NUS Medical School, Singapore
- David Wai Meng Tai
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore
- Duke-NUS Medical School, Singapore
- Gillianne Geet Yi Lai
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore
- Duke-NUS Medical School, Singapore
- Lionel Tim-Ee Cheng
- Duke-NUS Medical School, Singapore
- Department of Diagnostic Radiology, Singapore General Hospital, Singapore
- Fuh Yong Wong
- Division of Radiation Oncology, National Cancer Centre Singapore, Singapore
- Melvin Lee Kiang Chua
- Duke-NUS Medical School, Singapore
- Division of Radiation Oncology, National Cancer Centre Singapore, Singapore
- Data and Computational Science Core, National Cancer Centre Singapore, Singapore
- Daniel Shao Weng Tan
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore
- Division of Clinical Trials and Epidemiological Sciences, National Cancer Centre Singapore, Singapore
- Choon Hua Thng
- Duke-NUS Medical School, Singapore
- Division of Oncologic Imaging, National Cancer Centre Singapore, Singapore
- Iain Bee Huat Tan
- Division of Medical Oncology, National Cancer Centre Singapore, Singapore
- Duke-NUS Medical School, Singapore
- Data and Computational Science Core, National Cancer Centre Singapore, Singapore
- Hwee Tou Ng
- Department of Computer Science, National University of Singapore, Singapore
23
Barrington NM, Gupta N, Musmar B, Doyle D, Panico N, Godbole N, Reardon T, D’Amico RS. A Bibliometric Analysis of the Rise of ChatGPT in Medical Research. Med Sci (Basel) 2023; 11:61. [PMID: 37755165 PMCID: PMC10535733 DOI: 10.3390/medsci11030061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/14/2023] [Revised: 09/04/2023] [Accepted: 09/11/2023] [Indexed: 09/28/2023] Open
Abstract
The rapid emergence of publicly accessible artificial intelligence platforms such as large language models (LLMs) has led to an equally rapid increase in articles exploring their potential benefits and risks. We performed a bibliometric analysis of ChatGPT literature in medicine and science to better understand publication trends and knowledge gaps. Following title, abstract, and keyword searches of PubMed, Embase, Scopus, and Web of Science databases for ChatGPT articles published in the medical field, articles were screened for inclusion and exclusion criteria. Data were extracted from included articles, with citation counts obtained from PubMed and journal metrics obtained from Clarivate Journal Citation Reports. After screening, 267 articles were included in the study, most of which were editorials or correspondence with an average of 7.5 ± 18.4 citations per publication. Published articles on ChatGPT were authored largely in the United States, India, and China. The topics discussed included use and accuracy of ChatGPT in research, medical education, and patient counseling. Among non-surgical specialties, radiology published the most ChatGPT-related articles, while plastic surgery published the most articles among surgical specialties. The average citation number among the top 20 most-cited articles was 60.1 ± 35.3. Among journals with the most ChatGPT-related publications, there were on average 10 ± 3.7 publications. Our results suggest that managing the inevitable ethical and safety issues that arise with the implementation of LLMs will require further research exploring the capabilities and accuracy of ChatGPT, to generate policies guiding the adoption of artificial intelligence in medicine and science.
Affiliation(s)
- Nikki M. Barrington
- Chicago Medical School, Rosalind Franklin University, North Chicago, IL 60064, USA
- Nithin Gupta
- School of Osteopathic Medicine, Campbell University, Lillington, NC 27546, USA
- Basel Musmar
- Faculty of Medicine and Health Sciences, An-Najah National University, Nablus P.O. Box 7, West Bank, Palestine
- David Doyle
- Central Michigan College of Medicine, Mount Pleasant, MI 48858, USA
- Nicholas Panico
- Lake Erie College of Osteopathic Medicine, Erie, PA 16509, USA
- Nikhil Godbole
- School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Taylor Reardon
- Department of Neurology, Henry Ford Hospital, Detroit, MI 48202, USA
- Randy S. D’Amico
- Department of Neurosurgery, Lenox Hill Hospital, New York, NY 10075, USA
24
Bernstein IA, Zhang Y(V), Govil D, Majid I, Chang RT, Sun Y, Shue A, Chou JC, Schehlein E, Christopher KL, Groth SL, Ludwig C, Wang SY. Comparison of Ophthalmologist and Large Language Model Chatbot Responses to Online Patient Eye Care Questions. JAMA Netw Open 2023; 6:e2330320. [PMID: 37606922 PMCID: PMC10445188 DOI: 10.1001/jamanetworkopen.2023.30320] [Citation(s) in RCA: 38] [Impact Index Per Article: 38.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 05/22/2023] [Accepted: 07/13/2023] [Indexed: 08/23/2023] Open
Abstract
Importance Large language models (LLMs) like ChatGPT appear capable of performing a variety of tasks, including answering patient eye care questions, but have not yet been evaluated in direct comparison with ophthalmologists. It remains unclear whether LLM-generated advice is accurate, appropriate, and safe for eye patients. Objective To evaluate the quality of ophthalmology advice generated by an LLM chatbot in comparison with ophthalmologist-written advice. Design, Setting, and Participants This cross-sectional study used deidentified data from an online medical forum, in which patient questions received responses written by American Academy of Ophthalmology (AAO)-affiliated ophthalmologists. A masked panel of 8 board-certified ophthalmologists were asked to distinguish between answers generated by the ChatGPT chatbot and human answers. Posts were dated between 2007 and 2016; data were accessed January 2023 and analysis was performed between March and May 2023. Main Outcomes and Measures Identification of chatbot and human answers on a 4-point scale (likely or definitely artificial intelligence [AI] vs likely or definitely human) and evaluation of responses for presence of incorrect information, alignment with perceived consensus in the medical community, likelihood to cause harm, and extent of harm. Results A total of 200 pairs of user questions and answers by AAO-affiliated ophthalmologists were evaluated. The mean (SD) accuracy for distinguishing between AI and human responses was 61.3% (9.7%). Of 800 evaluations of chatbot-written answers, 168 answers (21.0%) were marked as human-written, while 517 of 800 human-written answers (64.6%) were marked as AI-written. Compared with human answers, chatbot answers were more frequently rated as probably or definitely written by AI (prevalence ratio [PR], 1.72; 95% CI, 1.52-1.93). 
The likelihood of chatbot answers containing incorrect or inappropriate material was comparable with that of human answers (PR, 0.92; 95% CI, 0.77-1.10), and chatbot answers did not differ from human answers in likelihood of harm (PR, 0.84; 95% CI, 0.67-1.07) or extent of harm (PR, 0.99; 95% CI, 0.80-1.22). Conclusions and Relevance In this cross-sectional study of human-written and AI-generated responses to 200 eye care questions from an online advice forum, a chatbot appeared capable of responding to long user-written eye health posts and largely generated appropriate responses that did not differ significantly from ophthalmologist-written responses in terms of incorrect information, likelihood of harm, extent of harm, or deviation from ophthalmologist community standards. Additional research is needed to assess patient attitudes toward LLM-augmented ophthalmologists vs fully autonomous AI content generation, to evaluate clarity and acceptability of LLM-generated answers from the patient perspective, to test the performance of LLMs in a greater variety of clinical contexts, and to determine an optimal manner of utilizing LLMs that is ethical and minimizes harm.
Affiliation(s)
- Isaac A. Bernstein
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Stanford, California
- Youchen (Victor) Zhang
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Stanford, California
- Devendra Govil
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Stanford, California
- Iyad Majid
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Stanford, California
- Robert T. Chang
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Stanford, California
- Yang Sun
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Stanford, California
- Ann Shue
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Stanford, California
- Jonathan C. Chou
- Department of Ophthalmology, Kaiser Permanente San Francisco, San Francisco, California
- Sylvia L. Groth
- Department of Ophthalmology and Visual Sciences, Vanderbilt Eye Institute, Nashville, Tennessee
- Cassie Ludwig
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Stanford, California
- Sophia Y. Wang
- Department of Ophthalmology, Byers Eye Institute, Stanford University, Stanford, California
25
Oh JH, Tannenbaum A, Deasy JO. Improved prediction of drug-induced liver injury literature using natural language processing and machine learning methods. Front Genet 2023; 14:1161047. [PMID: 37529777 PMCID: PMC10390074 DOI: 10.3389/fgene.2023.1161047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2023] [Accepted: 06/29/2023] [Indexed: 08/03/2023] Open
Abstract
Drug-induced liver injury (DILI) is an adverse hepatic drug reaction that can potentially lead to life-threatening liver failure. Previously published work in the scientific literature on DILI has provided valuable insights for the understanding of hepatotoxicity as well as drug development. However, the manual search of scientific literature in PubMed is laborious and time-consuming. Natural language processing (NLP) techniques along with artificial intelligence/machine learning approaches may allow for automatic processing in identifying DILI-related literature, but useful methods are yet to be demonstrated. To address this issue, we have developed an integrated NLP/machine learning classification model to identify DILI-related literature using only paper titles and abstracts. For prediction modeling, we used 14,203 publications provided by the Critical Assessment of Massive Data Analysis (CAMDA) challenge, employing word vectorization techniques in NLP in conjunction with machine learning methods. Classification modeling was performed using 2/3 of the data for training and the remainder for test in internal validation. The best performance was achieved using a linear support vector machine (SVM) model on the combined vectors derived from term frequency-inverse document frequency (TF-IDF) and Word2Vec, resulting in an accuracy of 95.0% and an F1-score of 95.0%. The final SVM model constructed from all 14,203 publications was tested on independent datasets, resulting in accuracies of 92.5%, 96.3%, and 98.3%, and F1-scores of 93.5%, 86.1%, and 75.6% for three test sets (T1-T3). Furthermore, the SVM model was tested on four external validation sets (V1-V4), resulting in accuracies of 92.0%, 96.2%, 98.3%, and 93.1%, and F1-scores of 92.4%, 82.9%, 75.0%, and 93.3%.
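The TF-IDF features underlying the abstract's best-performing SVM can be illustrated in miniature. This sketch uses naive whitespace tokenization and invented toy documents; the study itself combined such vectors with Word2Vec embeddings and trained a linear SVM on the result.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Compute TF-IDF weights for a small corpus of titles/abstracts.
    Terms that appear in many documents (high document frequency) are
    down-weighted; terms concentrated in few documents score higher."""
    tokenized = [doc.lower().split() for doc in docs]
    n = len(tokenized)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log(n / df[t]) for t in df}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (tf[t] / len(toks)) * idf[t] for t in tf})
    return vectors

# Toy corpus: "liver" appears in two documents, so it is weighted
# lower than "drug", which appears in only one.
docs = ["drug induced liver injury", "liver biopsy findings", "renal cyst"]
vecs = tfidf_vectors(docs)
```

A production pipeline would add stop-word removal, sublinear TF scaling, and IDF smoothing, but the core weighting shown here is what makes discriminative DILI vocabulary stand out from generic biomedical terms.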
Affiliation(s)
- Jung Hun Oh
- Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Allen Tannenbaum
- Department of Computer Science, Stony Brook University, Stony Brook, NY, United States
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY, United States
- Joseph O. Deasy
- Department of Medical Physics, Memorial Sloan Kettering Cancer Center, New York, NY, United States
26
Karabacak M, Margetis K. Embracing Large Language Models for Medical Applications: Opportunities and Challenges. Cureus 2023; 15:e39305. [PMID: 37378099 PMCID: PMC10292051 DOI: 10.7759/cureus.39305] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/21/2023] [Indexed: 06/29/2023] Open
Abstract
Large language models (LLMs) have the potential to revolutionize the field of medicine by, among other applications, improving diagnostic accuracy and supporting clinical decision-making. However, the successful integration of LLMs in medicine requires addressing challenges and considerations specific to the medical domain. This viewpoint article provides a comprehensive overview of key aspects for the successful implementation of LLMs in medicine, including transfer learning, domain-specific fine-tuning, domain adaptation, reinforcement learning with expert input, dynamic training, interdisciplinary collaboration, education and training, evaluation metrics, clinical validation, ethical considerations, data privacy, and regulatory frameworks. By adopting a multifaceted approach and fostering interdisciplinary collaboration, LLMs can be developed, validated, and integrated into medical practice responsibly, effectively, and ethically, addressing the needs of various medical disciplines and diverse patient populations. Ultimately, this approach will ensure that LLMs enhance patient care and improve overall health outcomes for all.
Affiliation(s)
- Mert Karabacak
- Neurological Surgery, Mount Sinai Health System, New York, USA
27
Chng SY, Tern PJW, Kan MRX, Cheng LTE. Automated labelling of radiology reports using natural language processing: Comparison of traditional and newer methods. HEALTH CARE SCIENCE 2023; 2:120-128. [PMID: 38938764 PMCID: PMC11080679 DOI: 10.1002/hcs2.40] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/23/2022] [Revised: 01/31/2023] [Accepted: 02/23/2023] [Indexed: 06/29/2024]
Abstract
Automated labelling of radiology reports using natural language processing allows for the labelling of ground truth for large datasets of radiological studies that are required for training of computer vision models. This paper explains the necessary data preprocessing steps, reviews the main methods for automated labelling and compares their performance. There are four main methods of automated labelling, namely: (1) rules-based text-matching algorithms, (2) conventional machine learning models, (3) neural network models and (4) Bidirectional Encoder Representations from Transformers (BERT) models. Rules-based labellers perform a brute-force search against manually curated keywords and are able to achieve high F1 scores. However, they require proper handling of negative words. Machine learning models require preprocessing that involves tokenization and vectorization of text into numerical vectors. Multilabel classification approaches are required in labelling radiology reports, and conventional models can achieve good performance if they have large enough training sets. Deep learning models make use of connected neural networks, often a long short-term memory network, and are similarly able to achieve good performance if trained on a large dataset. BERT is a transformer-based model that utilizes attention. Pretrained BERT models only require fine-tuning with small datasets. In particular, domain-specific BERT models can achieve superior performance compared with the other methods for automated labelling.
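The review's point about negation handling in rules-based labellers can be made concrete with a toy matcher. The keyword set, negation list, and window size below are illustrative assumptions, far simpler than the curated lexicons real labellers use.

```python
# Toy negation list; production labellers use much larger lexicons
# and regular-expression patterns.
NEGATIONS = {"no", "without", "negative"}

def label_report(report, keywords, window=3):
    """Toy rules-based labeller: flag a report positive when a keyword
    appears and is not preceded by a negation word within `window`
    tokens. Without the negation check, 'No acute fracture' would be
    wrongly labelled positive, which is exactly the failure mode the
    review warns about."""
    tokens = report.lower().replace(".", " ").replace(",", " ").split()
    for i, tok in enumerate(tokens):
        if tok in keywords:
            preceding = tokens[max(0, i - window):i]
            if not any(t in NEGATIONS for t in preceding):
                return 1
    return 0

print(label_report("There is a displaced fracture.", {"fracture"}))    # 1
print(label_report("No acute fracture or dislocation.", {"fracture"}))  # 0
```

The fixed token window is a crude stand-in for the scoped negation detection that mature rule-based systems implement.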
Affiliation(s)
- Seo Yi Chng
- Department of Paediatrics, National University of Singapore, Singapore
- Paul J. W. Tern
- Department of Cardiology, National Heart Centre, Singapore
- Lionel T. E. Cheng
- Department of Diagnostic Radiology, Singapore General Hospital, Singapore
28
Improved Fine-Tuning of In-Domain Transformer Model for Inferring COVID-19 Presence in Multi-Institutional Radiology Reports. J Digit Imaging 2023; 36:164-177. [PMID: 36323915 PMCID: PMC9629758 DOI: 10.1007/s10278-022-00714-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/05/2022] [Revised: 09/05/2022] [Accepted: 10/03/2022] [Indexed: 11/06/2022] Open
Abstract
Building a document-level classifier for COVID-19 on radiology reports could help assist providers in their daily clinical routine, as well as create large numbers of labels for computer vision models. We have developed such a classifier by fine-tuning a BERT-like model initialized from RadBERT, a model continuously pre-trained on radiology reports that can be used for all radiology-related tasks. RadBERT outperforms all biomedical pre-trainings on this COVID-19 task (P<0.01) and helps our fine-tuned model achieve a macro-averaged F1-score of 88.9 when evaluated on both X-ray and CT reports. To build this model, we relied on a multi-institutional dataset re-sampled and enriched with concurrent lung diseases, helping the model resist distribution shifts. In addition, we explored a variety of fine-tuning and hyperparameter optimization techniques that accelerate fine-tuning convergence, stabilize performance, and improve accuracy, especially when data or computational resources are limited. Finally, we provide a set of visualization tools and explainability methods to better understand the performance of the model and support its practical use in the clinical setting. Our approach offers a ready-to-use COVID-19 classifier and can be applied similarly to other radiology report classification tasks.
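The macro-averaged F1-score this abstract reports weights every class equally regardless of prevalence, which matters when positive COVID-19 reports are rare. A minimal reference implementation of the metric (toy labels, not the study's data):

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average with
    equal weight, so a rare class counts as much as a common one."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        # F1 = 2*TP / (2*TP + FP + FN); zero when the class is never
        # correctly predicted.
        f1s.append(2 * tp / (2 * tp + fp + fn) if tp else 0.0)
    return sum(f1s) / len(f1s)

# Perfect predictions give a macro-F1 of 1.0.
print(macro_f1([0, 1, 1, 0], [0, 1, 1, 0]))  # 1.0
```

By contrast, micro-averaged F1 (pooling all decisions) would be dominated by the majority class, masking poor performance on the minority class a screening classifier cares about most.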
29
Wiggins WF, Tejani AS. On the Opportunities and Risks of Foundation Models for Natural Language Processing in Radiology. Radiol Artif Intell 2022; 4:e220119. [PMID: 35923379 PMCID: PMC9344208 DOI: 10.1148/ryai.220119] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2022] [Revised: 06/23/2022] [Accepted: 06/27/2022] [Indexed: 06/15/2023]
Affiliation(s)
- Walter F. Wiggins
- From the Department of Radiology, Duke University Health System, 2301 Erwin Rd, Durham, NC 27710 (W.F.W.); Duke Center for Artificial Intelligence in Radiology, Duke University School of Medicine, Durham, NC (W.F.W.); and Department of Radiology, University of Texas Southwestern Medical Center, Dallas, Tex (A.S.T.)
- Ali S. Tejani
- From the Department of Radiology, Duke University Health System, 2301 Erwin Rd, Durham, NC 27710 (W.F.W.); Duke Center for Artificial Intelligence in Radiology, Duke University School of Medicine, Durham, NC (W.F.W.); and Department of Radiology, University of Texas Southwestern Medical Center, Dallas, Tex (A.S.T.)