1
Affiliation(s)
- Maxime Barat
- Department of Radiology, Hôpital Cochin, AP-HP, Université Paris Cité, Paris, France
- Philippe Soyer
- Department of Radiology, Hôpital Cochin, AP-HP, Université Paris Cité, Paris, France
- Anthony Dohan
- Department of Radiology, Hôpital Cochin, AP-HP, Université Paris Cité, Paris, France
2
Flores-Cohaila JA, García-Vicente A, Vizcarra-Jiménez SF, De la Cruz-Galán JP, Gutiérrez-Arratia JD, Quiroga Torres BG, Taype-Rondan A. Performance of ChatGPT on the Peruvian National Licensing Medical Examination: Cross-Sectional Study. JMIR Med Educ 2023; 9:e48039. PMID: 37768724; PMCID: PMC10570896; DOI: 10.2196/48039.
Abstract
BACKGROUND ChatGPT has shown impressive performance in national medical licensing examinations, such as the United States Medical Licensing Examination (USMLE), even passing it with expert-level performance. However, there is a lack of research on its performance in low-income countries' national licensing medical examinations. In Peru, where almost one out of three examinees fails the national licensing medical examination, ChatGPT has the potential to enhance medical education. OBJECTIVE We aimed to assess the accuracy of ChatGPT using GPT-3.5 and GPT-4 on the Peruvian National Licensing Medical Examination (Examen Nacional de Medicina [ENAM]). Additionally, we sought to identify factors associated with incorrect answers provided by ChatGPT. METHODS We used the ENAM 2022 data set, which consisted of 180 multiple-choice questions, to evaluate the performance of ChatGPT. Various prompts were used, and accuracy was evaluated. The performance of ChatGPT was compared to that of a sample of 1025 examinees. Factors such as question type, Peruvian-specific knowledge, discrimination, difficulty, quality of questions, and subject were analyzed to determine their influence on incorrect answers. Questions that received incorrect answers underwent a three-step process involving different prompts to explore the potential impact of adding roles and context on ChatGPT's accuracy. RESULTS GPT-4 achieved an accuracy of 86% on the ENAM, followed by GPT-3.5 with 77%. The accuracy obtained by the 1025 examinees was 55%. There was a fair agreement (κ=0.38) between GPT-3.5 and GPT-4. Moderate-to-high-difficulty questions were associated with incorrect answers in the crude and adjusted model for GPT-3.5 (odds ratio [OR] 6.6, 95% CI 2.73-15.95) and GPT-4 (OR 33.23, 95% CI 4.3-257.12). After reinputting questions that received incorrect answers, GPT-3.5 went from 41 (100%) to 12 (29%) incorrect answers, and GPT-4 from 25 (100%) to 4 (16%). 
CONCLUSIONS Our study found that ChatGPT (GPT-3.5 and GPT-4) can achieve expert-level performance on the ENAM, outperforming most of our examinees. We found fair agreement between GPT-3.5 and GPT-4. Incorrect answers were associated with question difficulty, which may resemble human performance. Furthermore, by reinputting questions that initially received incorrect answers with different prompts containing additional roles and context, ChatGPT achieved improved accuracy.
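The κ=0.38 agreement reported above is Cohen's kappa, which corrects raw agreement between two raters (here, GPT-3.5 and GPT-4 as answer selectors) for agreement expected by chance; values of 0.21-0.40 are conventionally read as "fair". A minimal sketch of the statistic (illustrative, not the authors' code):

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items."""
    assert len(rater_a) == len(rater_b)
    n = len(rater_a)
    # Observed agreement: fraction of items both raters labeled identically
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement if the raters labeled independently at their own rates
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a)
    return (p_o - p_e) / (1 - p_e)
```

For example, `cohens_kappa(list("AABB"), list("AABA"))` gives 0.5: observed agreement is 0.75 but chance agreement is already 0.5.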
Affiliation(s)
- Javier A Flores-Cohaila
- Academic Department, USAMEDIC, Lima, Peru
- Facultad de Ciencias de la Salud, Carrera de Medicina, Universidad Científica del Sur, Lima, Peru
- Abigaíl García-Vicente
- School of Medicine, Universidad Nacional de Piura, Piura, Peru
- Comité Permanente Académico, Sociedad Científica Médico Estudiantil Peruana, Lima, Peru
- Sonia F Vizcarra-Jiménez
- Comité Permanente Académico, Sociedad Científica Médico Estudiantil Peruana, Lima, Peru
- Centro de Investigación de Estudiantes de Medicina, Tacna, Peru
- Janith P De la Cruz-Galán
- Comité Permanente Académico, Sociedad Científica Médico Estudiantil Peruana, Lima, Peru
- School of Medicine, Universidad de San Martin de Porres - Filial Norte, Chiclayo, Peru
- Alvaro Taype-Rondan
- Unidad de Investigación Para la Generación y Síntesis de Evidencias en Salud, Vicerrectorado de Investigación, Universidad San Ignacio de Loyola, Lima, Peru
- EviSalud - Evidencias en Salud, Lima, Peru
3
Welsby P, Cheung BMY. ChatGPT. Postgrad Med J 2023; 99:1047-1048. PMID: 37462242; DOI: 10.1093/postmj/qgad056.
Abstract
An algorithm is a process or set of rules to be followed, especially by a computer.
4
Waisberg E, Ong J, Masalkhi M, Zaman N, Kamran SA, Sarker P, Lee AG, Tavakkoli A. Generative Pre-Trained Transformers (GPT) and Space Health: A Potential Frontier in Astronaut Health During Exploration Missions. Prehosp Disaster Med 2023; 38:532-536. PMID: 37264946; PMCID: PMC10445113; DOI: 10.1017/s1049023x23005848.
Abstract
In anticipation of space exploration in which astronauts travel farther from Earth, for longer durations, and with an increasing communication lag, artificial intelligence (AI) frameworks such as large language models (LLMs) that can be trained on Earth may provide real-time answers. This emerging technology may be helpful for acute medical emergencies, particularly in austere and distant space environments. In this manuscript, we provide an overview of generative pre-trained transformer (GPT) technology, a rapidly emerging AI technology, and the implications, considerations, and limitations of such technology for space health.
Affiliation(s)
- Ethan Waisberg
- University College Dublin School of Medicine, Belfield, Dublin, Ireland
- Joshua Ong
- Michigan Medicine, University of Michigan, Ann Arbor, Michigan, USA
- Mouayad Masalkhi
- University College Dublin School of Medicine, Belfield, Dublin, Ireland
- Nasif Zaman
- Human-Machine Perception Laboratory, Department of Computer Science and Engineering, University of Nevada - Reno, Reno, Nevada, USA
- Sharif Amit Kamran
- Human-Machine Perception Laboratory, Department of Computer Science and Engineering, University of Nevada - Reno, Reno, Nevada, USA
- Prithul Sarker
- Human-Machine Perception Laboratory, Department of Computer Science and Engineering, University of Nevada - Reno, Reno, Nevada, USA
- Andrew G. Lee
- Center for Space Medicine, Baylor College of Medicine, Houston, Texas, USA
- Department of Ophthalmology, Blanton Eye Institute, Houston Methodist Hospital, Houston, Texas, USA
- The Houston Methodist Research Institute, Houston Methodist Hospital, Houston, Texas, USA
- Departments of Ophthalmology, Neurology, and Neurosurgery, Weill Cornell Medicine, New York, New York, USA
- Department of Ophthalmology, University of Texas Medical Branch, Galveston, Texas, USA
- University of Texas MD Anderson Cancer Center, Houston, Texas, USA
- Texas A&M College of Medicine, Bryan, Texas, USA
- Department of Ophthalmology, The University of Iowa Hospitals and Clinics, Iowa City, Iowa, USA
- Alireza Tavakkoli
- Human-Machine Perception Laboratory, Department of Computer Science and Engineering, University of Nevada - Reno, Reno, Nevada, USA
5
Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, Chartash D. Authors' Reply to: Variability in Large Language Models' Responses to Medical Licensing and Certification Examinations. JMIR Med Educ 2023; 9:e50336. PMID: 37440299; PMCID: PMC10375396; DOI: 10.2196/50336.
Affiliation(s)
- Aidan Gilson
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- Conrad W Safranek
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Thomas Huang
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- Vimig Socrates
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States
- Ling Chi
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Richard Andrew Taylor
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- David Chartash
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- School of Medicine, University College Dublin, National University of Ireland, Dublin, Dublin, Ireland
6
Epstein RH, Dexter F. Variability in Large Language Models' Responses to Medical Licensing and Certification Examinations. Comment on "How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment". JMIR Med Educ 2023; 9:e48305. PMID: 37440293; PMCID: PMC10375390; DOI: 10.2196/48305.
Affiliation(s)
- Richard H Epstein
- Department of Anesthesiology, Perioperative Medicine and Pain Management, University of Miami Miller School of Medicine, Miami, FL, United States
- Franklin Dexter
- Division of Management Consulting, Department of Anesthesia, University of Iowa, Iowa City, IA, United States
7
Wang Y, Zhao H, Sciabola S, Wang W. cMolGPT: A Conditional Generative Pre-Trained Transformer for Target-Specific De Novo Molecular Generation. Molecules 2023; 28:4430. PMID: 37298906; DOI: 10.3390/molecules28114430.
Abstract
Deep generative models applied to the generation of novel compounds in small-molecule drug design have attracted a lot of attention in recent years. To design compounds that interact with specific target proteins, we propose a Generative Pre-Trained Transformer (GPT)-inspired model for de novo target-specific molecular design. By implementing different keys and values for the multi-head attention conditional on a specified target, the proposed method can generate drug-like compounds both with and without a specific target. The results show that our approach (cMolGPT) is capable of generating SMILES strings that correspond to both drug-like and active compounds. Moreover, the compounds generated from the conditional model closely match the chemical space of real target-specific molecules and cover a significant portion of novel compounds. Thus, the proposed Conditional Generative Pre-Trained Transformer (cMolGPT) is a valuable tool for de novo molecule design and has the potential to accelerate the molecular optimization cycle time.
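The abstract above describes conditioning the transformer's multi-head attention keys and values on a specified target protein. A single-head NumPy sketch of that general idea, using a simple additive conditioning and assumed shapes (an illustration, not the published cMolGPT implementation):

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def target_conditioned_attention(tokens, target, Wq, Wk, Wv):
    """Single-head attention whose keys/values depend on a target embedding.

    tokens: (seq_len, d) token embeddings; target: (d,) target embedding.
    Adding the target embedding before the key/value projections biases the
    attention output toward target-relevant features; passing a zero vector
    recovers unconditional attention with the same weights.
    """
    q = tokens @ Wq
    k = (tokens + target) @ Wk
    v = (tokens + target) @ Wv
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (seq_len, seq_len)
    return weights @ v                                  # (seq_len, d)
```

In a decoder like the one described, such a layer would sit inside an autoregressive SMILES generator, with the target embedding switching the model between unconditional and target-specific generation.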
Affiliation(s)
- Ye Wang
- Biotherapeutic and Medicinal Sciences, Biogen, 225 Binney Street, Cambridge, MA 02142, USA
- Honggang Zhao
- College of Agriculture and Life Sciences, Cornell University, Ithaca, NY 14850, USA
- Simone Sciabola
- Biotherapeutic and Medicinal Sciences, Biogen, 225 Binney Street, Cambridge, MA 02142, USA
- Wenlu Wang
- Computer Science, Texas A&M University-Corpus Christi, 6300 Ocean Dr, Corpus Christi, TX 78412, USA
8
Gilson A, Safranek CW, Huang T, Socrates V, Chi L, Taylor RA, Chartash D. How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment. JMIR Med Educ 2023; 9:e45312. PMID: 36753318; PMCID: PMC9947764; DOI: 10.2196/45312.
Abstract
BACKGROUND Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. OBJECTIVE This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability. METHODS We used 2 sets of multiple-choice questions to evaluate ChatGPT's performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT's performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question. RESULTS Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT's answer selection was present in 100% of outputs of the NBME data sets. Internal information to the question was present in 96.8% (183/189) of all questions. 
The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively. CONCLUSIONS ChatGPT marks a significant improvement in natural language processing models on the tasks of medical question answering. By performing above the 60% threshold on the NBME-Free-Step1 data set, we show that the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT's capacity to provide logic and informational context across the majority of answers. These facts taken together make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning.
Affiliation(s)
- Aidan Gilson
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- Conrad W Safranek
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Thomas Huang
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- Vimig Socrates
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Program of Computational Biology and Bioinformatics, Yale University, New Haven, CT, United States
- Ling Chi
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Richard Andrew Taylor
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- Department of Emergency Medicine, Yale University School of Medicine, New Haven, CT, United States
- David Chartash
- Section for Biomedical Informatics and Data Science, Yale University School of Medicine, New Haven, CT, United States
- School of Medicine, University College Dublin, National University of Ireland, Dublin, Dublin, Ireland