1
Ozduran E, Akkoc I, Büyükçoban S, Erkin Y, Hanci V. Readability, reliability and quality of responses generated by ChatGPT, Gemini, and Perplexity for the most frequently asked questions about pain. Medicine (Baltimore) 2025; 104:e41780. PMID: 40101096; PMCID: PMC11922396; DOI: 10.1097/md.0000000000041780.
Abstract
Artificial intelligence-based chatbots are likely to become popular healthcare applications in the near future. More than 30% of the world's population is known to suffer from chronic pain, and individuals often try to access the health information they need through online platforms before presenting to a hospital. This study aimed to examine the readability, reliability and quality of the responses given by 3 different artificial intelligence chatbots (ChatGPT, Gemini and Perplexity) to frequently asked questions about pain. The 25 most frequently searched keywords related to pain were identified using Google Trends and posed to each of the 3 chatbots. The readability of the response texts was determined with the Flesch Reading Ease Score (FRES), Simple Measure of Gobbledygook, Gunning Fog and Flesch-Kincaid Grade Level scores. Reliability was assessed with the Journal of the American Medical Association (JAMA) and DISCERN scales, and quality with the Global Quality Score (GQS) and the Ensuring Quality Information for Patients (EQIP) score. The first 3 keywords returned by the Google Trends search were "back pain," "stomach pain," and "chest pain." The readability of the answers given by all 3 chatbots was higher than the recommended 6th grade reading level (P < .001), with the order from easiest to most difficult being Gemini, ChatGPT and Perplexity. Gemini had higher GQS scores than the other chatbots (P = .008), whereas Perplexity had higher JAMA, DISCERN and EQIP scores (P < .001, P < .001, and P < .05, respectively). Overall, the answers given by ChatGPT, Gemini, and Perplexity to pain-related questions were difficult to read, and their reliability and quality were low; these chatbots cannot replace a comprehensive medical consultation. It may be recommended that artificial intelligence applications improve the readability of their text content, generate texts containing reliable references, and have their output reviewed by a supervisory expert team.
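For reference, the readability indices used across the studies in this list are computed from sentence, word and syllable counts. The standard published formulas are given below; individual online calculators may differ slightly in how they count syllables and complex words.

\mathrm{FRES} = 206.835 - 1.015\,\frac{\text{words}}{\text{sentences}} - 84.6\,\frac{\text{syllables}}{\text{words}}

\mathrm{FKGL} = 0.39\,\frac{\text{words}}{\text{sentences}} + 11.8\,\frac{\text{syllables}}{\text{words}} - 15.59

\mathrm{SMOG} = 1.0430\,\sqrt{30\,\frac{\text{polysyllabic words}}{\text{sentences}}} + 3.1291

\mathrm{Gunning\ Fog} = 0.4\left(\frac{\text{words}}{\text{sentences}} + 100\,\frac{\text{complex words}}{\text{words}}\right)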
Affiliation(s)
- Erkan Ozduran
- Sivas Numune Hospital, Physical Medicine and Rehabilitation, Pain Medicine, Sivas, Turkey
- Ibrahim Akkoc
- University of Health Sciences, Basaksehir Cam and Sakura City Hospital, Anesthesiology and Reanimation, Istanbul, Turkey
- Yüksel Erkin
- Dokuz Eylul University, Anesthesiology and Reanimation, Pain Medicine, Izmir, Turkey
- Volkan Hanci
- Dokuz Eylul University, Anesthesiology and Reanimation, Critical Care Medicine, Izmir, Turkey
2
Ozduran E, Hancı V, Erkin Y, Özbek İC, Abdulkerimov V. Assessing the readability, quality and reliability of responses produced by ChatGPT, Gemini, and Perplexity regarding most frequently asked keywords about low back pain. PeerJ 2025; 13:e18847. PMID: 39866564; PMCID: PMC11760201; DOI: 10.7717/peerj.18847.
Abstract
Background Patients who are informed about the causes, pathophysiology, treatment and prevention of a disease are better able to participate in treatment procedures in the event of illness. Artificial intelligence (AI), which has gained popularity in recent years, is defined as the study of algorithms that give machines the ability to reason and perform cognitive functions, including object and word recognition, problem solving and decision making. This study aimed to examine the readability, reliability and quality of responses to frequently asked keywords about low back pain (LBP) given by three different AI-based chatbots (ChatGPT, Perplexity and Gemini), which are popular applications for online information presentation today. Methods All three AI chatbots were asked the 25 most frequently used keywords related to LBP, determined with the help of Google Trends. To prevent possible bias arising from the sequential processing of keywords, the study was designed so that input for each keyword was provided by different users (EO, VH). The readability of the responses was determined with the Simple Measure of Gobbledygook (SMOG), Flesch Reading Ease Score (FRES) and Gunning Fog (GFG) readability scores. Quality was assessed using the Global Quality Score (GQS) and the Ensuring Quality Information for Patients (EQIP) score. Reliability was assessed with the DISCERN and Journal of the American Medical Association (JAMA) scales. Results The first three keywords identified in the Google Trends search were "Lower Back Pain", "ICD 10 Low Back Pain", and "Low Back Pain Symptoms". The readability of the responses given by all AI chatbots was higher than the recommended 6th grade reading level (p < 0.001). In the EQIP, JAMA, modified DISCERN and GQS evaluations, Perplexity scored significantly higher than the other chatbots (p < 0.001). Conclusion The answers given by AI chatbots to keywords about LBP are difficult to read and show low reliability and quality. As new chatbots are introduced, they could offer better guidance to patients through improved clarity and text quality. This study can provide inspiration for future work on improving the algorithms and responses of AI chatbots.
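As an illustration only (the studies above identified keywords through the Google Trends web interface, not programmatically), related search queries for a seed term could be pulled with the unofficial pytrends package; the seed keyword, timeframe and result count below are assumptions, not the authors' actual settings.

# Illustrative sketch, not the authors' method: fetch popular related queries
# for a seed term via the unofficial pytrends wrapper for Google Trends.
# Assumes `pip install pytrends`; the underlying endpoints can change over time.
from pytrends.request import TrendReq

pytrends = TrendReq(hl="en-US", tz=0)
pytrends.build_payload(kw_list=["low back pain"], timeframe="today 12-m")

# related_queries() returns, per keyword, "top" and "rising" DataFrames
related = pytrends.related_queries()["low back pain"]["top"]
print(related.head(25))  # roughly analogous to the 25 keywords posed to each chatbot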
Affiliation(s)
- Erkan Ozduran
- Physical Medicine and Rehabilitation, Pain Medicine, Sivas Numune Hospital, Sivas, Turkey
- Volkan Hancı
- Anesthesiology and Reanimation, Critical Care Medicine, Dokuz Eylül University, Izmir, Turkey
- Yüksel Erkin
- Anesthesiology and Reanimation, Pain Medicine, Dokuz Eylül University, Izmir, Turkey
- İlhan Celil Özbek
- Physical Medicine and Rehabilitation, Health Science University, Derince Education and Research Hospital, Kocaeli, Turkey
- Vugar Abdulkerimov
- Anesthesiology and Reanimation, Central Clinical Hospital, Baku, Azerbaijan
3
Karataş Ö, Karataş S. Quality and comprehensiveness of YouTube videos on back pain during pregnancy. Int J Gynaecol Obstet 2024; 166:419-425. PMID: 38366748; DOI: 10.1002/ijgo.15419.
Abstract
OBJECTIVES Back pain during pregnancy is a common issue that impacts the quality of life of many women. YouTube has become an increasingly popular source of health information, and pregnant women often turn to it for advice on managing back pain, but the quality of available videos is highly variable. This study aimed to assess the quality and comprehensiveness of YouTube videos related to back pain during pregnancy. METHODS A YouTube search was conducted using the keyword "back pain in pregnancy", and the first 100 resulting videos were included in the study. After a thorough review and exclusion of ineligible videos, the final sample consisted of 71 videos. Parameters such as the number of views, likes, viewer interaction, video age, upload source (healthcare professional or nonprofessional), and video length were evaluated for all videos. RESULTS Regarding the source of the videos, 44 (61.9%) were created by healthcare professionals, while 27 (38%) were created by nonprofessionals. Videos created by healthcare professionals had significantly higher DISCERN, Journal of the American Medical Association (JAMA), and Global Quality Scale (GQS) scores (P < 0.001). Our findings also indicate a statistically significant and strong positive correlation among the three scoring systems (P < 0.001). CONCLUSION Videos created by healthcare professionals were generally of higher quality, but many videos were still rated as low to moderate quality. The majority of videos focused on self-care strategies, with fewer discussing other treatment options. Our findings highlight the need for improved quality and comprehensiveness of YouTube videos on back pain during pregnancy.
Affiliation(s)
- Özlem Karataş
- Department of Physical Medicine and Rehabilitation, Faculty of Medicine, Akdeniz University, Pinarbasi Neighbourhood, Dumlupinar Boulevard, Antalya, Turkey
- Selim Karataş
- Department of Obstetrics and Gynecology, Olimpos Private Hospital, Antalya, Turkey
4
Ömür Arça D, Erdemir İ, Kara F, Shermatov N, Odacioğlu M, İbişoğlu E, Hanci FB, Sağiroğlu G, Hanci V. Assessing the readability, reliability, and quality of artificial intelligence chatbot responses to the 100 most searched queries about cardiopulmonary resuscitation: An observational study. Medicine (Baltimore) 2024; 103:e38352. PMID: 39259094; PMCID: PMC11142831; DOI: 10.1097/md.0000000000038352.
Abstract
This study aimed to evaluate the readability, reliability, and quality of responses by 4 selected artificial intelligence (AI)-based large language model (LLM) chatbots to questions related to cardiopulmonary resuscitation (CPR). This was a cross-sectional study. Responses to the 100 most frequently asked questions about CPR by 4 selected chatbots (ChatGPT-3.5 [OpenAI], Google Bard [Google AI], Google Gemini [Google AI], and Perplexity [Perplexity AI]) were analyzed for readability, reliability, and quality. The chatbots were first asked, in English: "What are the 100 most frequently asked questions about cardio pulmonary resuscitation?" Each of the 100 queries derived from the responses was then posed individually to the 4 chatbots. The 400 responses, or patient education materials (PEMs), were assessed for quality and reliability using the modified DISCERN Questionnaire, the Journal of the American Medical Association (JAMA) benchmark, and the Global Quality Score. Readability was assessed with 2 different calculators, which independently computed scores using the Flesch Reading Ease Score, Flesch-Kincaid Grade Level, Simple Measure of Gobbledygook, Gunning Fog Readability and Automated Readability Index. We analyzed 100 responses from each of the 4 chatbots. When the median readability values obtained from Calculators 1 and 2 were compared with the 6th-grade reading level, there was a highly significant difference between the groups (P < .001). According to all formulas, the readability level of the responses was above 6th grade, and the order of readability from easiest to most difficult was Bard, Perplexity, Gemini, and ChatGPT-3.5. The readability of the text content provided by all 4 chatbots was above the 6th-grade level. We believe that enhancing the quality, reliability, and readability of PEMs will make them easier for readers to understand and support more accurate performance of CPR, so patients who receive bystander CPR may have an increased likelihood of survival.
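A minimal sketch of how this kind of readability scoring could be reproduced, assuming the open-source Python package textstat; the two web calculators used in the study are not named, so exact values may differ slightly, and the sample text is a made-up example.

# Sketch: scoring a chatbot response with the readability indices reported in these studies.
# Assumes `pip install textstat`; this is not the study's own calculator.
import textstat

def readability_report(text: str) -> dict:
    """Return common readability metrics for a single response."""
    return {
        "FRES": textstat.flesch_reading_ease(text),          # higher = easier to read
        "FKGL": textstat.flesch_kincaid_grade(text),         # US school grade level
        "SMOG": textstat.smog_index(text),
        "GunningFog": textstat.gunning_fog(text),
        "ARI": textstat.automated_readability_index(text),
    }

if __name__ == "__main__":
    sample = ("Cardiopulmonary resuscitation (CPR) is an emergency procedure that "
              "combines chest compressions with rescue breaths to maintain blood "
              "flow and oxygenation until spontaneous circulation returns.")
    for metric, value in readability_report(sample).items():
        print(f"{metric}: {value:.2f}")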
Affiliation(s)
- Dilek Ömür Arça
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
- İsmail Erdemir
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
- Fevzi Kara
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
- Nurgazy Shermatov
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
- Mürüvvet Odacioğlu
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
- Emel İbişoğlu
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
- Ferid Baran Hanci
- Department of Artificial Intelligence Engineering, Faculty of Engineering, Ostim Technical University, Ankara, Turkey
- Gönül Sağiroğlu
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
- Volkan Hanci
- Department of Anesthesiology and Reanimation, School of Medicine, Dokuz Eylul University, Izmir, Turkey
5
Gül Ş, Erdemir İ, Hanci V, Aydoğmuş E, Erkoç YS. How artificial intelligence can provide information about subdural hematoma: Assessment of readability, reliability, and quality of ChatGPT, BARD, and Perplexity responses. Medicine (Baltimore) 2024; 103:e38009. PMID: 38701313; PMCID: PMC11062651; DOI: 10.1097/md.0000000000038009.
Abstract
Subdural hematoma is defined as a collection of blood in the subdural space between the dura mater and the arachnoid. It is a condition that neurosurgeons frequently encounter and has acute, subacute and chronic forms, with a reported annual incidence in adults of 1.72-20.60 per 100,000 people. Our study aimed to evaluate the quality, reliability and readability of the answers given by ChatGPT, Bard, and Perplexity to questions about "Subdural Hematoma." In this observational, cross-sectional study, we asked ChatGPT, Bard, and Perplexity separately to provide the 100 most frequently asked questions about "Subdural Hematoma." Responses from all three chatbots were analyzed separately for readability, quality, reliability and adequacy. When the median readability scores of the ChatGPT, Bard, and Perplexity answers were compared with the sixth-grade reading level, a statistically significant difference was observed for all formulas (P < .001), and all 3 chatbots' responses were found to be difficult to read. Bard responses were more readable than ChatGPT's (P < .001) and Perplexity's (P < .001) for all scores evaluated. Although the evaluated calculators differed in their results, Perplexity's answers were more readable than ChatGPT's (P < .05). Bard answers had the best Global Quality Score (GQS) values (P < .001), while Perplexity responses had the best Journal of the American Medical Association (JAMA) and modified DISCERN scores (P < .001). The current capabilities of ChatGPT, Bard, and Perplexity are inadequate in terms of the quality and readability of "Subdural Hematoma"-related text content. The readability standard for patient education materials set by the American Medical Association, National Institutes of Health, and United States Department of Health and Human Services is at or below grade 6, and the readability levels of the responses from these artificial intelligence applications are significantly higher than that recommended 6th grade level.
Affiliation(s)
- Şanser Gül
- Department of Neurosurgery, Ankara Ataturk Sanatory Education and Research Hospital, Ankara, Turkey
- İsmail Erdemir
- Department of Anesthesiology and Critical Care, Faculty of Medicine, Dokuz Eylül University, Izmir, Turkey
- Volkan Hanci
- Department of Anesthesiology and Reanimation, Ankara Sincan Education and Research Hospital, Ankara, Turkey
- Evren Aydoğmuş
- Department of Neurosurgery, Istanbul Kartal Dr Lütfi Kırdar City Hospital, Istanbul, Turkey
- Yavuz Selim Erkoç
- Department of Neurosurgery, Ankara Ataturk Sanatory Education and Research Hospital, Ankara, Turkey
6
Hanci V, Otlu B, Biyikoğlu AS. Assessment of the Readability of the Online Patient Education Materials of Intensive and Critical Care Societies. Crit Care Med 2024; 52:e47-e57. PMID: 37962133; DOI: 10.1097/ccm.0000000000006121.
Abstract
OBJECTIVES This study aimed to evaluate the readability of patient education materials (PEMs) on the websites of intensive and critical care societies. DATA SOURCES Websites of intensive and critical care societies that are members of The World Federation of Intensive and Critical Care and The European Society of Intensive Care Medicine. SETTING Cross-sectional, observational, internet-based readability study of website PEMs. STUDY SELECTION The readability of the PEMs available on the societies' sites was evaluated. DATA EXTRACTION The readability formulas used were the Flesch Reading Ease Score (FRES), Flesch-Kincaid Grade Level (FKGL), Simple Measure of Gobbledygook (SMOG), and Gunning Fog (GFOG). DATA SYNTHESIS One hundred twenty-seven PEMs from 11 different societies were included in our study. In the readability analysis, the FRES was 58.10 (48.85-63.77) (difficult), the mean FKGL and SMOG were 10.19 (8.93-11.72) and 11.10 (10.11-11.87) years, respectively, and the mean GFOG score was 12.73 (11.37-14.15) (very difficult). All readability formula results were significantly higher than the recommended sixth-grade level (p < 0.001), and the PEMs of every society were above the sixth-grade level when the societies were evaluated individually (p < 0.05). CONCLUSIONS Compared with the sixth-grade level recommended by the American Medical Association and the National Institutes of Health, the reading level of PEMs from intensive and critical care societies is considerably high. PEMs of intensive and critical care societies should be prepared with attention to readability recommendations.
Affiliation(s)
- Volkan Hanci
- Anesthesiology and Reanimation Department, Dokuz Eylul University, Izmir, Turkey