1. Tran VT, Gartlehner G, Yaacoub S, Boutron I, Schwingshackl L, Stadelmaier J, Sommer I, Alebouyeh F, Afach S, Meerpohl J, Ravaud P. Sensitivity and Specificity of Using GPT-3.5 Turbo Models for Title and Abstract Screening in Systematic Reviews and Meta-analyses. Ann Intern Med 2024. PMID: 38768452. DOI: 10.7326/m23-3389.
Abstract
BACKGROUND: Systematic reviews are performed manually despite the exponential growth of scientific literature.
OBJECTIVE: To investigate the sensitivity and specificity of GPT-3.5 Turbo, from OpenAI, as a single reviewer for title and abstract screening in systematic reviews.
DESIGN: Diagnostic test accuracy study.
SETTING: Unannotated bibliographic databases from 5 systematic reviews representing 22 665 citations.
PARTICIPANTS: None.
MEASUREMENTS: A generic prompt framework to instruct GPT to perform title and abstract screening was designed. The output of the model was compared with decisions from authors under 2 rules. The first rule balanced sensitivity and specificity, for example, to act as a second reviewer. The second rule optimized sensitivity, for example, to reduce the number of citations to be manually screened.
RESULTS: Under the balanced rule, sensitivities ranged from 81.1% to 96.5% and specificities ranged from 25.8% to 80.4%. Across all reviews, GPT identified 7 of 708 citations (1%) missed by humans that should have been included after full-text screening, at the cost of 10 279 of 22 665 false-positive recommendations (45.3%) that would require reconciliation during the screening process. Under the sensitive rule, sensitivities ranged from 94.6% to 99.8% and specificities ranged from 2.2% to 46.6%. Limiting manual screening to citations not ruled out by GPT could reduce the number of citations to screen from 127 of 6334 (2%) to 1851 of 4077 (45.4%), at the cost of missing from 0 to 1 of 26 citations (3.8%) at the full-text level.
LIMITATIONS: Time needed to fine-tune the prompt; retrospective design; convenience sample of 5 systematic reviews; GPT performance sensitive to prompt development and time.
CONCLUSION: The GPT-3.5 Turbo model may be used as a second reviewer for title and abstract screening, at the cost of additional work to reconcile added false positives. It also showed potential to reduce the number of citations before screening by humans, at the cost of missing some citations at the full-text level.
PRIMARY FUNDING SOURCE: None.
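The study's actual prompt framework and decision rules are not reproduced in the abstract; the sketch below is only a minimal illustration of the kind of workflow it describes, assuming the openai Python package (v1.x), an OPENAI_API_KEY in the environment, and hypothetical eligibility criteria, prompt wording, and rule mapping that are not the authors' protocol.

```python
# Minimal, hypothetical sketch of GPT-assisted title/abstract screening.
# The prompt text, criteria, and the two decision rules are illustrative assumptions,
# not the framework used in the study.
from openai import OpenAI

client = OpenAI()

CRITERIA = "Randomized trials of exercise interventions in adults with type 2 diabetes."  # example only

def screen(title: str, abstract: str) -> str:
    """Ask the model for a one-word screening decision: INCLUDE, EXCLUDE, or UNSURE."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,
        messages=[
            {"role": "system",
             "content": "You screen citations for a systematic review. "
                        "Answer with exactly one word: INCLUDE, EXCLUDE, or UNSURE."},
            {"role": "user",
             "content": f"Eligibility criteria: {CRITERIA}\n\nTitle: {title}\n\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content.strip().upper()

def decide(label: str, rule: str = "balanced") -> bool:
    """Map the model's label to a screening decision under two illustrative rules."""
    if rule == "balanced":    # act like a second reviewer: only clear inclusions pass
        return label == "INCLUDE"
    if rule == "sensitive":   # reduce manual workload: rule out only clear exclusions
        return label != "EXCLUDE"
    raise ValueError(f"unknown rule: {rule}")
```

Under a setup like this, the "balanced" rule would be reconciled against a human reviewer's decisions, while the "sensitive" rule would only be used to discard citations before manual screening.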
Affiliation(s)
- Viet-Thi Tran: Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAe, Centre for Research in Epidemiology and Statistics (CRESS), Paris; and Centre d'Epidémiologie Clinique, Hôpital Hôtel-Dieu, AP-HP, Paris, France (V.-T.T.)
- Gerald Gartlehner: Department for Evidence-based Medicine and Evaluation, University for Continuing Education Krems, Krems, Austria; and Center for Public Health Methods, RTI International, Research Triangle Park, North Carolina (G.G.)
- Sally Yaacoub: Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAe, Centre for Research in Epidemiology and Statistics (CRESS), Paris, France (S.Y., F.A.)
- Isabelle Boutron: Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAe, Centre for Research in Epidemiology and Statistics (CRESS), Paris, France; and Centre d'Epidémiologie Clinique, Hôpital Hôtel-Dieu, AP-HP, Paris, France (I.B.)
- Lukas Schwingshackl: Institute for Evidence in Medicine, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany (L.S., J.S., J.M.)
- Julia Stadelmaier: Institute for Evidence in Medicine, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany (L.S., J.S., J.M.)
- Isolde Sommer: Department for Evidence-based Medicine and Evaluation, University for Continuing Education Krems, Krems, Austria (I.S.)
- Farzaneh Alebouyeh: Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAe, Centre for Research in Epidemiology and Statistics (CRESS), Paris, France (S.Y., F.A.)
- Sivem Afach: Epidemiology in Dermatology and Evaluation of Therapeutics (EpiDermE)-EA 7379, University Paris Est Créteil Val de Marne, Créteil, France (S.A.)
- Joerg Meerpohl: Institute for Evidence in Medicine, Medical Center-University of Freiburg, Faculty of Medicine, University of Freiburg, Freiburg, Germany (L.S., J.S., J.M.)
- Philippe Ravaud: Université Paris Cité and Université Sorbonne Paris Nord, Inserm, INRAe, Centre for Research in Epidemiology and Statistics (CRESS), Paris, France; Centre d'Epidémiologie Clinique, Hôpital Hôtel-Dieu, AP-HP, Paris, France; and Department of Epidemiology, Columbia University Mailman School of Public Health, New York, New York (P.R.)
2. Roth S, Wermer-Colan A. Machine Learning Methods for Systematic Reviews: A Rapid Scoping Review. Dela J Public Health 2023; 9:40-47. PMID: 38173960. PMCID: PMC10759980. DOI: 10.32481/djph.2023.11.008.
Abstract
Objective: Natural language processing, also known as text mining, has been at the forefront of machine learning research since its inception and refers to a wide range of statistical processes for analyzing textual data and retrieving information. In medical fields, text mining has made valuable contributions in unexpected ways, not least by synthesizing data from disparate biomedical studies. This rapid scoping review examines how machine learning methods for text mining can be implemented at the intersection of these disparate fields to improve the workflow and process of conducting systematic reviews in medical research and related academic disciplines.
Methods: The primary research question was: "What impact does the use of machine learning have on the methods used by systematic review teams to carry out the systematic review process, such as the precision of search strategies, unbiased article selection, or data abstraction and/or analysis for systematic reviews and other comprehensive review types of similar methodology?" A literature search was conducted by a medical librarian using multiple databases, a grey literature search, and handsearching of the literature. The database search was completed on December 4, 2020; handsearching was done on an ongoing basis with an end date of April 14, 2023.
Results: The search yielded 23,190 studies after duplicates were removed. In total, 117 studies (1.70%) met the eligibility criteria for inclusion in this rapid scoping review.
Conclusions: Several machine learning techniques and tools are in development, or have already been fully developed, to assist with the stages of a systematic review. Combined with human intelligence, these methods and tools promise to make the systematic review process more efficient, saving valuable time for systematic review authors and increasing the speed with which evidence can be created and placed in the hands of decision makers and the public.
Affiliation(s)
- Stephanie Roth: Medical Librarian, Lewis B. Flinn Medical Library, ChristianaCare
- Alex Wermer-Colan: Academic Director, Loretta C. Duckworth Scholars Studio, Temple University Libraries
3. Oliveira Dos Santos Á, Sergio da Silva E, Machado Couto L, Valadares Labanca Reis G, Silva Belo V. The use of artificial intelligence for automating or semi-automating biomedical literature analyses: a scoping review. J Biomed Inform 2023; 142:104389. PMID: 37187321. DOI: 10.1016/j.jbi.2023.104389.
Abstract
OBJECTIVE: Evidence-based medicine (EBM) is a decision-making process based on the conscious and judicious use of the best available scientific evidence. However, the exponential increase in the amount of information currently available likely exceeds the capacity of human-only analysis. In this context, artificial intelligence (AI) and its branches, such as machine learning (ML), can be used to facilitate human efforts in analyzing the literature to foster EBM. The present scoping review aimed to examine the use of AI in the automation of biomedical literature surveys and analyses, with a view to establishing the state of the art and identifying knowledge gaps.
MATERIALS AND METHODS: Comprehensive searches of the main databases were performed for articles published up to June 2022, and studies were selected according to inclusion and exclusion criteria. Data were extracted from the included articles and the findings categorized.
RESULTS: The total number of records retrieved from the databases was 12,145, of which 273 were included in the review. Classification of the studies according to the use of AI in evaluating the biomedical literature revealed three main application groups: assembly of scientific evidence (n=127; 47%), mining the biomedical literature (n=112; 41%), and quality analysis (n=34; 12%). Most studies addressed the preparation of systematic reviews, while articles focusing on the development of guidelines and evidence synthesis were the least frequent. The biggest knowledge gap was identified within the quality analysis group, particularly regarding methods and tools that assess the strength of recommendations and the consistency of evidence.
CONCLUSION: Our review shows that, despite significant progress in the automation of biomedical literature surveys and analyses in recent years, intense research is needed to fill knowledge gaps on more difficult aspects of ML, deep learning, and natural language processing, and to consolidate the use of automation by end users (biomedical researchers and healthcare professionals).
Affiliation(s)
- Eduardo Sergio da Silva: Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil
- Letícia Machado Couto: Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil
- Vinícius Silva Belo: Federal University of São João del-Rei, Campus Centro-Oeste Dona Lindu, Divinópolis, Minas Gerais, Brazil
4. Gurung P, Makineli S, Spijker R, Leeflang MMG. The Emtree term "diagnostic test accuracy study" retrieved less than half of the diagnostic accuracy studies in Embase. J Clin Epidemiol 2020; 126:116-121. PMID: 32615208. DOI: 10.1016/j.jclinepi.2020.06.030.
Abstract
OBJECTIVES: Embase is a biomedical and pharmacological bibliographic database of published literature, produced by Elsevier. In 2011, Embase introduced the Emtree term "diagnostic test accuracy study" after discussion with the diagnostic test accuracy (DTA) community of Cochrane. The aim of this study was to investigate the performance of this Emtree term when used to retrieve diagnostic accuracy studies.
STUDY DESIGN AND SETTING: We first piloted a random selection of 1,000 titles from Embase and then repeated the process with 1,223 studies specifically limited to humans. Two researchers independently screened these titles for eligibility. For titles that were indicated as relevant or potentially relevant by at least one assessor, the full texts were retrieved and screened. A third researcher retrieved the Emtree terms for each title and checked whether "diagnostic test accuracy study" was one of the attached Emtree terms. The results of both exercises were then cross-classified, and the sensitivity and specificity of the Emtree term were estimated.
RESULTS: Our pilot set consisted of 1,000 studies, of which 20 (2.0%) were studies from which DTA data could be extracted. Thirteen studies carried the DTA study label, of which five were indeed DTA studies. The final set consisted of 1,223 studies, of which 33 (2.7%) were DTA studies. Twenty studies were labeled as DTA studies, of which fourteen indeed were. This resulted in a sensitivity of 42.4% (95% CI: 25.5% to 60.8%) and a specificity of 99.5% (95% CI: 98.9% to 99.8%).
CONCLUSION: Although we planned to include a more focused set of studies in our second attempt, the percentage of DTA studies was similar in both attempts. The DTA label failed to retrieve most of the DTA studies, and 30% of the studies labeled as DTA studies were in fact not. The Emtree term "diagnostic test accuracy study" does not meet the requirements to be useful for retrieving DTA studies accurately.
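The headline accuracy figures can be recomputed from the 2x2 counts implied by the final set (1,223 records; 33 DTA studies; 20 labeled as DTA, 14 of them truly DTA). The sketch below does so in Python; the Wilson score interval is an assumption, since the paper does not state its CI method, so the sensitivity bounds differ slightly from the reported exact-looking 95% CI.

```python
# Recompute sensitivity/specificity of the Emtree "diagnostic test accuracy study" label
# from the counts implied by the abstract's final set of 1,223 records.
# The Wilson score interval is an assumption; the paper does not name its CI method.
from math import sqrt

tp = 14               # labeled DTA and truly DTA
fp = 20 - 14          # labeled DTA but not DTA
fn = 33 - 14          # truly DTA but not labeled
tn = 1223 - 33 - fp   # neither truly DTA nor labeled

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Approximate 95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = p + z**2 / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / denom, (centre + margin) / denom

sens = tp / (tp + fn)   # 14/33  ~ 42.4%
spec = tn / (tn + fp)   # 1184/1190 ~ 99.5%

lo, hi = wilson_ci(tp, tp + fn)
print(f"sensitivity {sens:.1%} (95% CI {lo:.1%} to {hi:.1%})")
lo, hi = wilson_ci(tn, tn + fp)
print(f"specificity {spec:.1%} (95% CI {lo:.1%} to {hi:.1%})")
```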
Affiliation(s)
- Pema Gurung: Amsterdam UMC, University of Amsterdam, Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Amsterdam Public Health, Meibergdreef 9, Amsterdam, The Netherlands
- Sahile Makineli: Amsterdam UMC, University of Amsterdam, Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Amsterdam Public Health, Meibergdreef 9, Amsterdam, The Netherlands
- René Spijker: Amsterdam UMC, University of Amsterdam, Medical Library, Amsterdam Public Health, Meibergdreef 9, Amsterdam, The Netherlands; Cochrane Netherlands, Julius Center for Health Sciences and Primary Care, UMC Utrecht, Utrecht University, Utrecht, The Netherlands
- Mariska M G Leeflang: Amsterdam UMC, University of Amsterdam, Department of Clinical Epidemiology, Biostatistics and Bioinformatics, Amsterdam Public Health, Meibergdreef 9, Amsterdam, The Netherlands