1. Mullen KR, Tammen I, Matentzoglu NA, Mather M, Balhoff JP, Esdaile E, Leroy G, Park CA, Rando HM, Saklou NT, Webb TL, Vasilevsky NA, Mungall CJ, Haendel MA, Nicholas FW, Toro S. The Vertebrate Breed Ontology: Toward Effective Breed Data Standardization. J Vet Intern Med 2025; 39:e70133. [PMID: 40413720 PMCID: PMC12103836 DOI: 10.1111/jvim.70133] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2024] [Revised: 04/28/2025] [Accepted: 05/08/2025] [Indexed: 05/27/2025] Open
Abstract
BACKGROUND Limited universally-adopted data standards in veterinary medicine hinder data interoperability and therefore integration and comparison; this ultimately impedes the application of existing information-based tools to support advancement in diagnostics, treatments, and precision medicine. HYPOTHESIS/OBJECTIVES A single, coherent, logic-based standard for documenting breed names in health, production, and research-related records will improve data use capabilities in veterinary and comparative medicine. ANIMALS No live animals were used. METHODS The Vertebrate Breed Ontology (VBO) was created from breed names and related information compiled from the Food and Agriculture Organization of the United Nations, breed registries, communities, and experts, using manual and computational approaches. Each breed is represented by a VBO term that includes breed information and provenance as metadata. VBO terms are classified using description logic to allow computational applications and Artificial Intelligence-readiness. RESULTS VBO is an open, community-driven ontology representing over 19 500 livestock and companion animal breed concepts covering 49 species. Breeds are classified based on community and expert conventions (e.g., cattle breed) and supported by relations to the breed's genus and species indicated by National Center for Biotechnology Information (NCBI) Taxonomy terms. Relationships between VBO terms (e.g., relating breeds to their foundation stock) provide additional context to support advanced data analytics. VBO term metadata includes synonyms, breed identifiers/codes, and attributed cross-references to other databases. CONCLUSION AND CLINICAL IMPORTANCE The adoption of VBO as a standard for breed names in databases and veterinary electronic health records enhances veterinary data interoperability and computability, supporting precision medicine.
Affiliation(s)
- Kathleen R. Mullen
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Imke Tammen
  - Sydney School of Veterinary Science, The University of Sydney, Sydney, New South Wales, Australia
- Marius Mather
  - Sydney Informatics Hub, The University of Sydney, Sydney, New South Wales, Australia
- James P. Balhoff
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Elizabeth Esdaile
  - Veterinary Genetics Laboratory, School of Veterinary Medicine, University of California, Davis, California, USA
- Gregoire Leroy
  - Animal Production and Health Division, Food and Agriculture Organization of the United Nations, Rome, Italy
- Carissa A. Park
  - Department of Animal Science, Iowa State University, Ames, Iowa, USA
- Halie M. Rando
  - Department of Computer Science, Smith College, Northampton, Massachusetts, USA
- Nadia T. Saklou
  - Department of Clinical Sciences, Colorado State University, Fort Collins, Colorado, USA
- Tracy L. Webb
  - Department of Clinical Sciences, Colorado State University, Fort Collins, Colorado, USA
- Melissa A. Haendel
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Frank W. Nicholas
  - Sydney School of Veterinary Science, The University of Sydney, Sydney, New South Wales, Australia
- Sabrina Toro
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
2. Mirza A, Alampara N, Kunchapu S, Ríos-García M, Emoekabu B, Krishnan A, Gupta T, Schilling-Wilhelmi M, Okereke M, Aneesh A, Asgari M, Eberhardt J, Elahi AM, Elbeheiry HM, Gil MV, Glaubitz C, Greiner M, Holick CT, Hoffmann T, Ibrahim A, Klepsch LC, Köster Y, Kreth FA, Meyer J, Miret S, Peschel JM, Ringleb M, Roesner NC, Schreiber J, Schubert US, Stafast LM, Wonanke ADD, Pieler M, Schwaller P, Jablonka KM. A framework for evaluating the chemical knowledge and reasoning abilities of large language models against the expertise of chemists. Nat Chem 2025:10.1038/s41557-025-01815-x. [PMID: 40394186 DOI: 10.1038/s41557-025-01815-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2024] [Accepted: 03/26/2025] [Indexed: 05/22/2025]
Abstract
Large language models (LLMs) have gained widespread interest owing to their ability to process human language and perform tasks on which they have not been explicitly trained. However, we possess only a limited systematic understanding of the chemical capabilities of LLMs, which would be required to improve models and mitigate potential harm. Here we introduce ChemBench, an automated framework for evaluating the chemical knowledge and reasoning abilities of state-of-the-art LLMs against the expertise of chemists. We curated more than 2,700 question-answer pairs, evaluated leading open- and closed-source LLMs and found that the best models, on average, outperformed the best human chemists in our study. However, the models struggle with some basic tasks and provide overconfident predictions. These findings reveal LLMs' impressive chemical capabilities while emphasizing the need for further research to improve their safety and usefulness. They also suggest adapting chemistry education and show the value of benchmarking frameworks for evaluating LLMs in specific domains.
Affiliation(s)
- Adrian Mirza
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Jena, Germany
- Nawaf Alampara
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Sreekanth Kunchapu
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Martiño Ríos-García
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Institute of Carbon Science and Technology, CSIC, Oviedo, Spain
- Benedict Emoekabu
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Tanya Gupta
  - Laboratory of Artificial Chemical Intelligence, Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
  - National Centre of Competence in Research Catalysis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Mara Schilling-Wilhelmi
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Macjonathan Okereke
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Anagha Aneesh
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Mehrdad Asgari
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, Cambridge, UK
- Amir Mohammad Elahi
  - Laboratory of Molecular Simulation, Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne, Sion, Switzerland
- Hani M Elbeheiry
  - Institute for Inorganic and Analytical Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Maximilian Greiner
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Caroline T Holick
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
- Tim Hoffmann
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
- Abdelrahman Ibrahim
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Lea C Klepsch
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
- Yannik Köster
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
- Fabian Alexander Kreth
  - Institute for Technical Chemistry and Environmental Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Center for Energy and Environmental Chemistry Jena, Friedrich Schiller University Jena, Jena, Germany
- Jakob Meyer
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Jan Matthias Peschel
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
- Michael Ringleb
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
- Nicole C Roesner
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
- Johanna Schreiber
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
- Ulrich S Schubert
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Jena, Germany
  - Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
  - Center for Energy and Environmental Chemistry Jena, Friedrich Schiller University Jena, Jena, Germany
- Leanne M Stafast
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
- A D Dinga Wonanke
  - Theoretical Chemistry, Technische Universität Dresden, Dresden, Germany
- Philippe Schwaller
  - Laboratory of Artificial Chemical Intelligence, Institut des Sciences et Ingénierie Chimiques, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
  - National Centre of Competence in Research Catalysis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland
- Kevin Maik Jablonka
  - Laboratory of Organic and Macromolecular Chemistry, Friedrich Schiller University Jena, Jena, Germany
  - Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Jena, Germany
  - Jena Center for Soft Matter, Friedrich Schiller University Jena, Jena, Germany
  - Center for Energy and Environmental Chemistry Jena, Friedrich Schiller University Jena, Jena, Germany
3. Vendetti J, Harris NL, Dorf MV, Skrenchuk A, Caufield JH, Gonçalves RS, Graybeal JB, Hegde H, Redmond T, Mungall CJ, Musen MA. BioPortal: an open community resource for sharing, searching, and utilizing biomedical ontologies. Nucleic Acids Res 2025:gkaf402. [PMID: 40357648 DOI: 10.1093/nar/gkaf402] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2025] [Revised: 04/11/2025] [Accepted: 05/02/2025] [Indexed: 05/15/2025] Open
Abstract
BioPortal (https://bioportal.bioontology.org) is the world's most comprehensive repository of biomedical ontologies. It provides infrastructure for finding, sharing, searching, and utilizing biomedical ontologies. Launched in 2005, BioPortal now includes 1549 ontologies (1182 of them public). Its open, freely accessible website enables anyone (i) to browse the ontology library, (ii) to search for terms across ontologies, (iii) to browse mappings between terms, (iv) to see popularity ratings and recommendations on which ontologies are most relevant to their use cases, (v) to annotate text with ontology terms, (vi) to submit an ontology, and (vii) to request ontology changes. The library of ontologies can be accessed programmatically via a REST application programming interface (API). Recent enhancements include a BioPortal knowledge graph that integrates knowledge from multiple ontologies; a unified data model for interoperability with other knowledge sources; ontology popularity ratings and recommendations for relevant ontologies; and the ability to request ontology changes via a simple user interface that automatically converts user change requests to GitHub Pull Requests that specify the edits that will be made to the ontology upon approval.
Affiliation(s)
- Jennifer Vendetti
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
- Nomi L Harris
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Michael V Dorf
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
- Alex Skrenchuk
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
- J Harry Caufield
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Rafael S Gonçalves
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
- John B Graybeal
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
- Harshad Hegde
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Timothy Redmond
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
- Christopher J Mungall
  - Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
- Mark A Musen
  - Center for Biomedical Informatics Research, Stanford University, Palo Alto, CA 94304, United States
4. Dahiya DS, Ali H, Moond V, Shah MDA, Santana C, Ali N, Sheikh AB, Nadeem MA, Munir A, Quazi MA, Bharadwaj HR, Sohail AH. Large Language Models in Gastroenterology and Gastrointestinal Surgery: A New Frontier in Patient Communication and Education. Gastroenterology Res 2025; 18:39-48. [PMID: 40322195 PMCID: PMC12045793 DOI: 10.14740/gr2011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2024] [Accepted: 03/12/2025] [Indexed: 05/08/2025] Open
Abstract
When integrated into healthcare, large language models (LLMs) have transformative and revolutionary effects, including significant potential for improving patient care and streamlining clinical processes. However, one specialty that particularly requires data on LLM use is gastroenterology and gastrointestinal surgery, a gap we sought to address in our research. Advanced artificial intelligence (AI) systems like LLMs have demonstrated the ability to mimic human communication, assist in diagnosis, provide patient education, and support medical research simultaneously. Despite these advantages, challenges such as biases, data privacy concerns, and lack of transparency in decision-making remain critical. The role of regulations in mitigating these risks is widely debated, with proponents advocating for structured oversight to enhance trust and patient safety, while others caution against potential barriers to innovation. Rather than replacing human expertise, AI should be integrated thoughtfully to complement clinical decision-making. Ensuring a balanced approach requires collaboration between medical professionals, AI developers, and policymakers to optimize its responsible implementation in healthcare.
Affiliation(s)
- Dushyant Singh Dahiya
  - Division of Gastroenterology, Hepatology and Motility, The University of Kansas School of Medicine, Kansas City, KS, USA
- Hassam Ali
  - Department of Gastroenterology, Division of Internal Medicine, ECU Health Medical Center/Brody School of Medicine, Greenville, NC, USA
- Vishali Moond
  - Department of Internal Medicine, Saint Peter’s University Hospital/Robert Wood Johnson Medical School, New Brunswick, NJ, USA
- M. Danial Ali Shah
  - Department of Internal Medicine, King Edward Medical University, Lahore, Pakistan
- Christina Santana
  - Department of Internal Medicine, Division of Internal Medicine, ECU Health Medical Center/Brody School of Medicine, Greenville, NC, USA
- Noor Ali
  - Department of Public Health, East Carolina University, Greenville, NC, USA
- Abu Baker Sheikh
  - Department of Internal Medicine, University of New Mexico Health Sciences, Albuquerque, NM, USA
- Muhammad Ahmad Nadeem
  - Department of Hepato-Pancreato-Biliary and Liver Transplant Surgery, Digestive Diseases and Surgery Institute, Cleveland Clinic Foundation, Cleveland, OH, USA
- Aqsa Munir
  - Department of Internal Medicine, Dow Medical College, Karachi, Pakistan
- Mohammed A. Quazi
  - Department of Internal Medicine, University of New Mexico Health Sciences, Albuquerque, NM, USA
- Amir Humza Sohail
  - Department of Surgery, University of New Mexico Health Sciences, Albuquerque, NM, USA
5. Anderson LN, Hoyt CT, Zucker JD, McNaughton AD, Teuton JR, Karis K, Arokium-Christian NN, Warley JT, Stromberg ZR, Gyori BM, Kumar N. Computational tools and data integration to accelerate vaccine development: challenges, opportunities, and future directions. Front Immunol 2025; 16:1502484. [PMID: 40124369 PMCID: PMC11925797 DOI: 10.3389/fimmu.2025.1502484] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Accepted: 01/23/2025] [Indexed: 03/25/2025] Open
Abstract
The development of effective vaccines is crucial for combating current and emerging pathogens. Despite significant advances in the field of vaccine development there remain numerous challenges including the lack of standardized data reporting and curation practices, making it difficult to determine correlates of protection from experimental and clinical studies. Significant gaps in data and knowledge integration can hinder vaccine development which relies on a comprehensive understanding of the interplay between pathogens and the host immune system. In this review, we explore the current landscape of vaccine development, highlighting the computational challenges, limitations, and opportunities associated with integrating diverse data types for leveraging artificial intelligence (AI) and machine learning (ML) techniques in vaccine design. We discuss the role of natural language processing, semantic integration, and causal inference in extracting valuable insights from published literature and unstructured data sources, as well as the computational modeling of immune responses. Furthermore, we highlight specific challenges associated with uncertainty quantification in vaccine development and emphasize the importance of establishing standardized data formats and ontologies to facilitate the integration and analysis of heterogeneous data. Through data harmonization and integration, the development of safe and effective vaccines can be accelerated to improve public health outcomes. Looking to the future, we highlight the need for collaborative efforts among researchers, data scientists, and public health experts to realize the full potential of AI-assisted vaccine design and streamline the vaccine development process.
Affiliation(s)
- Charles Tapley Hoyt
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA, United States
- Jeremy D. Zucker
  - Pacific Northwest National Laboratory (DOE), Richland, WA, United States
- Jeremy R. Teuton
  - Pacific Northwest National Laboratory (DOE), Richland, WA, United States
- Klas Karis
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA, United States
- Jackson T. Warley
  - Pacific Northwest National Laboratory (DOE), Richland, WA, United States
- Benjamin M. Gyori
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA, United States
  - Department of Bioengineering, College of Engineering, Northeastern University, Boston, MA, United States
- Neeraj Kumar
  - Pacific Northwest National Laboratory (DOE), Richland, WA, United States
6. Schilling-Wilhelmi M, Ríos-García M, Shabih S, Gil MV, Miret S, Koch CT, Márquez JA, Jablonka KM. From text to insight: large language models for chemical data extraction. Chem Soc Rev 2025; 54:1125-1150. [PMID: 39703015 DOI: 10.1039/d4cs00913d] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/21/2024]
Abstract
The vast majority of chemical knowledge exists in unstructured natural language, yet structured data is crucial for innovative and systematic materials design. Traditionally, the field has relied on manual curation and partial automation for data extraction for specific use cases. The advent of large language models (LLMs) represents a significant shift, potentially enabling non-experts to extract structured, actionable data from unstructured text efficiently. While applying LLMs to chemical and materials science data extraction presents unique challenges, domain knowledge offers opportunities to guide and validate LLM outputs. This tutorial review provides a comprehensive overview of LLM-based structured data extraction in chemistry, synthesizing current knowledge and outlining future directions. We address the lack of standardized guidelines and present frameworks for leveraging the synergy between LLMs and chemical expertise. This work serves as a foundational resource for researchers aiming to harness LLMs for data-driven chemical research. The insights presented here could significantly enhance how researchers across chemical disciplines access and utilize scientific information, potentially accelerating the development of novel compounds and materials for critical societal needs.
Affiliation(s)
- Mara Schilling-Wilhelmi
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany
- Martiño Ríos-García
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany
  - Institute of Carbon Science and Technology (INCAR), CSIC, Francisco Pintado Fe 26, 33011 Oviedo, Spain
- Sherjeel Shabih
  - Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- María Victoria Gil
  - Institute of Carbon Science and Technology (INCAR), CSIC, Francisco Pintado Fe 26, 33011 Oviedo, Spain
- Christoph T Koch
  - Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- José A Márquez
  - Department of Physics and CSMB, Humboldt-Universität zu Berlin, Berlin, Germany
- Kevin Maik Jablonka
  - Laboratory of Organic and Macromolecular Chemistry (IOMC), Friedrich Schiller University Jena, Humboldtstrasse 10, 07743 Jena, Germany
  - Center for Energy and Environmental Chemistry Jena (CEEC Jena), Friedrich Schiller University Jena, Philosophenweg 7a, 07743 Jena, Germany
  - Helmholtz Institute for Polymers in Energy Applications Jena (HIPOLE Jena), Lessingstrasse 12-14, 07743 Jena, Germany
7. Zhang Y, Ren S, Wang J, Lu J, Wu C, He M, Liu X, Wu R, Zhao J, Zhan C, Du D, Zhan Z, Singla RK, Shen B. Aligning Large Language Models with Humans: A Comprehensive Survey of ChatGPT's Aptitude in Pharmacology. Drugs 2025; 85:231-254. [PMID: 39702867 PMCID: PMC11802629 DOI: 10.1007/s40265-024-02124-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 11/11/2024] [Indexed: 12/21/2024]
Abstract
BACKGROUND Due to the lack of a comprehensive pharmacology test set, evaluating the potential and value of large language models (LLMs) in pharmacology is complex and challenging. AIMS This study aims to provide a test set reference for assessing the application potential of both general-purpose and specialized LLMs in pharmacology. METHODS We constructed a pharmacology test set consisting of three tasks: drug information retrieval, lead compound structure optimization, and research trend summarization and analysis. Subsequently, we compared the performance of general-purpose LLMs GPT-3.5 and GPT-4 on this test set. RESULTS The results indicate that GPT-3.5 and GPT-4 can better understand instructions for information retrieval, scheme optimization, and trend summarization in pharmacology, showing significant potential in basic pharmacology tasks, especially in areas such as drug pharmacological properties, pharmacokinetics, mode of action, and toxicity prediction. These general LLMs also effectively summarize the current challenges and future trends in this field, proving their valuable resource for interdisciplinary pharmacology researchers. However, the limitations of ChatGPT become evident when handling tasks such as drug identification queries, drug interaction information retrieval, and drug structure simulation optimization. It struggles to provide accurate interaction information for individual or specific drugs and cannot optimize specific drugs. This lack of depth in knowledge integration and analysis limits its application in scientific research and clinical exploration. CONCLUSION Therefore, exploring retrieval-augmented generation (RAG) or integrating proprietary knowledge bases and knowledge graphs into pharmacology-oriented ChatGPT systems would yield favorable results. This integration will further optimize the potential of LLMs in pharmacology.
Affiliation(s)
- Yingbo Zhang
  - Department of Pharmacy and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
  - Tropical Crops Genetic Resources Institute, Chinese Academy of Tropical Agricultural Sciences (CATAS), Haikou, 571101, China
- Shumin Ren
  - Department of Pharmacy and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
  - Department of Computer Science and Information Technology, University of A Coruña, 15071, A Coruña, Spain
- Jiao Wang
  - Department of Pharmacy and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
  - Department of Computer Science and Information Technology, University of A Coruña, 15071, A Coruña, Spain
- Junyu Lu
  - Department of Pharmacy and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Cong Wu
  - Department of Pharmacy and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Mengqiao He
  - Department of Pharmacy and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Xingyun Liu
  - Department of Pharmacy and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
  - Department of Computer Science and Information Technology, University of A Coruña, 15071, A Coruña, Spain
- Rongrong Wu
  - Department of Pharmacy and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Jing Zhao
  - Department of Pharmacy and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Chaoying Zhan
  - Department of Pharmacy and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
- Dan Du
  - Advanced Mass Spectrometry Center, Research Core Facility, Frontiers Science Center for Disease-Related Molecular Network, West China Hospital/West China Medical School, Sichuan University, Chengdu, 610041, China
- Zhajun Zhan
  - College of Pharmaceutical Science, Zhejiang University of Technology, Hangzhou, 310014, China
- Rajeev K Singla
  - Department of Pharmacy and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
  - School of Pharmaceutical Sciences, Lovely Professional University, Phagwara, Punjab-144411, India
- Bairong Shen
  - Department of Pharmacy and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, Chengdu, 610212, China
8. Mullen KR, Tammen I, Matentzoglu NA, Mather M, Balhoff JP, Esdaile E, Leroy G, Park CA, Rando HM, Saklou NT, Webb TL, Vasilevsky NA, Mungall CJ, Haendel MA, Nicholas FW, Toro S. The Vertebrate Breed Ontology: Towards effective breed data standardization. ARXIV 2025:arXiv:2406.02623v2. [PMID: 38883236 PMCID: PMC11177956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/18/2024]
Abstract
Background – Limited universally-adopted data standards in veterinary medicine hinder data interoperability and therefore integration and comparison; this ultimately impedes the application of existing information-based tools to support advancement in diagnostics, treatments, and precision medicine. Hypothesis/Objectives – A single, coherent, logic-based standard for documenting breed names in health, production, and research-related records will improve data use capabilities in veterinary and comparative medicine. Animals – No live animals were used. Methods – The Vertebrate Breed Ontology (VBO) was created from breed names and related information compiled from the Food and Agriculture Organization of the United Nations, breed registries, communities, and experts, using manual and computational approaches. Each breed is represented by a VBO term that includes breed information and provenance as metadata. VBO terms are classified using description logic to allow computational applications and Artificial Intelligence-readiness. Results – VBO is an open, community-driven ontology representing over 19,500 livestock and companion animal breed concepts covering 49 species. Breeds are classified based on community and expert conventions (e.g., cattle breed) and supported by relations to the breed's genus and species indicated by National Center for Biotechnology Information (NCBI) Taxonomy terms. Relationships between VBO terms (e.g., relating breeds to their foundation stock) provide additional context to support advanced data analytics. VBO term metadata includes synonyms, breed identifiers/codes, and attributed cross-references to other databases. Conclusion and clinical importance – The adoption of VBO as a source of standard breed names in databases and veterinary electronic health records can enhance veterinary data interoperability and computability.
Affiliation(s)
- Kathleen R. Mullen
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Imke Tammen
  - Sydney School of Veterinary Science, The University of Sydney, Sydney, NSW, Australia
- Marius Mather
  - Sydney Informatics Hub, The University of Sydney, Sydney, NSW, Australia
- James P. Balhoff
  - Renaissance Computing Institute, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Elizabeth Esdaile
  - Veterinary Genetics Laboratory, School of Veterinary Medicine, University of California, Davis, CA, USA. Current affiliation is Department of Clinical Sciences, Colorado State University, Fort Collins, CO, USA
- Gregoire Leroy
  - Animal Production and Health Division, Food and Agriculture Organization of the United Nations, Rome, Italy
- Carissa A. Park
  - Department of Animal Science, Iowa State University, Ames, IA, USA
- Halie M. Rando
  - Department of Computer Science, Smith College, Northampton, MA, USA
- Nadia T. Saklou
  - Department of Clinical Sciences, Colorado State University, Fort Collins, CO, USA
- Tracy L. Webb
  - Department of Clinical Sciences, Colorado State University, Fort Collins, CO, USA
- Melissa A. Haendel
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Frank W. Nicholas
  - Sydney School of Veterinary Science, The University of Sydney, Sydney, NSW, Australia
- Sabrina Toro
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
9
Hu M, Alkhairy S, Lee I, Pillich RT, Fong D, Smith K, Bachelder R, Ideker T, Pratt D. Evaluation of large language models for discovery of gene set function. Nat Methods 2025; 22:82-91. [PMID: 39609565 PMCID: PMC11725441 DOI: 10.1038/s41592-024-02525-x] [Received: 08/16/2023] [Accepted: 10/21/2024] [Indexed: 11/30/2024]
Abstract
Gene set enrichment is a mainstay of functional genomics, but it relies on gene function databases that are incomplete. Here we evaluate five large language models (LLMs) for their ability to discover the common functions represented by a gene set, supported by molecular rationale and a self-confidence assessment. For curated gene sets from Gene Ontology, GPT-4 suggests functions similar to the curated name in 73% of cases, with higher self-confidence predicting higher similarity. Conversely, random gene sets correctly yield zero confidence in 87% of cases. Other LLMs (GPT-3.5, Gemini Pro, Mixtral Instruct and Llama2 70b) vary in function recovery but are falsely confident for random sets. In gene clusters from omics data, GPT-4 identifies common functions for 45% of cases, fewer than functional enrichment but with higher specificity and gene coverage. Manual review of supporting rationale and citations finds these functions are largely verifiable. These results position LLMs as valuable omics assistants.
Affiliation(s)
- Mengzhou Hu
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Sahar Alkhairy
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
- Ingoo Lee
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Rudolf T Pillich
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Dylan Fong
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Kevin Smith
  - Department of Physics, University of California San Diego, La Jolla, CA, USA
- Robin Bachelder
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Trey Ideker
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA
- Dexter Pratt
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
10
Reese JT, Chimirri L, Bridges Y, Danis D, Caufield JH, Wissink K, McMurry JA, Graefe ASL, Casiraghi E, Valentini G, Jacobsen JOB, Haendel M, Smedley D, Mungall CJ, Robinson PN. Systematic benchmarking demonstrates large language models have not reached the diagnostic accuracy of traditional rare-disease decision support tools. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.07.22.24310816. [PMID: 39108510 PMCID: PMC11302616 DOI: 10.1101/2024.07.22.24310816] [Indexed: 08/12/2024]
Abstract
Large language models (LLMs) show promise in supporting differential diagnosis, but their performance is challenging to evaluate due to the unstructured nature of their responses. To assess the current capabilities of LLMs to diagnose genetic diseases, we benchmarked these models on 5,213 case reports using the Phenopacket Schema, the Human Phenotype Ontology and Mondo disease ontology. Prompts generated from each phenopacket were sent to three generative pretrained transformer (GPT) models. The same phenopackets were used as input to a widely used diagnostic tool, Exomiser, in phenotype-only mode. The best LLM ranked the correct diagnosis first in 23.6% of cases, whereas Exomiser did so in 35.5% of cases. While the performance of LLMs for supporting differential diagnosis has been improving, it has not reached the level of commonly used traditional bioinformatics tools. Future research is needed to determine the best approach to incorporate LLMs into diagnostic pipelines.
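The headline comparison in this abstract (correct diagnosis ranked first in 23.6% of cases for the best LLM versus 35.5% for Exomiser) is a top-1 accuracy over ranked candidate lists. The Python below is a minimal sketch of how such a figure can be computed, not the authors' evaluation code. It assumes each case has been reduced to a ranked list of disease identifiers plus one correct identifier; the function name `top_k_accuracy` and the toy identifiers are hypothetical.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    correct_diagnoses: Sequence[str],
    k: int = 1,
) -> float:
    """Return the fraction of cases whose correct diagnosis is in the top k ranks."""
    if len(ranked_predictions) != len(correct_diagnoses):
        raise ValueError("need exactly one correct diagnosis per case")
    if not correct_diagnoses:
        raise ValueError("need at least one case")
    if k < 1:
        raise ValueError("k must be at least 1")
    # A case counts as a hit when the true identifier is among the first k entries.
    hits = sum(
        truth in ranking[:k]
        for ranking, truth in zip(ranked_predictions, correct_diagnoses)
    )
    return hits / len(correct_diagnoses)


# Toy data with hypothetical identifiers (not taken from the benchmark).
rankings = [
    ["DISEASE:A", "DISEASE:B", "DISEASE:C"],  # correct answer ranked first
    ["DISEASE:B", "DISEASE:A"],               # correct answer ranked second
    ["DISEASE:C", "DISEASE:D"],               # correct answer missing
    [],                                       # tool returned no candidates
]
truths = ["DISEASE:A", "DISEASE:A", "DISEASE:E", "DISEASE:B"]

top1 = top_k_accuracy(rankings, truths, k=1)
top3 = top_k_accuracy(rankings, truths, k=3)
```

Reporting the same metric at several values of `k` is one common way to compare rankers whose correct answers often sit just below first place; whether the benchmark reports other cutoffs is not stated in the abstract.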
Affiliation(s)
- Justin T Reese
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - Monarch Initiative
- Leonardo Chimirri
  - Monarch Initiative
  - Berlin Institute of Health at Charite Universitaetsmedizin Berlin, Berlin, Germany
- Yasemin Bridges
  - Monarch Initiative
  - William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK
- Daniel Danis
  - Monarch Initiative
  - Berlin Institute of Health at Charite Universitaetsmedizin Berlin, Berlin, Germany
- J Harry Caufield
  - Monarch Initiative
  - University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Kyran Wissink
  - Berlin Institute of Health at Charite Universitaetsmedizin Berlin, Berlin, Germany
- Julie A McMurry
  - Monarch Initiative
  - University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Adam SL Graefe
  - Berlin Institute of Health at Charite Universitaetsmedizin Berlin, Berlin, Germany
- Elena Casiraghi
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milano, Italy
  - ELLIS-European Laboratory for Learning and Intelligent Systems
- Giorgio Valentini
  - AnacletoLab, Dipartimento di Informatica, Università degli Studi di Milano, Milano, Italy
  - ELLIS-European Laboratory for Learning and Intelligent Systems
- Julius OB Jacobsen
  - Monarch Initiative
  - William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK
- Melissa Haendel
  - Monarch Initiative
  - University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Damian Smedley
  - Monarch Initiative
  - William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK
- Christopher J Mungall
  - Monarch Initiative
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Peter N Robinson
  - Monarch Initiative
  - Berlin Institute of Health at Charite Universitaetsmedizin Berlin, Berlin, Germany
  - ELLIS-European Laboratory for Learning and Intelligent Systems
  - The Jackson Institute for Genomic Medicine, 10 Discovery Drive, Farmington CT 06032, USA
11
Toro S, Anagnostopoulos AV, Bello SM, Blumberg K, Cameron R, Carmody L, Diehl AD, Dooley DM, Duncan WD, Fey P, Gaudet P, Harris NL, Joachimiak MP, Kiani L, Lubiana T, Munoz-Torres MC, O'Neil S, Osumi-Sutherland D, Puig-Barbe A, Reese JT, Reiser L, Robb SM, Ruemping T, Seager J, Sid E, Stefancsik R, Weber M, Wood V, Haendel MA, Mungall CJ. Dynamic Retrieval Augmented Generation of Ontologies using Artificial Intelligence (DRAGON-AI). J Biomed Semantics 2024; 15:19. [PMID: 39415214 PMCID: PMC11484368 DOI: 10.1186/s13326-024-00320-3] [Received: 06/03/2024] [Accepted: 09/08/2024] [Indexed: 10/18/2024]
Abstract
BACKGROUND Ontologies are fundamental components of informatics infrastructure in domains such as biomedical, environmental, and food sciences, representing consensus knowledge in an accurate and computable form. However, their construction and maintenance demand substantial resources and necessitate substantial collaboration between domain experts, curators, and ontology experts. We present Dynamic Retrieval Augmented Generation of Ontologies using AI (DRAGON-AI), an ontology generation method employing Large Language Models (LLMs) and Retrieval Augmented Generation (RAG). DRAGON-AI can generate textual and logical ontology components, drawing from existing knowledge in multiple ontologies and unstructured text sources. RESULTS We assessed performance of DRAGON-AI on de novo term construction across ten diverse ontologies, making use of extensive manual evaluation of results. Our method has high precision for relationship generation, but has slightly lower precision than from logic-based reasoning. Our method is also able to generate definitions deemed acceptable by expert evaluators, but these scored worse than human-authored definitions. Notably, evaluators with the highest level of confidence in a domain were better able to discern flaws in AI-generated definitions. We also demonstrated the ability of DRAGON-AI to incorporate natural language instructions in the form of GitHub issues. CONCLUSIONS These findings suggest DRAGON-AI's potential to substantially aid the manual ontology construction process. However, our results also underscore the importance of having expert curators and ontology editors drive the ontology generation process.
Affiliation(s)
- Sabrina Toro
  - University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Kai Blumberg
  - Department of Agriculture, Beltsville Human Nutrition Research Center, Beltsville, MD, USA
- Leigh Carmody
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA
- Petra Fey
  - Northwestern University, Evanston, IL, USA
- Pascale Gaudet
  - SIB Swiss Institute of Bioinformatics, Geneva, Switzerland
- Nomi L Harris
  - Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Leila Kiani
  - Independent Scientific Information Analyst, Philadelphia, USA
- Shawn O'Neil
  - University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
- Justin T Reese
  - Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Sofia Mc Robb
  - Stowers Institute for Medical Research, Kansas City, MO, USA
- Eric Sid
  - National Center for Advancing Translational Sciences, Bethesda, MD, USA
- Ray Stefancsik
  - European Bioinformatics Institute (EMBL-EBI), Hinxton, UK
- Magalie Weber
  - INRAE, French National Research Institute for Agriculture, Food and Environment, UR BIA, Nantes, France
12
Wu D, Yang J, Wang K. Exploring the reversal curse and other deductive logical reasoning in BERT and GPT-based large language models. PATTERNS (NEW YORK, N.Y.) 2024; 5:101030. [PMID: 39568650 PMCID: PMC11573886 DOI: 10.1016/j.patter.2024.101030] [Received: 03/04/2024] [Revised: 04/11/2024] [Accepted: 07/01/2024] [Indexed: 11/22/2024]
Abstract
The "Reversal Curse" describes the inability of autoregressive decoder large language models (LLMs) to deduce "B is A" from "A is B," assuming that B and A are distinct and can be uniquely identified from each other. This logical failure suggests limitations in using generative pretrained transformer (GPT) models for tasks like constructing knowledge graphs. Our study revealed that a bidirectional LLM, bidirectional encoder representations from transformers (BERT), does not suffer from this issue. To investigate further, we focused on more complex deductive reasoning by training encoder and decoder LLMs to perform union and intersection operations on sets. While both types of models managed tasks involving two sets, they struggled with operations involving three sets. Our findings underscore the differences between encoder and decoder models in handling logical reasoning. Thus, selecting BERT or GPT should depend on the task's specific needs, utilizing BERT's bidirectional context comprehension or GPT's sequence prediction strengths.
Affiliation(s)
- Da Wu
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Jingye Yang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Department of Mathematics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Kai Wang
  - Raymond G. Perelman Center for Cellular and Molecular Therapeutics, Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA
  - Department of Pathology and Laboratory Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
13
Niyonkuru E, Caufield JH, Carmody LC, Gargano MA, Toro S, Whetzel PL, Blau H, Gomez MS, Casiraghi E, Chimirri L, Reese JT, Valentini G, Haendel MA, Mungall CJ, Robinson PN. Leveraging Generative AI to Accelerate Biocuration of Medical Actions for Rare Disease. MEDRXIV : THE PREPRINT SERVER FOR HEALTH SCIENCES 2024:2024.08.22.24310814. [PMID: 39228707 PMCID: PMC11370550 DOI: 10.1101/2024.08.22.24310814] [Indexed: 09/05/2024]
Abstract
Structured representations of clinical data can support computational analysis of individuals and cohorts, and ontologies representing disease entities and phenotypic abnormalities are now commonly used for translational research. The Medical Action Ontology (MAxO) provides a computational representation of treatments and other actions taken for the clinical management of patients. Currently, manual biocuration is used to assign MAxO terms to rare diseases, enabling clinical management of rare diseases to be described computationally for use in clinical decision support and mechanism discovery. However, it is challenging to scale manual curation to comprehensively capture information about medical actions for the more than 10,000 rare diseases. We present AutoMAxO, a semi-automated workflow that leverages Large Language Models (LLMs) to streamline MAxO biocuration for rare diseases. AutoMAxO first uses LLMs to retrieve candidate curations from abstracts of relevant publications. Next, the candidate curations are matched to ontology terms from MAxO, Human Phenotype Ontology (HPO), and MONDO disease ontology via a combination of LLMs and post-processing techniques. Finally, the matched terms are presented in a structured form to a human curator for approval. We used this approach to process 4,918 unique medical abstracts and identified annotations for 21 rare genetic diseases; we extracted 18,631 candidate disease-treatment curations, 538 of which were confirmed and transferred to the MAxO annotation dataset. The results of this project underscore the potential of generative AI to accelerate precision medicine by enabling a robust and comprehensive curation of the primary literature to represent information about diseases and procedures in a structured fashion. Although we focused on MAxO in this project, similar approaches could be taken for other biomedical curation tasks.
Affiliation(s)
- Enock Niyonkuru
  - Trinity College, Hartford, CT, USA
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- J Harry Caufield
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Leigh C Carmody
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Michael A Gargano
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Sabrina Toro
  - Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Patricia L Whetzel
  - Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Hannah Blau
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
- Mauricio Soto Gomez
  - AnacletoLab, Computer Science Department, Dipartimento di Informatica, Università degli Studi di Milano, Milan, 20133, Italy
- Elena Casiraghi
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
  - AnacletoLab, Computer Science Department, Dipartimento di Informatica, Università degli Studi di Milano, Milan, 20133, Italy
- Leonardo Chimirri
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany
- Justin T Reese
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Giorgio Valentini
  - AnacletoLab, Computer Science Department, Dipartimento di Informatica, Università degli Studi di Milano, Milan, 20133, Italy
- Melissa A Haendel
  - Department of Genetics, The University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
- Christopher J Mungall
  - Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Peter N Robinson
  - The Jackson Laboratory for Genomic Medicine, Farmington, CT 06032, USA
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117 Berlin, Germany
14
Yuan H, Mancuso CA, Johnson K, Braasch I, Krishnan A. Computational strategies for cross-species knowledge transfer and translational biomedicine. ARXIV 2024:arXiv:2408.08503v1. [PMID: 39184546 PMCID: PMC11343225] [Indexed: 08/27/2024]
Abstract
Research organisms provide invaluable insights into human biology and diseases, serving as essential tools for functional experiments, disease modeling, and drug testing. However, evolutionary divergence between humans and research organisms hinders effective knowledge transfer across species. Here, we review state-of-the-art methods for computationally transferring knowledge across species, primarily focusing on methods that utilize transcriptome data and/or molecular networks. We introduce the term "agnology" to describe the functional equivalence of molecular components regardless of evolutionary origin, as this concept is becoming pervasive in integrative data-driven models where the role of evolutionary origin can become unclear. Our review addresses four key areas of information and knowledge transfer across species: (1) transferring disease and gene annotation knowledge, (2) identifying agnologous molecular components, (3) inferring equivalent perturbed genes or gene sets, and (4) identifying agnologous cell types. We conclude with an outlook on future directions and several key challenges that remain in cross-species knowledge transfer.
Affiliation(s)
- Hao Yuan
  - Genetics and Genome Science Program; Ecology, Evolution, and Behavior Program, Michigan State University
- Christopher A. Mancuso
  - Department of Biostatistics & Informatics, University of Colorado Anschutz Medical Campus
- Kayla Johnson
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus
- Ingo Braasch
  - Department of Integrative Biology; Genetics and Genome Science Program; Ecology, Evolution, and Behavior Program, Michigan State University
- Arjun Krishnan
  - Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus
15
Joachimiak MP, Caufield JH, Harris NL, Kim H, Mungall CJ. Gene Set Summarization Using Large Language Models. ARXIV 2024:arXiv:2305.13338v3. [PMID: 37292480 PMCID: PMC10246080] [Indexed: 06/10/2023]
Abstract
Molecular biologists frequently interpret gene lists derived from high-throughput experiments and computational analysis. This is typically done as a statistical enrichment analysis that measures the over- or under-representation of biological function terms associated with genes or their properties, based on curated assertions from a knowledge base (KB) such as the Gene Ontology (GO). Interpreting gene lists can also be framed as a textual summarization task, enabling Large Language Models (LLMs) to use scientific texts directly and avoid reliance on a KB. TALISMAN (Terminological ArtificiaL Intelligence SuMmarization of Annotation and Narratives) uses generative AI to perform gene set function summarization as a complement to standard enrichment analysis. This method can use different sources of gene functional information: (1) structured text derived from curated ontological KB annotations, (2) ontology-free narrative gene summaries, or (3) direct retrieval from the model. We demonstrate that these methods are able to generate plausible and biologically valid summary GO term lists for an input gene set. However, LLM-based approaches are unable to deliver reliable scores or p-values and often return terms that are not statistically significant. Crucially, in our experiments these methods were rarely able to recapitulate the most precise and informative term from standard enrichment analysis. We also observe minor differences depending on prompt input information, with GO term descriptions leading to higher recall but lower precision. However, newer LLM models perform statistically significantly better than the oldest model across all performance metrics, suggesting that future models may lead to further improvements. Overall, the results are nondeterministic, with minor variations in prompt resulting in radically different term lists, true to the stochastic nature of LLMs. 
Our results show that at this point, LLM-based methods are unsuitable as a replacement for standard term enrichment analysis; however, they may provide summarization benefits for implicit knowledge integration across extant but unstandardized knowledge, for large sets of features, and where the amount of information is difficult for humans to process.
Affiliation(s)
- Marcin P Joachimiak
  - Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- J Harry Caufield
  - Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Nomi L Harris
  - Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
- Christopher J Mungall
  - Biosystems Data Science Department, Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, Berkeley, CA 94720, USA
16
Gillon CJ, Baker C, Ly R, Balzani E, Brunton BW, Schottdorf M, Ghosh S, Dehghani N. Open Data In Neurophysiology: Advancements, Solutions & Challenges. ARXIV 2024:arXiv:2407.00976v1. [PMID: 39010879 PMCID: PMC11247910] [Indexed: 07/17/2024]
Abstract
Across the life sciences, an ongoing effort over the last 50 years has made data and methods more reproducible and transparent. This openness has led to transformative insights and vastly accelerated scientific progress [1,2]. For example, structural biology [3] and genomics [4,5] have undertaken systematic collection and publication of protein sequences and structures over the past half-century, and these data have led to scientific breakthroughs that were unthinkable when data collection first began (e.g., [6]). We believe that neuroscience is poised to follow the same path, and that principles of open data and open science will transform our understanding of the nervous system in ways that are impossible to predict at the moment. To this end, new social structures along with active and open scientific communities are essential [7] to facilitate and expand the still limited adoption of open science practices in our field [8]. Unified by shared values of openness, we set out to organize a symposium for Open Data in Neuroscience (ODIN) to strengthen our community and facilitate transformative neuroscience research at large. In this report, we share what we learned during this first ODIN event. We also lay out plans for how to grow this movement, document emerging conversations, and propose a path toward a better and more transparent science of tomorrow.
Affiliation(s)
- Colleen J Gillon
  - These authors contributed equally to this paper
  - Department of Bioengineering, Imperial College London, London, UK
- Cody Baker
  - These authors contributed equally to this paper
  - CatalystNeuro, Benicia, CA, USA
- Ryan Ly
  - These authors contributed equally to this paper
  - Scientific Data Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Edoardo Balzani
  - Center for Computational Neuroscience, Flatiron Institute, New York, NY, USA
- Bingni W Brunton
  - Department of Biology, University of Washington, Seattle, WA, USA
- Manuel Schottdorf
  - Princeton Neuroscience Institute, Princeton University, Princeton, NJ, USA
- Satrajit Ghosh
  - McGovern Institute for Brain Research, MIT, Cambridge, MA, USA
- Nima Dehghani
  - McGovern Institute for Brain Research, MIT, Cambridge, MA, USA
  - These authors contributed equally to this paper
17
Anderson W, Braun I, Bhatnagar R, Romero K, Walls R, Schito M, Podichetty JT. Unlocking the Capabilities of Large Language Models for Accelerating Drug Development. Clin Pharmacol Ther 2024; 116:38-41. [PMID: 38649744 DOI: 10.1002/cpt.3279] [Received: 01/11/2024] [Accepted: 03/27/2024] [Indexed: 04/25/2024]
Affiliation(s)
- Ian Braun
  - Critical Path Institute, Tucson, Arizona, USA