1
|
Chang YC, Huang MS, Huang YH, Lin YH. The influence of prompt engineering on large language models for protein-protein interaction identification in biomedical literature. Sci Rep 2025; 15:15493. [PMID: 40319086 PMCID: PMC12049485 DOI: 10.1038/s41598-025-99290-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2024] [Accepted: 04/18/2025] [Indexed: 05/07/2025] Open
Abstract
Identifying protein-protein interactions (PPIs) is a foundational task in biomedical natural language processing. While specialized models have been developed, the potential of general-domain large language models (LLMs) in PPI extraction, particularly for researchers without computational expertise, remains unexplored. This study evaluates the effectiveness of proprietary LLMs (GPT-3.5, GPT-4, and Google Gemini) in PPI prediction through systematic prompt engineering. We designed six prompting scenarios of increasing complexity, from basic interaction queries to sophisticated entity-tagged formats, and assessed model performance across multiple benchmark datasets (LLL, IEPA, HPRD50, AIMed, BioInfer, and PEDD). Carefully designed prompts effectively guided LLMs in PPI prediction. Gemini 1.5 Pro achieved the highest performance across most datasets, with notable F1-scores in LLL (90.3%), IEPA (68.2%), HPRD50 (67.5%), and PEDD (70.2%). GPT-4 showed competitive performance, particularly in the LLL dataset (87.3%). We identified and addressed a positive prediction bias, demonstrating improved performance after evaluation refinement. While not surpassing specialized models, general-purpose LLMs with appropriate prompting strategies can effectively perform PPI prediction tasks, offering valuable tools for biomedical researchers without extensive computational expertise.
Collapse
Affiliation(s)
- Yung-Chun Chang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan.
- Clinical Big Data Research Center, Taipei Medical University Hospital, Taipei, Taiwan.
| | - Ming-Siang Huang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
| | - Yi-Hsuan Huang
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
| | - Yi-Hsuan Lin
- Graduate Institute of Data Science, Taipei Medical University, Taipei, Taiwan
| |
Collapse
|
2
|
Halder A, Saha B, Roy M, Majumder S. A novel deep sequential learning architecture for drug drug interaction prediction using DDINet. Sci Rep 2025; 15:9337. [PMID: 40102542 PMCID: PMC11920219 DOI: 10.1038/s41598-025-93952-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/26/2024] [Accepted: 03/11/2025] [Indexed: 03/20/2025] Open
Abstract
Drug drug Interactions (DDI) present considerable challenges in healthcare, often resulting in adverse effects or decreased therapeutic efficacy. This article proposes a novel deep sequential learning architecture called DDINet to predict and classify DDIs between pairs of drugs based on different mechanisms viz., Excretion, Absorption, Metabolism, and Excretion rate (higher serum level) etc. Chemical features such as Hall Smart, Amino Acid count and Carbon types are extracted from each drug (pairs) to apply as an input to the proposed model. Proposed DDINet incorporates attention mechanism and deep sequential learning architectures, such as Long Short-Term Memory and gated recurrent unit. It utilizes the Rcpi toolkit to extract biochemical features of drugs from their chemical composition in Simplified Molecular-Input Line-Entry System format. Experiments are conducted on publicly available DDI datasets from DrugBank and Kaggle. The model's efficacy in predicting and classifying DDIs is evaluated using various performance measures. The experimental results show that DDINet outperformed eight counterpart techniques achieving [Formula: see text] overall accuracy which is also statistically confirmed by Confidence Interval tests and paired t-tests. This architecture may act as an effective computational technique for drug drug interaction with respect to mechanism which may act as a complementary tool to reduce costly wet lab experiments for DDI prediction and classification.
Collapse
Affiliation(s)
- Anindya Halder
- Department of Computer Application, School of Technology, North-Eastern Hill University, Tura Campus, Tura, Meghalaya, 794002, India.
| | - Biswanath Saha
- Department of Computer Application, School of Technology, North-Eastern Hill University, Tura Campus, Tura, Meghalaya, 794002, India.
| | - Moumita Roy
- Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal, 741235, India.
| | - Sukanta Majumder
- Department of Computer Science and Engineering, University of Kalyani, Kalyani, West Bengal, 741235, India.
| |
Collapse
|
3
|
Huang MS, Han JC, Lin PY, You YT, Tsai RTH, Hsu WL. Surveying biomedical relation extraction: a critical examination of current datasets and the proposal of a new resource. Brief Bioinform 2024; 25:bbae132. [PMID: 38609331 PMCID: PMC11014787 DOI: 10.1093/bib/bbae132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2023] [Revised: 11/06/2023] [Accepted: 03/02/2023] [Indexed: 04/14/2024] Open
Abstract
Natural language processing (NLP) has become an essential technique in various fields, offering a wide range of possibilities for analyzing data and developing diverse NLP tasks. In the biomedical domain, understanding the complex relationships between compounds and proteins is critical, especially in the context of signal transduction and biochemical pathways. Among these relationships, protein-protein interactions (PPIs) are of particular interest, given their potential to trigger a variety of biological reactions. To improve the ability to predict PPI events, we propose the protein event detection dataset (PEDD), which comprises 6823 abstracts, 39 488 sentences and 182 937 gene pairs. Our PEDD dataset has been utilized in the AI CUP Biomedical Paper Analysis competition, where systems are challenged to predict 12 different relation types. In this paper, we review the state-of-the-art relation extraction research and provide an overview of the PEDD's compilation process. Furthermore, we present the results of the PPI extraction competition and evaluate several language models' performances on the PEDD. This paper's outcomes will provide a valuable roadmap for future studies on protein event detection in NLP. By addressing this critical challenge, we hope to enable breakthroughs in drug discovery and enhance our understanding of the molecular mechanisms underlying various diseases.
Collapse
Affiliation(s)
- Ming-Siang Huang
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- National Institute of Cancer Research, National Health Research Institutes, Tainan, Taiwan
- Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
| | - Jen-Chieh Han
- Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
| | - Pei-Yen Lin
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
| | - Yu-Ting You
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
| | - Richard Tzong-Han Tsai
- Intelligent Information Service Research Laboratory, Department of Computer Science and Information Engineering, National Central University, Taoyuan, Taiwan
- Center for Geographic Information Science, Research Center for Humanities and Social Sciences, Academia Sinica, Taipei, Taiwan
| | - Wen-Lian Hsu
- Intelligent Agent Systems Laboratory, Department of Computer Science and Information Engineering, Asia University, New Taipei City, Taiwan
- Department of Computer Science and Information Engineering, College of Information and Electrical Engineering, Asia University, Taichung, Taiwan
| |
Collapse
|
4
|
Lee M. Recent Advances in Deep Learning for Protein-Protein Interaction Analysis: A Comprehensive Review. Molecules 2023; 28:5169. [PMID: 37446831 DOI: 10.3390/molecules28135169] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2023] [Revised: 06/30/2023] [Accepted: 06/30/2023] [Indexed: 07/15/2023] Open
Abstract
Deep learning, a potent branch of artificial intelligence, is steadily leaving its transformative imprint across multiple disciplines. Within computational biology, it is expediting progress in the understanding of Protein-Protein Interactions (PPIs), key components governing a wide array of biological functionalities. Hence, an in-depth exploration of PPIs is crucial for decoding the intricate biological system dynamics and unveiling potential avenues for therapeutic interventions. As the deployment of deep learning techniques in PPI analysis proliferates at an accelerated pace, there exists an immediate demand for an exhaustive review that encapsulates and critically assesses these novel developments. Addressing this requirement, this review offers a detailed analysis of the literature from 2021 to 2023, highlighting the cutting-edge deep learning methodologies harnessed for PPI analysis. Thus, this review stands as a crucial reference for researchers in the discipline, presenting an overview of the recent studies in the field. This consolidation helps elucidate the dynamic paradigm of PPI analysis, the evolution of deep learning techniques, and their interdependent dynamics. This scrutiny is expected to serve as a vital aid for researchers, both well-established and newcomers, assisting them in maneuvering the rapidly shifting terrain of deep learning applications in PPI analysis.
Collapse
Affiliation(s)
- Minhyeok Lee
- School of Electrical and Electronics Engineering, Chung-Ang University, Seoul 06974, Republic of Korea
| |
Collapse
|
5
|
Serna García G, Al Khalaf R, Invernici F, Ceri S, Bernasconi A. CoVEffect: interactive system for mining the effects of SARS-CoV-2 mutations and variants based on deep learning. Gigascience 2022; 12:giad036. [PMID: 37222749 PMCID: PMC10205000 DOI: 10.1093/gigascience/giad036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2022] [Revised: 04/11/2023] [Accepted: 04/27/2023] [Indexed: 05/25/2023] Open
Abstract
BACKGROUND Literature about SARS-CoV-2 widely discusses the effects of variations that have spread in the past 3 years. Such information is dispersed in the texts of several research articles, hindering the possibility of practically integrating it with related datasets (e.g., millions of SARS-CoV-2 sequences available to the community). We aim to fill this gap, by mining literature abstracts to extract-for each variant/mutation-its related effects (in epidemiological, immunological, clinical, or viral kinetics terms) with labeled higher/lower levels in relation to the nonmutated virus. RESULTS The proposed framework comprises (i) the provisioning of abstracts from a COVID-19-related big data corpus (CORD-19) and (ii) the identification of mutation/variant effects in abstracts using a GPT2-based prediction model. The above techniques enable the prediction of mutations/variants with their effects and levels in 2 distinct scenarios: (i) the batch annotation of the most relevant CORD-19 abstracts and (ii) the on-demand annotation of any user-selected CORD-19 abstract through the CoVEffect web application (http://gmql.eu/coveffect), which assists expert users with semiautomated data labeling. On the interface, users can inspect the predictions and correct them; user inputs can then extend the training dataset used by the prediction model. Our prototype model was trained through a carefully designed process, using a minimal and highly diversified pool of samples. CONCLUSIONS The CoVEffect interface serves for the assisted annotation of abstracts, allowing the download of curated datasets for further use in data integration or analysis pipelines. The overall framework can be adapted to resolve similar unstructured-to-structured text translation tasks, which are typical of biomedical domains.
Collapse
Affiliation(s)
- Giuseppe Serna García
- Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano Country: Italy, Italy
| | - Ruba Al Khalaf
- Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano Country: Italy, Italy
| | - Francesco Invernici
- Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano Country: Italy, Italy
| | - Stefano Ceri
- Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano Country: Italy, Italy
| | - Anna Bernasconi
- Dipartimento di Informazione, Elettronica e Bioingegneria, 20133 Milano Country: Italy, Italy
| |
Collapse
|
6
|
Serna Garcia G, Leone M, Bernasconi A, Carman MJ. GeMI: interactive interface for transformer-based Genomic Metadata Integration. Database (Oxford) 2022; 2022:6600540. [PMID: 35657113 PMCID: PMC9216561 DOI: 10.1093/database/baac036] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2022] [Revised: 03/26/2022] [Accepted: 04/26/2022] [Indexed: 11/15/2022]
Abstract
The Gene Expression Omnibus (GEO) is a public archive containing >4 million digital samples from functional genomics experiments collected over almost two decades. The accompanying metadata describing the experiments suffer from redundancy, inconsistency and incompleteness due to the prevalence of free text and the lack of well-defined data formats and their validation. To remedy this situation, we created Genomic Metadata Integration (GeMI; http://gmql.eu/gemi/), a web application that learns to automatically extract structured metadata (in the form of key-value pairs) from the plain text descriptions of GEO experiments. The extracted information can then be indexed for structured search and used for various downstream data mining activities. GeMI works in continuous interaction with its users. The natural language processing transformer-based model at the core of our system is a fine-tuned version of the Generative Pre-trained Transformer 2 (GPT2) model that is able to learn continuously from the feedback of the users thanks to an active learning framework designed for the purpose. As a part of such a framework, a machine learning interpretation mechanism (that exploits saliency maps) allows the users to understand easily and quickly whether the predictions of the model are correct and improves the overall usability. GeMI’s ability to extract attributes not explicitly mentioned (such as sex, tissue type, cell type, ethnicity and disease) allows researchers to perform specific queries and classification of experiments, which was previously possible only after spending time and resources with tedious manual annotation. The usefulness of GeMI is demonstrated on practical research use cases.
Database URL
http://gmql.eu/gemi/
Collapse
Affiliation(s)
- Giuseppe Serna Garcia
- Department of Electronics, Information, and Bioengineering, Politecnico di Milano , Via Ponzio 34/5, Milano 20133, Italy
| | - Michele Leone
- Department of Electronics, Information, and Bioengineering, Politecnico di Milano , Via Ponzio 34/5, Milano 20133, Italy
| | - Anna Bernasconi
- Department of Electronics, Information, and Bioengineering, Politecnico di Milano , Via Ponzio 34/5, Milano 20133, Italy
| | - Mark J Carman
- Department of Electronics, Information, and Bioengineering, Politecnico di Milano , Via Ponzio 34/5, Milano 20133, Italy
| |
Collapse
|
7
|
Vo TH, Nguyen NTK, Kha QH, Le NQK. On the road to explainable AI in drug-drug interactions prediction: A systematic review. Comput Struct Biotechnol J 2022; 20:2112-2123. [PMID: 35832629 PMCID: PMC9092071 DOI: 10.1016/j.csbj.2022.04.021] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2022] [Revised: 04/15/2022] [Accepted: 04/15/2022] [Indexed: 12/26/2022] Open
Abstract
Over the past decade, polypharmacy instances have been common in multi-diseases treatment. However, unwanted drug-drug interactions (DDIs) that might cause unexpected adverse drug events (ADEs) in multiple regimens therapy remain a significant issue. Since artificial intelligence (AI) is ubiquitous today, many AI prediction models have been developed to predict DDIs to support clinicians in pharmacotherapy-related decisions. However, even though DDI prediction models have great potential for assisting physicians in polypharmacy decisions, there are still concerns regarding the reliability of AI models due to their black-box nature. Building AI models with explainable mechanisms can augment their transparency to address the above issue. Explainable AI (XAI) promotes safety and clarity by showing how decisions are made in AI models, especially in critical tasks like DDI predictions. In this review, a comprehensive overview of AI-based DDI prediction, including the publicly available source for AI-DDIs studies, the methods used in data manipulation and feature preprocessing, the XAI mechanisms to promote trust of AI, especially for critical tasks as DDIs prediction, the modeling methods, is provided. Limitations and the future directions of XAI in DDIs are also discussed.
Collapse
Affiliation(s)
- Thanh Hoa Vo
- Master Program in Clinical Genomics and Proteomics, College of Pharmacy, Taipei Medical University, Taipei 110, Taiwan
| | - Ngan Thi Kim Nguyen
- School of Nutrition and Health Sciences, College of Nutrition, Taipei Medical University, Taipei 11031, Taiwan
| | - Quang Hien Kha
- International Master/Ph.D. Program in Medicine, College of Medicine, Taipei Medical University, Taipei 110, Taiwan
| | - Nguyen Quoc Khanh Le
- Professional Master Program in Artificial Intelligence in Medicine, College of Medicine, Taipei Medical University, Taipei 106, Taiwan
- Research Center for Artificial Intelligence in Medicine, Taipei Medical University, Taipei 106, Taiwan
- Translational Imaging Research Center, Taipei Medical University Hospital, Taipei 110, Taiwan
| |
Collapse
|