1
|
Li P, Shao B, Zhao G, Liu ZP. Negative sampling strategies impact the prediction of scale-free biomolecular network interactions with machine learning. BMC Biol 2025; 23:123. [PMID: 40346567 PMCID: PMC12065207 DOI: 10.1186/s12915-025-02231-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2024] [Accepted: 05/02/2025] [Indexed: 05/11/2025] Open
Abstract
BACKGROUND Understanding protein-molecular interaction is crucial for unraveling the mechanisms underlying diverse biological processes. Machine learning (ML) techniques have been extensively employed in predicting these interactions and have garnered substantial research focus. Previous studies have predominantly centered on improving model performance through novel and efficient ML approaches, often resulting in overoptimistic predictive estimates. However, these advancements frequently neglect the inherent biases stemming from network properties, particularly in biological contexts. RESULTS In this study, we examined the biases inherent in ML models during the learning and prediction of protein-molecular interactions, particularly those arising from the scale-free property of biological networks-a characteristic where in a few nodes have many connections while most have very few. Our comprehensive analysis across diverse tasks, datasets, and ML methods provides compelling evidence of these biases. We discovered that the training and evaluation of ML models are profoundly influenced by network topology, potentially distorting model performance assessments. To mitigate this issue, we propose the degree distribution balanced (DDB) sampling strategy, a straightforward yet potent approach that alleviates biases stemming from network properties. This method further underscores the limitations of certain ML models in learning protein-molecular interactions solely from intrinsic molecular features. CONCLUSIONS Our findings present a novel perspective for assessing the performance of ML models in inferring protein-molecular interactions with greater fairness. By addressing biases introduced by network properties, the DDB sampling approach provides a more balanced and precise assessment of model capabilities. These insights hold the potential to bolster the reliability of ML models in bioinformatics, fostering a more stringent evaluation framework for predicting protein-molecular interactions.
Collapse
Affiliation(s)
- Pengpai Li
- Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, 250061, Shandong, China
| | - Bowen Shao
- Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, 250061, Shandong, China
| | - Guoqing Zhao
- Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, 250061, Shandong, China
| | - Zhi-Ping Liu
- Department of Biomedical Engineering, School of Control Science and Engineering, Shandong University, Jinan, 250061, Shandong, China.
- National Center for Applied Mathematics, Shandong University, Jinan, 250100, Shandong, China.
| |
Collapse
|
2
|
Cao MY, Zainudin S, Daud KM. Feature fusion with attributed deepwalk for protein-protein interaction prediction. Sci Rep 2025; 15:12255. [PMID: 40210917 PMCID: PMC11985984 DOI: 10.1038/s41598-025-96510-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2024] [Accepted: 03/28/2025] [Indexed: 04/12/2025] Open
Abstract
protein-protein interactions (PPIs) are crucial for understanding cellular processes and disease mechanisms. While experimental methods for detecting PPIs exist, computational approaches offer a more efficient alternative. However, current computational methods often rely on single feature types or simple feature concatenation, potentially missing the complex nature of protein interactions. This study proposes FFADW (Feature Fusion Method with Attributed DeepWalk), a novel approach that integrates sequence and network features using a weighted fusion strategy controlled by an adjustable α parameter. Specifically, sequence similarity is computed using Levenshtein distance, while network similarity is measured via a Gaussian kernel-based approach. These complementary features are fused through the weighting mechanism before being processed by the Attributed DeepWalk algorithm, which enhances protein representations by learning low-dimensional embeddings. The fused representations are then used to train classifiers for PPI prediction. Evaluation across three datasets using multiple classifiers demonstrated that FFADW significantly improves sample clustering and performs better than existing approaches, with the XGBoost classifier showing the best results. The weighted fusion approach effectively combines different aspects of protein data while reducing noise and redundancy, offering an improved technique for computational PPI prediction.
Collapse
Affiliation(s)
- Mei-Yuan Cao
- Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, UKM, 43600, Bangi, Selangor, Malaysia.
| | - Suhaila Zainudin
- Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, UKM, 43600, Bangi, Selangor, Malaysia
| | - Kauthar Mohd Daud
- Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, UKM, 43600, Bangi, Selangor, Malaysia
| |
Collapse
|
3
|
Cao MY, Zainudin S, Daud KM. Protein features fusion using attributed network embedding for predicting protein-protein interaction. BMC Genomics 2024; 25:466. [PMID: 38741045 DOI: 10.1186/s12864-024-10361-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/10/2024] [Accepted: 04/29/2024] [Indexed: 05/16/2024] Open
Abstract
BACKGROUND Protein-protein interactions (PPIs) hold significant importance in biology, with precise PPI prediction as a pivotal factor in comprehending cellular processes and facilitating drug design. However, experimental determination of PPIs is laborious, time-consuming, and often constrained by technical limitations. METHODS We introduce a new node representation method based on initial information fusion, called FFANE, which amalgamates PPI networks and protein sequence data to enhance the precision of PPIs' prediction. A Gaussian kernel similarity matrix is initially established by leveraging protein structural resemblances. Concurrently, protein sequence similarities are gauged using the Levenshtein distance, enabling the capture of diverse protein attributes. Subsequently, to construct an initial information matrix, these two feature matrices are merged by employing weighted fusion to achieve an organic amalgamation of structural and sequence details. To gain a more profound understanding of the amalgamated features, a Stacked Autoencoder (SAE) is employed for encoding learning, thereby yielding more representative feature representations. Ultimately, classification models are trained to predict PPIs by using the well-learned fusion feature. RESULTS When employing 5-fold cross-validation experiments on SVM, our proposed method achieved average accuracies of 94.28%, 97.69%, and 84.05% in terms of Saccharomyces cerevisiae, Homo sapiens, and Helicobacter pylori datasets, respectively. CONCLUSION Experimental findings across various authentic datasets validate the efficacy and superiority of this fusion feature representation approach, underscoring its potential value in bioinformatics.
Collapse
Affiliation(s)
- Mei-Yuan Cao
- Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, 43600, Selangor, Malaysia.
| | - Suhaila Zainudin
- Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, 43600, Selangor, Malaysia
| | - Kauthar Mohd Daud
- Center for Artificial Intelligence Technology (CAIT), Faculty of Information Science and Technology, Universiti Kebangsaan Malaysia, Bangi, 43600, Selangor, Malaysia
| |
Collapse
|
4
|
Dang TH, Vu TA. xCAPT5: protein-protein interaction prediction using deep and wide multi-kernel pooling convolutional neural networks with protein language model. BMC Bioinformatics 2024; 25:106. [PMID: 38461247 PMCID: PMC10924985 DOI: 10.1186/s12859-024-05725-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2024] [Accepted: 02/28/2024] [Indexed: 03/11/2024] Open
Abstract
BACKGROUND Predicting protein-protein interactions (PPIs) from sequence data is a key challenge in computational biology. While various computational methods have been proposed, the utilization of sequence embeddings from protein language models, which contain diverse information, including structural, evolutionary, and functional aspects, has not been fully exploited. Additionally, there is a significant need for a comprehensive neural network capable of efficiently extracting these multifaceted representations. RESULTS Addressing this gap, we propose xCAPT5, a novel hybrid classifier that uniquely leverages the T5-XL-UniRef50 protein large language model for generating rich amino acid embeddings from protein sequences. The core of xCAPT5 is a multi-kernel deep convolutional siamese neural network, which effectively captures intricate interaction features at both micro and macro levels, integrated with the XGBoost algorithm, enhancing PPIs classification performance. By concatenating max and average pooling features in a depth-wise manner, xCAPT5 effectively learns crucial features with low computational cost. CONCLUSION This study represents one of the initial efforts to extract informative amino acid embeddings from a large protein language model using a deep and wide convolutional network. Experimental results show that xCAPT5 outperforms recent state-of-the-art methods in binary PPI prediction, excelling in cross-validation on several benchmark datasets and demonstrating robust generalization across intra-species, cross-species, inter-species, and stringent similarity contexts.
Collapse
Affiliation(s)
- Thanh Hai Dang
- Faculty of Information Technology, VNU University of Engineering and Technology, 144 Xuan Thuy, Hanoi, 10000, Vietnam.
| | - Tien Anh Vu
- Faculty of Biology, VNU University of Science, 334 Nguyen Trai, Hanoi, 10000, Vietnam
| |
Collapse
|
5
|
Ran Y, Xu XK, Jia T. The maximum capability of a topological feature in link prediction. PNAS NEXUS 2024; 3:pgae113. [PMID: 38528954 PMCID: PMC10962729 DOI: 10.1093/pnasnexus/pgae113] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/26/2023] [Accepted: 02/21/2024] [Indexed: 03/27/2024]
Abstract
Networks offer a powerful approach to modeling complex systems by representing the underlying set of pairwise interactions. Link prediction is the task that predicts links of a network that are not directly visible, with profound applications in biological, social, and other complex systems. Despite intensive utilization of the topological feature in this task, it is unclear to what extent a feature can be leveraged to infer missing links. Here, we aim to unveil the capability of a topological feature in link prediction by identifying its prediction performance upper bound. We introduce a theoretical framework that is compatible with different indexes to gauge the feature, different prediction approaches to utilize the feature, and different metrics to quantify the prediction performance. The maximum capability of a topological feature follows a simple yet theoretically validated expression, which only depends on the extent to which the feature is held in missing and nonexistent links. Because a family of indexes based on the same feature shares the same upper bound, the potential of all others can be estimated from one single index. Furthermore, a feature's capability is lifted in the supervised prediction, which can be mathematically quantified, allowing us to estimate the benefit of applying machine learning algorithms. The universality of the pattern uncovered is empirically verified by 550 structurally diverse networks. The findings have applications in feature and method selection, and shed light on network characteristics that make a topological feature effective in link prediction.
Collapse
Affiliation(s)
- Yijun Ran
- College of Computer and Information Science, Southwest University, Chongqing 400715, P.R. China
- Center for Computational Communication Research, Beijing Normal University, Zhuhai 519087, P.R. China
- School of Journalism and Communication, Beijing Normal University, Beijing 100875, P.R. China
| | - Xiao-Ke Xu
- Center for Computational Communication Research, Beijing Normal University, Zhuhai 519087, P.R. China
- School of Journalism and Communication, Beijing Normal University, Beijing 100875, P.R. China
| | - Tao Jia
- College of Computer and Information Science, Southwest University, Chongqing 400715, P.R. China
| |
Collapse
|
6
|
Lannelongue L, Inouye M. Pitfalls of machine learning models for protein-protein interaction networks. Bioinformatics 2024; 40:btae012. [PMID: 38200587 PMCID: PMC10868344 DOI: 10.1093/bioinformatics/btae012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2023] [Revised: 11/24/2023] [Accepted: 01/09/2024] [Indexed: 01/12/2024] Open
Abstract
MOTIVATION Protein-protein interactions (PPIs) are essential to understanding biological pathways as well as their roles in development and disease. Computational tools, based on classic machine learning, have been successful at predicting PPIs in silico, but the lack of consistent and reliable frameworks for this task has led to network models that are difficult to compare and discrepancies between algorithms that remain unexplained. RESULTS To better understand the underlying inference mechanisms that underpin these models, we designed an open-source framework for benchmarking that accounts for a range of biological and statistical pitfalls while facilitating reproducibility. We use it to shed light on the impact of network topology and how different algorithms deal with highly connected proteins. By studying functional genomics-based and sequence-based models on human PPIs, we show their complementarity as the former performs best on lone proteins while the latter specializes in interactions involving hubs. We also show that algorithm design has little impact on performance with functional genomic data. We replicate our results between both human and S. cerevisiae data and demonstrate that models using functional genomics are better suited to PPI prediction across species. With rapidly increasing amounts of sequence and functional genomics data, our study provides a principled foundation for future construction, comparison, and application of PPI networks. AVAILABILITY AND IMPLEMENTATION The code and data are available on GitHub: https://github.com/Llannelongue/B4PPI.
Collapse
Affiliation(s)
- Loïc Lannelongue
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, CB2 0BB Cambridge, United Kingdom
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, CB2 0BB Cambridge, United Kingdom
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, CB2 0BB Cambridge, United Kingdom
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, United Kingdom
| | - Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, CB2 0BB Cambridge, United Kingdom
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, CB2 0BB Cambridge, United Kingdom
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, CB2 0BB Cambridge, United Kingdom
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, United Kingdom
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, 3004 Victoria, Australia
- British Heart Foundation Centre of Research Excellence, University of Cambridge, CB2 0BB Cambridge, United Kingdom
| |
Collapse
|
7
|
Albu AI, Bocicor MI, Czibula G. MM-StackEns: A new deep multimodal stacked generalization approach for protein-protein interaction prediction. Comput Biol Med 2023; 153:106526. [PMID: 36623437 DOI: 10.1016/j.compbiomed.2022.106526] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/27/2022] [Revised: 12/13/2022] [Accepted: 12/31/2022] [Indexed: 01/05/2023]
Abstract
Accurate in-silico identification of protein-protein interactions (PPIs) is a long-standing problem in biology, with important implications in protein function prediction and drug design. Current computational approaches predominantly use a single data modality for describing protein pairs, which may not fully capture the characteristics relevant for identifying PPIs. Another limitation of existing methods is their poor generalization to proteins outside the training graph. In this paper, we aim to address these shortcomings by proposing a new ensemble approach for PPI prediction, which learns information from two modalities, corresponding to pairs of sequences and to the graph formed by the training proteins and their interactions. Our approach uses a siamese neural network to process sequence information, while graph attention networks are employed for the network view. For capturing the relationships between the proteins in a pair, we design a new feature fusion module, based on computing the distance between the distributions corresponding to the two proteins. The prediction is made using a stacked generalization procedure, in which the final classifier is represented by a Logistic Regression model trained on the scores predicted by the sequence and graph models. Additionally, we show that protein sequence embeddings obtained using pretrained language models can significantly improve the generalization of PPI methods. The experimental results demonstrate the good performance of our approach, which surpasses all the related work on two Yeast data sets, while outperforming the majority of literature approaches on two Human data sets and on independent multi-species data sets.
Collapse
Affiliation(s)
- Alexandra-Ioana Albu
- Department of Computer Science, Babeş-Bolyai University, 1 Mihail Kogalniceanu Street, Cluj-Napoca, 400084, Romania.
| | - Maria-Iuliana Bocicor
- Department of Computer Science, Babeş-Bolyai University, 1 Mihail Kogalniceanu Street, Cluj-Napoca, 400084, Romania.
| | - Gabriela Czibula
- Department of Computer Science, Babeş-Bolyai University, 1 Mihail Kogalniceanu Street, Cluj-Napoca, 400084, Romania.
| |
Collapse
|
8
|
Rogers JR, Nikolényi G, AlQuraishi M. Growing ecosystem of deep learning methods for modeling protein-protein interactions. Protein Eng Des Sel 2023; 36:gzad023. [PMID: 38102755 DOI: 10.1093/protein/gzad023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/10/2023] [Revised: 12/06/2023] [Accepted: 12/07/2023] [Indexed: 12/17/2023] Open
Abstract
Numerous cellular functions rely on protein-protein interactions. Efforts to comprehensively characterize them remain challenged however by the diversity of molecular recognition mechanisms employed within the proteome. Deep learning has emerged as a promising approach for tackling this problem by exploiting both experimental data and basic biophysical knowledge about protein interactions. Here, we review the growing ecosystem of deep learning methods for modeling protein interactions, highlighting the diversity of these biophysically informed models and their respective trade-offs. We discuss recent successes in using representation learning to capture complex features pertinent to predicting protein interactions and interaction sites, geometric deep learning to reason over protein structures and predict complex structures, and generative modeling to design de novo protein assemblies. We also outline some of the outstanding challenges and promising new directions. Opportunities abound to discover novel interactions, elucidate their physical mechanisms, and engineer binders to modulate their functions using deep learning and, ultimately, unravel how protein interactions orchestrate complex cellular behaviors.
Collapse
Affiliation(s)
- Julia R Rogers
- Department of Systems Biology, Columbia University, New York, NY 10032, USA
| | - Gergő Nikolényi
- Department of Systems Biology, Columbia University, New York, NY 10032, USA
| | | |
Collapse
|
9
|
Yue ZX, Yan TC, Xu HQ, Liu YH, Hong YF, Chen GX, Xie T, Tao L. A systematic review on the state-of-the-art strategies for protein representation. Comput Biol Med 2023; 152:106440. [PMID: 36543002 DOI: 10.1016/j.compbiomed.2022.106440] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2022] [Revised: 12/08/2022] [Accepted: 12/15/2022] [Indexed: 12/23/2022]
Abstract
The study of drug-target protein interaction is a key step in drug research. In recent years, machine learning techniques have become attractive for research, including drug research, due to their automated nature, predictive power, and expected efficiency. Protein representation is a key step in the study of drug-target protein interaction by machine learning, which plays a fundamental role in the ultimate accomplishment of accurate research. With the progress of machine learning, protein representation methods have gradually attracted attention and have consequently developed rapidly. Therefore, in this review, we systematically classify current protein representation methods, comprehensively review them, and discuss the latest advances of interest. According to the information extraction methods and information sources, these representation methods are generally divided into structure and sequence-based representation methods. Each primary class can be further divided into specific subcategories. As for the particular representation methods involve both traditional and the latest approaches. This review contains a comprehensive assessment of the various methods which researchers can use as a reference for their specific protein-related research requirements, including drug research.
Collapse
Affiliation(s)
- Zi-Xuan Yue
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Tian-Ci Yan
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Hong-Quan Xu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Yu-Hong Liu
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Yan-Feng Hong
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Gong-Xing Chen
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China
| | - Tian Xie
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China.
| | - Lin Tao
- Key Laboratory of Elemene Class Anti-cancer Chinese Medicines, School of Pharmacy, Hangzhou Normal University, Hangzhou, 311121, China.
| |
Collapse
|