1. Codicè F, Pancotti C, Rollo C, Moreau Y, Fariselli P, Raimondi D. The specification game: rethinking the evaluation of drug response prediction for precision oncology. J Cheminform 2025; 17:33. [PMID: 40087708 PMCID: PMC11907791 DOI: 10.1186/s13321-025-00972-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2024] [Accepted: 02/13/2025] [Indexed: 03/17/2025]
Abstract
Precision oncology plays a pivotal role in contemporary healthcare, aiming to optimize treatments for each patient based on their unique characteristics. This objective has spurred the emergence of various cancer cell line drug response datasets, driven by the need to facilitate pre-clinical studies by exploring the impact of multi-omics data on drug response. Despite the proliferation of machine learning models for Drug Response Prediction (DRP), their validation remains critical to reliably assess their usefulness for drug discovery and precision oncology, and their actual ability to generalize over the immense space of cancer cells and chemical compounds.
Scientific contribution
In this paper we show that the commonly used evaluation strategies for DRP methods can be easily fooled by commonly occurring dataset biases, and that they are therefore unable to truly measure the ability of DRP methods to generalize over drugs and cell lines ("specification gaming"). This problem hinders the development of reliable DRP methods and their application to experimental pipelines. Here we propose a new validation protocol composed of three Aggregation Strategies (Global, Fixed-Drug, and Fixed-Cell Line), integrated with three of the most commonly used train-test evaluation settings, to ensure a truly realistic assessment of the prediction performance. We also scrutinize the challenges associated with using IC50 as a prediction label, showing how its close correlation with the drug concentration ranges worsens the risk of misleading performance assessment, and we indicate an additional reason to replace it with the Area Under the Dose-Response Curve.
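The three aggregation strategies named in this abstract can be illustrated with a small sketch. It rests on assumptions that are not stated in the abstract: that "Global" scores all drug-cell line pairs together, that "Fixed-Drug" and "Fixed-Cell Line" score each drug (or each cell line) separately and then average, and that Pearson correlation is the metric. The function names `pearson` and `aggregate_score` and the toy records are hypothetical.

```python
from collections import defaultdict
from math import isnan, sqrt


def pearson(xs, ys):
    """Pearson correlation; returns NaN when either input is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    return sxy / sqrt(sxx * syy)


def aggregate_score(records, strategy):
    """Score (drug, cell_line, y_true, y_pred) tuples under one strategy.

    Assumed semantics: "global" pools every pair; "fixed_drug" and
    "fixed_cell_line" score each group on its own and average the results.
    """
    records = list(records)
    if strategy == "global":
        return pearson([r[2] for r in records], [r[3] for r in records])
    if strategy not in ("fixed_drug", "fixed_cell_line"):
        raise ValueError(f"unknown strategy: {strategy}")
    key_index = 0 if strategy == "fixed_drug" else 1
    groups = defaultdict(list)
    for r in records:
        groups[r[key_index]].append(r)
    scores = [pearson([r[2] for r in g], [r[3] for r in g]) for g in groups.values()]
    scores = [s for s in scores if not isnan(s)]  # skip groups with no variance
    return sum(scores) / len(scores) if scores else float("nan")


# Hypothetical toy data: predictions track each drug's typical response level
# but rank the cell lines wrongly within each drug.
records = [
    ("drugA", "c1", 1.0, 1.2), ("drugA", "c2", 1.2, 1.0), ("drugA", "c3", 1.1, 1.1),
    ("drugB", "c1", 5.0, 5.3), ("drugB", "c2", 5.3, 5.0), ("drugB", "c3", 5.1, 5.1),
]
for name in ("global", "fixed_drug", "fixed_cell_line"):
    print(name, round(aggregate_score(records, name), 3))
```

On this toy data the global score is close to 1 while the fixed-drug score is negative, which is the kind of gap between pooled and per-drug evaluation that the abstract describes as specification gaming.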
Affiliation(s)
- Francesco Codicè: Department of Medical Sciences, University of Torino, 10123, Torino, Italy
- Corrado Pancotti: Department of Medical Sciences, University of Torino, 10123, Torino, Italy
- Cesare Rollo: Department of Medical Sciences, University of Torino, 10123, Torino, Italy
- Yves Moreau: ESAT-STADIUS, KU Leuven, Leuven, 3001, Belgium
- Piero Fariselli: Department of Medical Sciences, University of Torino, 10123, Torino, Italy
- Daniele Raimondi: Institut de Génétique Moléculaire de Montpellier, Université de Montpellier, 34293, Montpellier, France

2. Iliadis D, De Baets B, Pahikkala T, Waegeman W. A comparison of embedding aggregation strategies in drug-target interaction prediction. BMC Bioinformatics 2024; 25:59. [PMID: 38321386 PMCID: PMC10845509 DOI: 10.1186/s12859-024-05684-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/25/2023] [Accepted: 01/30/2024] [Indexed: 02/08/2024]
Abstract
The prediction of interactions between novel drugs and biological targets is a vital step in the early stage of the drug discovery pipeline. Many deep learning approaches have been proposed over the last decade, with a substantial fraction of them sharing the same underlying two-branch architecture. Their distinction is limited to the use of different types of feature representations and branches (multi-layer perceptrons, convolutional neural networks, graph neural networks and transformers). In contrast, the strategy used to combine the outputs (embeddings) of the branches has remained mostly the same. The same general architecture has also been used extensively in the area of recommender systems, where the choice of an aggregation strategy is still an open question. In this work, we investigate the effectiveness of three different embedding aggregation strategies in the area of drug-target interaction (DTI) prediction. We formally define these strategies and prove their universal approximator capabilities. We then present experiments that compare the different strategies on benchmark datasets from the area of DTI prediction, showcasing conditions under which specific strategies could be the obvious choice.
Affiliation(s)
- Dimitrios Iliadis: Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000, Ghent, Belgium
- Bernard De Baets: Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000, Ghent, Belgium
- Tapio Pahikkala: Department of Computing, University of Turku, 20500, Turku, Finland
- Willem Waegeman: Department of Data Analysis and Mathematical Modelling, Ghent University, Coupure Links 653, 9000, Ghent, Belgium

3. Xie S, Xie X, Zhao X, Liu F, Wang Y, Ping J, Ji Z. HNSPPI: a hybrid computational model combing network and sequence information for predicting protein-protein interaction. Brief Bioinform 2023; 24:bbad261. [PMID: 37480553 DOI: 10.1093/bib/bbad261] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/05/2023] [Revised: 06/24/2023] [Accepted: 06/26/2023] [Indexed: 07/24/2023]
Abstract
Most life activities in organisms are regulated through protein complexes, which are mainly controlled via Protein-Protein Interactions (PPIs). Discovering new interactions between proteins and revealing their biological functions are of great significance for understanding the molecular mechanisms of biological processes and identifying potential targets in drug discovery. Current experimental methods capture only stable protein interactions, which leads to limited coverage. High cost and long turnaround times are further obvious shortcomings. In recent years, various computational methods have been successfully developed for predicting PPIs based only on protein homology, primary protein sequences or gene ontology information. Computational efficiency and data complexity are still the main bottlenecks for algorithm generalization. In this study, we propose a novel computational framework, HNSPPI, to predict PPIs. As a hybrid supervised learning model, HNSPPI comprehensively characterizes the intrinsic relationship between two proteins by integrating amino acid sequence information and the connection properties of the PPI network. The experimental results show that HNSPPI works very well on six benchmark datasets. Moreover, the comparison analysis showed that our model significantly outperforms five other existing algorithms. Finally, we used the HNSPPI model to explore the SARS-CoV-2-Human interaction system and found several potential regulations. In summary, HNSPPI is a promising model for predicting new protein interactions from known PPI data.
Affiliation(s)
- Shijie Xie: College of Artificial Intelligence, Nanjing Agricultural University, No. 1 Weigang Rd, Nanjing, Jiangsu 210095, China
- Xiaojun Xie: College of Artificial Intelligence, Nanjing Agricultural University, No. 1 Weigang Rd, Nanjing, Jiangsu 210095, China
- Xin Zhao: Department of Hepatobiliary Surgery, Beijing Chaoyang Hospital affiliated to Capital Medical University, Beijing 100020, China
- Fei Liu: Joint International Research Laboratory of Animal Health and Food Safety of Ministry of Education & Single Molecule Nanometry Laboratory (Sinmolab), Nanjing Agricultural University, Nanjing, Jiangsu 210095, China
- Yiming Wang: Key Laboratory of Biological Interactions and Crop Health, Department of Plant Pathology, Nanjing Agricultural University, 210095, Nanjing, China
- Jihui Ping: MOE International Joint Collaborative Research Laboratory for Animal Health and Food Safety & Jiangsu Engineering Laboratory of Animal Immunology, College of Veterinary Medicine, Nanjing Agricultural University, Nanjing, Jiangsu 210095, China
- Zhiwei Ji: College of Artificial Intelligence, Nanjing Agricultural University, No. 1 Weigang Rd, Nanjing, Jiangsu 210095, China

4. Jha K, Saha S, Karmakar S. Prediction of Protein-Protein Interactions Using Vision Transformer and Language Model. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:3215-3225. [PMID: 37027644 DOI: 10.1109/tcbb.2023.3248797] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
The knowledge of protein-protein interaction (PPI) helps us to understand proteins' functions, the causes and growth of several diseases, and can aid in designing new drugs. The majority of existing PPI research has relied mainly on sequence-based approaches. With the availability of multi-omics datasets (sequence, 3D structure) and advancements in deep learning techniques, it is feasible to develop a deep multi-modal framework that fuses the features learned from different sources of information to predict PPI. In this work, we propose a multi-modal approach utilizing protein sequence and 3D structure. To extract features from the 3D structure of proteins, we use a pre-trained vision transformer model that has been fine-tuned on the structural representation of proteins. The protein sequence is encoded into a feature vector using a pre-trained language model. The feature vectors extracted from the two modalities are fused and then fed to the neural network classifier to predict the protein interactions. To showcase the effectiveness of the proposed methodology, we conduct experiments on two popular PPI datasets, namely, the human dataset and the S. cerevisiae dataset. Our approach outperforms the existing methodologies to predict PPI, including multi-modal approaches. We also evaluate the contributions of each modality by designing uni-modal baselines. We perform experiments with three modalities as well, having gene ontology as the third modality.

5. Chen H, Cai Y, Ji C, Selvaraj G, Wei D, Wu H. AdaPPI: identification of novel protein functional modules via adaptive graph convolution networks in a protein-protein interaction network. Brief Bioinform 2023; 24:bbac523. [PMID: 36526282 DOI: 10.1093/bib/bbac523] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/16/2022] [Revised: 10/10/2022] [Accepted: 11/02/2022] [Indexed: 12/23/2022]
Abstract
Identifying unknown protein functional modules, such as protein complexes and biological pathways, from protein-protein interaction (PPI) networks, provides biologists with an opportunity to efficiently understand cellular function and organization. Finding complex nonlinear relationships in underlying functional modules may involve a long-chain of PPI and pose great challenges in a PPI network with an unevenly sparse and dense node distribution. To overcome these challenges, we propose AdaPPI, an adaptive convolution graph network in PPI networks to predict protein functional modules. We first suggest an attributed graph node presentation algorithm. It can effectively integrate protein gene ontology attributes and network topology, and adaptively aggregates low- or high-order graph structural information according to the node distribution by considering graph node smoothness. Based on the obtained node representations, core cliques and expansion algorithms are applied to find functional modules in PPI networks. Comprehensive performance evaluations and case studies indicate that the framework significantly outperforms state-of-the-art methods. We also presented potential functional modules based on their confidence.

6. Jha K, Saha S. Analyzing Effect of Multi-Modality in Predicting Protein-Protein Interactions. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:162-173. [PMID: 35259112 DOI: 10.1109/tcbb.2022.3157531] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Indexed: 05/04/2023]
Abstract
Nowadays, multiple sources of information about proteins are available, such as protein sequences, 3D structures, Gene Ontology (GO), etc. Most work on protein-protein interaction (PPI) identification has used this information, mainly sequence-based, but each source individually. New advances in deep learning techniques allow us to leverage multiple sources/modalities of proteins, which complement each other. Some recent works have shown that multi-modal PPI models perform better than uni-modal approaches. This paper aims to investigate whether the performance of multi-modal PPI models is always consistent or depends on other factors such as dataset distribution, algorithms used to learn features, etc. We have used three modalities for this study: protein sequence, 3D structure, and GO. Various techniques, including deep learning algorithms, are employed to extract features from multiple sources of proteins. These feature vectors from different modalities are then integrated in several combinations (bi-modal and tri-modal) to predict PPI. To conduct this study, we have used Human and S. cerevisiae PPI datasets. The obtained results demonstrate the potential of a multi-modal approach and deep learning techniques in predicting protein interactions. However, the predictive capability of a PPI model depends on the feature extraction methods as well. Also, increasing the number of modalities does not always ensure performance improvement. In this study, the PPI model integrating two modalities outperforms the designed uni-modal and tri-modal PPI models.

7. Zhong W, He C, Xiao C, Liu Y, Qin X, Yu Z. Long-distance dependency combined multi-hop graph neural networks for protein-protein interactions prediction. BMC Bioinformatics 2022; 23:521. [PMID: 36471248 PMCID: PMC9724439 DOI: 10.1186/s12859-022-05062-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2022] [Accepted: 11/16/2022] [Indexed: 12/10/2022]
Abstract
Background
Protein-protein interactions are widespread in biological systems and play an important role in cell biology. Since traditional laboratory-based methods have drawbacks, such as being time-consuming and costly, a large number of methods based on deep learning have emerged. However, these methods do not take into account the long-distance dependency information between every two amino acids in a sequence. In addition, most existing models based on graph neural networks only aggregate the first-order neighbors in the protein-protein interaction (PPI) network. Although multi-order neighbor information can be aggregated by increasing the number of layers of the neural network, this easily causes over-fitting. It is therefore necessary to design a network that can capture long-distance dependency information between amino acids in the sequence and can directly capture multi-order neighbor information in the protein-protein interaction network.
Results
In this study, we propose a multi-hop neural network (LDMGNN) model combining long-distance dependency information to predict multi-label protein-protein interactions. In the LDMGNN model, we design the protein amino acid sequence encoding (PAASE) module with the multi-head self-attention Transformer block to extract the features of amino acid sequences by calculating the interdependence between every two amino acids. We also expand the receptive field in space by constructing a two-hop protein-protein interaction (THPPI) network. We combine the PPI network and the THPPI network with amino acid sequence features respectively, then input them into two identical GIN blocks at the same time to obtain two embeddings. Next, the two embeddings are fused and input to the classifier to predict multi-label protein-protein interactions. Compared with other state-of-the-art methods, LDMGNN shows the best performance on both the SHS27k and SHS148k datasets. Ablation experiments show that the PAASE module and the construction of the THPPI network are feasible and effective.
Conclusions
In general terms, our proposed LDMGNN model has achieved satisfactory results in the prediction of multi-label protein-protein interactions.
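A two-hop network of the kind mentioned in this abstract can be sketched as follows. The abstract does not give the construction, so this assumes that a two-hop edge links a protein to every protein reachable in exactly two steps, excluding itself and its direct neighbors. The function name `two_hop_network` and the toy graph are hypothetical.

```python
def two_hop_network(adjacency):
    """Return, for each node, the set of nodes exactly two steps away.

    adjacency maps each node to the set of its direct neighbours.
    Assumption: direct neighbours and the node itself are excluded.
    """
    two_hop = {}
    for node, neighbours in adjacency.items():
        reachable = set()
        for nb in neighbours:
            # Collect the neighbours of each direct neighbour.
            reachable |= adjacency.get(nb, set())
        two_hop[node] = reachable - neighbours - {node}
    return two_hop


# Hypothetical toy PPI network: a simple chain A - B - C - D.
ppi = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
thppi = two_hop_network(ppi)
print(thppi)
```

In the chain above, each protein gains one two-hop partner (A with C, B with D), so a single message-passing layer on this second graph can see information that would otherwise need two layers on the original network.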
Affiliation(s)
- Wen Zhong: College of Science, University of Shanghai for Science and Technology, Jungong Road, Shanghai, 200093 China
- Changxiang He: College of Science, University of Shanghai for Science and Technology, Jungong Road, Shanghai, 200093 China
- Chen Xiao: College of Science, University of Shanghai for Science and Technology, Jungong Road, Shanghai, 200093 China
- Yuru Liu: College of Science, University of Shanghai for Science and Technology, Jungong Road, Shanghai, 200093 China
- Xiaofei Qin: School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Jungong Road, Shanghai, 200093 China
- Zhensheng Yu: College of Science, University of Shanghai for Science and Technology, Jungong Road, Shanghai, 200093 China

8. Xia S, Xia Y, Xiang C, Wang H, Wang C, He J, Shi G, Gu L. A virus–target host proteins recognition method based on integrated complexes data and seed extension. BMC Bioinformatics 2022; 23:256. [PMID: 35764916 PMCID: PMC9238269 DOI: 10.1186/s12859-022-04792-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2021] [Accepted: 06/14/2022] [Indexed: 11/10/2022]
Abstract
Background
Target drugs play an important role in the clinical treatment of virus diseases. Virus-encoded proteins are widely used as targets for target drugs. However, they cannot cope with the drug resistance caused by a mutated virus and ignore the importance of host proteins for virus replication. Some methods use interactions between viruses and their host proteins to predict potential virus–target host proteins, which are less susceptible to mutated viruses. However, these methods only consider the network topology between the virus and the host proteins, ignoring the influences of protein complexes. Therefore, we introduce protein complexes that are less susceptible to drug resistance of mutated viruses, which helps recognize the unknown virus–target host proteins and reduce the cost of disease treatment.
Results
Since protein complexes contain virus–target host proteins, it is reasonable to predict virus–target human proteins from the perspective of the protein complexes. We propose a coverage clustering-core-subsidiary protein complex recognition method named CCA-SE that integrates the known virus–target host proteins, the human protein–protein interaction network, and the known human protein complexes. The proposed method aims to obtain the potential unknown virus–target human host proteins. We list part of the targets after proving our results effectively in enrichment experiments.
Conclusions
Our proposed CCA-SE method consists of two parts: one is CCA, which is to recognize protein complexes, and the other is SE, which is to select seed nodes as the core of protein complexes by using seed expansion. The experimental results validate that CCA-SE achieves efficient recognition of the virus–target host proteins.

9. Raimondi D, Codicè F, Orlando G, Schymkowitz J, Rousseau F, Moreau Y. HPMPdb: a machine learning-ready database of protein molecular phenotypes associated to human missense variants. Curr Res Struct Biol 2022; 4:167-174. [PMID: 35669450 PMCID: PMC9166469 DOI: 10.1016/j.crstbi.2022.04.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2021] [Revised: 03/24/2022] [Accepted: 04/25/2022] [Indexed: 11/10/2022]
Abstract
Current human Single Amino acid Variants (SAVs) databases provide a link between SAVs and their effect on the carrier individual's phenotype, often dividing them into Deleterious/Neutral variants. This is a very coarse-grained description of the genotype-to-phenotype relationship because it relies on unrealistic assumptions, such as the perfect Mendelian behavior of each SAV, and considers only dichotomic phenotypes. Moreover, the link between the effect of a SAV on a protein (its molecular phenotype) and the individual phenotype is often very complex, because multiple levels of biological abstraction connect the protein-level and individual-level phenotypes. Here we present HPMPdb, a manually curated database containing human SAVs associated with a detailed description of the molecular phenotype they cause on the affected proteins. With particular regard to machine learning (ML), this database can be used to let researchers go beyond the existing Deleterious/Neutral prediction paradigm, allowing them to build molecular phenotype predictors instead. Our class labels describe in a succinct way the effects that each SAV has on 15 protein molecular phenotypes, such as protein-protein interaction, small molecules binding, function, post-translational modifications (PTMs), sub-cellular localization, mimetic PTM, folding and protein expression. Moreover, we provide researchers with all necessary means to reproducibly train and test their models on our database. The webserver and the data described in this paper are available at hpmp.esat.kuleuven.be.
- Current variant-effect predictors perform a coarse-grained modeling and rely on unrealistic assumptions.
- The link between the effect of a variant and the individual phenotype is complex.
- It would be more intuitive to predict the molecular phenotype that each variant causes on the carrier protein.
- HPMP is a manually curated database containing human variants associated with the molecular phenotype they cause on the affected proteins.
- We manually translated variants from Uniprot into 15 Machine Learning-ready labels describing the affected protein molecular phenotype.
- The goal of HPMP is to allow researchers to go beyond the existing variant-effect prediction paradigm and allow them to build molecular phenotype predictors instead.

10. Hu X, Feng C, Ling T, Chen M. Deep learning frameworks for protein–protein interaction prediction. Comput Struct Biotechnol J 2022; 20:3223-3233. [PMID: 35832624 PMCID: PMC9249595 DOI: 10.1016/j.csbj.2022.06.025] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 03/15/2022] [Revised: 05/27/2022] [Accepted: 06/12/2022] [Indexed: 11/26/2022]

11. Raimondi D, Corso M, Fariselli P, Moreau Y. From genotype to phenotype in Arabidopsis thaliana: in-silico genome interpretation predicts 288 phenotypes from sequencing data. Nucleic Acids Res 2021; 50:e16. [PMID: 34792168 PMCID: PMC8860592 DOI: 10.1093/nar/gkab1099] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/31/2021] [Revised: 10/06/2021] [Accepted: 10/22/2021] [Indexed: 01/09/2023]
Abstract
In many cases, the unprecedented availability of data provided by high-throughput sequencing has shifted the bottleneck from a data availability issue to a data interpretation issue. This has delayed the promised breakthroughs in genetics and precision medicine, as far as human genetics is concerned, and in phenotype prediction to improve plant adaptation to climate change and resistance to bioaggressors, as far as plant sciences are concerned. In this paper, we propose a novel Genome Interpretation paradigm, which aims at directly modeling the genotype-to-phenotype relationship, and we focus on A. thaliana since it is the best studied model organism in plant genetics. Our model, called Galiana, is the first end-to-end Neural Network (NN) approach following the genomes in/phenotypes out paradigm, and it is trained to predict 288 real-valued Arabidopsis thaliana phenotypes from Whole Genome sequencing data. We show that 75 of these phenotypes are predicted with a Pearson correlation ≥0.4, and are mostly related to flowering traits. We show that our end-to-end NN approach achieves better performance and larger phenotype coverage than models predicting single phenotypes from the GWAS-derived known associated genes. Galiana is also fully interpretable, thanks to the Saliency Maps gradient-based approaches. We followed this interpretation approach to identify 36 novel genes that are likely to be associated with flowering traits, finding evidence for 6 of them in the existing literature.
Affiliation(s)
- Massimiliano Corso: Institut Jean-Pierre Bourgin, Université Paris-Saclay, INRAE, AgroParisTech, 78000 Versailles, France
- Piero Fariselli: Department of Medical Sciences, University of Torino, 10123 Torino, Italy
- Yves Moreau: ESAT-STADIUS, KU Leuven, 3001 Leuven, Belgium