1
Castorina LV, Grazioli F, Machart P, Mösch A, Errica F. Assessing the generalization capabilities of TCR binding predictors via peptide distance analysis. PLoS One 2025; 20:e0324011. [PMID: 40392871 PMCID: PMC12091837 DOI: 10.1371/journal.pone.0324011] [Received: 03/20/2024] [Accepted: 04/19/2025] [Indexed: 05/22/2025]
Abstract
Understanding the interaction between T Cell Receptors (TCRs) and peptide-bound Major Histocompatibility Complexes (pMHCs) is crucial for comprehending immune responses and developing targeted immunotherapies. While recent machine learning (ML) models show remarkable success in predicting TCR-pMHC binding within training data, these models often fail to generalize to peptides outside their training distributions, raising concerns about their applicability in therapeutic settings. Understanding and improving the generalization of these models is therefore critical to ensure real-world applications. To address this issue, we evaluate the effect of the distance between training and testing peptide distributions on ML model empirical risk assessments, using sequence-based and 3D structure-based distance metrics. In our analysis we use several state-of-the-art models for TCR-peptide binding prediction: Attentive Variational Information Bottleneck (AVIB), NetTCR-2.0 and -2.2, and ERGO II (pre-trained autoencoder) and ERGO II (LSTM). In this work, we introduce a novel approach for assessing the generalization capabilities of TCR binding predictors: the Distance Split (DS) algorithm. The DS algorithm controls the distance between training and testing peptides based on both sequence and structure, allowing for a more nuanced evaluation of model performance. We show that lower 3D shape similarity between training and test peptides is associated with a harder out-of-distribution task definition, which is more interesting when measuring the ability to generalize to unseen peptides. However, we observe the opposite effect when splitting using sequence-based similarity. These findings highlight the importance of using a distance-based splitting approach to benchmark models. This could then be used to estimate a confidence score on predictions on novel and unseen peptides, based on how different they are from the training ones. 
Additionally, our results may hint that employing 3D shape to complement sequence information could improve the accuracy of TCR-pMHC binding predictors.
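The abstract does not spell out how the Distance Split algorithm works, so the following is only a minimal sketch of the general idea of a distance-controlled split, under stated assumptions: peptides are compared by plain Levenshtein edit distance (a stand-in for the paper's sequence-based metric), and the held-out set is simply the peptides with the largest mean distance to all others. The function names `levenshtein` and `distance_split` and the selection rule are hypothetical, not the authors' implementation.

```python
from typing import Dict, List, Tuple


def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def distance_split(peptides: List[str], n_test: int) -> Tuple[List[str], List[str]]:
    """Hold out the n_test peptides that are, on average, farthest from the rest.

    Hypothetical selection rule for illustration only; it is not the DS algorithm.
    """
    unique = sorted(set(peptides))
    if not 0 < n_test < len(unique):
        raise ValueError("n_test must be between 1 and the number of unique peptides - 1")
    mean_dist: Dict[str, float] = {
        p: sum(levenshtein(p, q) for q in unique if q != p) / (len(unique) - 1)
        for p in unique
    }
    # Sort by decreasing mean distance; ties are broken alphabetically.
    ranked = sorted(unique, key=lambda p: (-mean_dist[p], p))
    return ranked[n_test:], ranked[:n_test]


train, test = distance_split(["GILGFVFTL", "GILGFVFTV", "GLLGFVFTL", "NLVPMVATV"], n_test=1)
print("train:", train)
print("test:", test)
```

Holding out the most distant peptides gives one concrete way to make the test set out-of-distribution with respect to sequence; a structure-based variant would swap `levenshtein` for a 3D shape distance.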
Affiliation(s)
- Leonardo V. Castorina
- School of Informatics, University of Edinburgh, Edinburgh, United Kingdom
- NEC Laboratories Europe, Heidelberg, Germany
- Anja Mösch
- NEC Laboratories Europe, Heidelberg, Germany
2
Ullanat V, Jing B, Sledzieski S, Berger B. Learning the language of protein-protein interactions. bioRxiv 2025:2025.03.09.642188. [PMID: 40166198 PMCID: PMC11956943 DOI: 10.1101/2025.03.09.642188] [Indexed: 04/02/2025]
Abstract
Protein Language Models (PLMs) trained on large databases of protein sequences have proven effective in modeling protein biology across a wide range of applications. However, while PLMs excel at capturing individual protein properties, they face challenges in natively representing protein-protein interactions (PPIs), which are crucial to understanding cellular processes and disease mechanisms. Here, we introduce MINT, a PLM specifically designed to model sets of interacting proteins in a contextual and scalable manner. Using unsupervised training on a large curated PPI dataset derived from the STRING database, MINT outperforms existing PLMs in diverse tasks relating to protein-protein interactions, including binding affinity prediction and estimation of mutational effects. Beyond these core capabilities, it excels at modeling interactions in complex protein assemblies and surpasses specialized models in antibody-antigen modeling and T cell receptor-epitope binding prediction. MINT's predictions of mutational impacts on oncogenic PPIs align with experimental studies, and it provides reliable estimates for the potential for cross-neutralization of antibodies against SARS-CoV-2 variants of concern. These findings position MINT as a powerful tool for elucidating complex protein interactions, with significant implications for biomedical research and therapeutic discovery.
Affiliation(s)
- Varun Ullanat
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
- Bowen Jing
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
- Samuel Sledzieski
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
- Center for Computational Biology, Flatiron Institute, New York, NY
- Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
- Department of Mathematics, Massachusetts Institute of Technology, MA
3
Pham MDN, Su CTT, Nguyen TN, Nguyen HN, Nguyen DDA, Giang H, Nguyen DT, Phan MD, Nguyen V. epiTCR-KDA: knowledge distillation model on dihedral angles for TCR-peptide prediction. Bioinformatics Advances 2024; 4:vbae190. [PMID: 39678207 PMCID: PMC11646569 DOI: 10.1093/bioadv/vbae190] [Received: 06/19/2024] [Revised: 11/03/2024] [Accepted: 11/27/2024] [Indexed: 12/17/2024]
Abstract
Motivation: The prediction of the T-cell receptor (TCR) and antigen bindings is crucial for advancements in immunotherapy. However, most current TCR-peptide interaction predictors struggle to perform well on unseen data. This limitation may stem from the conventional use of TCR and/or peptide sequences as input, which may not adequately capture their structural characteristics. Therefore, incorporating the structural information of TCRs and peptides into the prediction model is necessary to improve its generalizability.
Results: We developed epiTCR-KDA (KDA stands for Knowledge Distillation model on Dihedral Angles), a new predictor of TCR-peptide binding that utilizes the dihedral angles between the residues of the peptide and the TCR as a structural descriptor. This structural information was integrated into a knowledge distillation model to enhance its generalizability. epiTCR-KDA demonstrated competitive prediction performance, with an area under the curve (AUC) of 1.00 for seen data and AUC of 0.91 for unseen data. On public datasets, epiTCR-KDA consistently outperformed other predictors, maintaining a median AUC of 0.93. Further analysis of epiTCR-KDA revealed that the cosine similarity of the dihedral angle vectors between the unseen testing data and training data is crucial for its stable performance. In conclusion, our epiTCR-KDA model represents a significant step forward in developing a highly effective pipeline for antigen-based immunotherapy.
Availability and implementation: epiTCR-KDA is available on GitHub (https://github.com/ddiem-ri-4D/epiTCR-KDA).
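The cosine similarity mentioned in this abstract can be illustrated with a short sketch. It assumes each sample is already encoded as a fixed-length numeric vector of dihedral angles; how epiTCR-KDA actually builds or normalizes those vectors is not stated here, so the example vectors and the helper names `cosine_similarity` and `max_similarity_to_training` are hypothetical.

```python
import math
from typing import List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def max_similarity_to_training(query: Sequence[float],
                               training: List[Sequence[float]]) -> float:
    """Highest cosine similarity between one unseen sample and any training sample."""
    return max(cosine_similarity(query, t) for t in training)


# Made-up angle vectors (radians), for illustration only.
training_vectors = [[-1.0, 2.4, -1.1, 2.3], [1.0, 0.5, 1.2, 0.4]]
unseen_vector = [-0.9, 2.5, -1.2, 2.2]
print(max_similarity_to_training(unseen_vector, training_vectors))
```

A value near 1 would indicate that the unseen sample closely resembles at least one training sample in this angle representation.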
Affiliation(s)
- My-Diem Nguyen Pham
- Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
- Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Dinh Duy An Nguyen
- Department of Genetics and Genomic Sciences, School of Medicine, Case Western Reserve University, Cleveland, Ohio, United States
- Hoa Giang
- Medical Genetics Institute, Ho Chi Minh City, Vietnam
- Dinh-Thuc Nguyen
- Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
- Minh-Duy Phan
- Medical Genetics Institute, Ho Chi Minh City, Vietnam
- NexCalibur Therapeutics, DE, United States
- Vy Nguyen
- Medical Genetics Institute, Ho Chi Minh City, Vietnam
4
Velez-Arce A, Li MM, Gao W, Lin X, Huang K, Fu T, Pentelute BL, Kellis M, Zitnik M. Signals in the Cells: Multimodal and Contextualized Machine Learning Foundations for Therapeutics. bioRxiv 2024:2024.06.12.598655. [PMID: 38948789 PMCID: PMC11212894 DOI: 10.1101/2024.06.12.598655] [Indexed: 07/02/2024]
Abstract
Drug discovery AI datasets and benchmarks have not traditionally included single-cell analysis biomarkers. While benchmarking efforts in single-cell analysis have recently released collections of single-cell tasks, they have yet to comprehensively release datasets, models, and benchmarks that integrate a broad range of therapeutic discovery tasks with cell-type-specific biomarkers. Therapeutics Commons (TDC-2) presents datasets, tools, models, and benchmarks integrating cell-type-specific contextual features with ML tasks across therapeutics. We present four tasks for contextual learning at single-cell resolution: drug-target nomination, genetic perturbation response prediction, chemical perturbation response prediction, and protein-peptide interaction prediction. We introduce datasets, models, and benchmarks for these four tasks. Finally, we detail the advancements and challenges in machine learning and biology that drove the implementation of TDC-2 and how they are reflected in its architecture, datasets and benchmarks, and foundation model tooling.
Affiliation(s)
- Michelle M. Li
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115
- Wenhao Gao
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139
- Xiang Lin
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02115
- Kexin Huang
- Department of Computer Science, Stanford School of Engineering, Stanford, CA 94305
- Tianfan Fu
- Department of Computational Science, Rensselaer Polytechnic Institute, Troy, NY 12180
- Bradley L. Pentelute
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139
- Manolis Kellis
- Broad Institute of MIT and Harvard, Computer Science and Artificial Intelligence Laboratory, MIT, Electrical Engineering and Computer Science Department, Massachusetts Institute of Technology, Cambridge, MA 02139
- Marinka Zitnik
- Broad Institute of MIT and Harvard, Harvard Data Science Initiative, Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Department of Biomedical Informatics, Harvard Medical School, Boston, MA 02215
5
Li F, Qian X, Zhu X, Lai X, Zhang X, Wang J. TCRcost: a deep learning model utilizing TCR 3D structure for enhanced of TCR-peptide binding. Front Genet 2024; 15:1346784. [PMID: 39415981 PMCID: PMC11479912 DOI: 10.3389/fgene.2024.1346784] [Received: 11/29/2023] [Accepted: 09/05/2024] [Indexed: 10/19/2024]
Abstract
Introduction: Predicting TCR-peptide binding is a complex and significant computational problem in systems immunology. During the past decade, a series of computational methods have been developed for better predicting TCR-peptide binding from amino acid sequences. However, the performance of sequence-based methods appears to have hit a bottleneck. Considering the 3D structures of TCR-peptide complexes, which provide much more information, could potentially lead to better prediction outcomes.
Methods: In this study, we developed TCRcost, a deep learning method, to predict TCR-peptide binding by incorporating 3D structures. TCRcost overcomes two significant challenges: acquiring a sufficient number of high-quality TCR-peptide structures and effectively extracting information from these structures for binding prediction. TCRcost corrects TCR 3D structures generated by protein structure tools, significantly extending the available datasets. The main and side chains of a TCR structure are separately corrected using a long short-term memory (LSTM) model. This approach prevents interference between the chains and accurately extracts interactions among both adjacent and global atoms. A 3D convolutional neural network (CNN) is designed to extract the atomic features relevant to TCR-peptide binding. The spatial features extracted by the 3DCNN are then processed through a fully connected layer to estimate the probability of TCR-peptide binding.
Results: Test results demonstrated that predicting TCR-peptide binding from 3D TCR structures is both efficient and highly accurate with an average accuracy of 0.974 on precise structures. Furthermore, the average accuracy on corrected structures was 0.762, significantly higher than the average accuracy of 0.375 on uncorrected original structures. Additionally, the average root mean square distance (RMSD) to precise structures was significantly reduced from 12.753 Å for predicted structures to 8.785 Å for corrected structures.
Discussion: Thus, utilizing structural information of TCR-peptide complexes is a promising approach to improve the accuracy of binding predictions.
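The RMSD figures quoted in the Results can be made concrete with a small sketch of the standard root-mean-square formula over paired atom coordinates. It assumes the two structures have the same atoms in the same order and are already expressed in a common frame; no superposition step (such as a Kabsch alignment) is performed, and whether TCRcost aligns structures before measuring is not stated in the abstract. The function name `rmsd` and the toy coordinates are illustrative.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def rmsd(coords_a: Sequence[Point], coords_b: Sequence[Point]) -> float:
    """Root-mean-square distance between paired 3D points (no alignment applied)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


# Toy coordinates in Å, made up for illustration.
reference = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
predicted = [(3.0, 4.0, 0.0), (4.5, 4.0, 0.0)]
print(rmsd(reference, predicted))  # every atom is displaced by 5 Å, so the RMSD is 5.0
```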
Affiliation(s)
- Fan Li
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China
- Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China
- Xinyang Qian
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China
- Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China
- Xiaoyan Zhu
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China
- Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China
- Xin Lai
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China
- Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China
- Xuanping Zhang
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China
- Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China
- Jiayin Wang
- School of Computer Science and Technology, Xi’an Jiaotong University, Xi’an, China
- Shaanxi Engineering Research Center of Medical and Health Big Data, Xi’an Jiaotong University, Xi’an, China
6
Deng Q, Wang Z, Xiang S, Wang Q, Liu Y, Hou T, Sun H. RLpMIEC: High-Affinity Peptide Generation Targeting Major Histocompatibility Complex-I Guided and Interpreted by Interaction Spectrum-Navigated Reinforcement Learning. J Chem Inf Model 2024; 64:6432-6449. [PMID: 39118363 DOI: 10.1021/acs.jcim.4c01153] [Indexed: 08/10/2024]
Abstract
Major histocompatibility complex (MHC) plays a vital role in presenting epitopes (short peptides from pathogenic proteins) to T-cell receptors (TCRs) to trigger the subsequent immune responses. Vaccine design targeting MHC generally aims to find epitopes with a high binding affinity for MHC presentation. Nevertheless, finding novel epitopes usually requires high-throughput screening of bulk peptide databases, which is time-consuming, labor-intensive, and expensive. The past several years have witnessed the great success of artificial intelligence (AI) in various fields, such as natural language processing (NLP, e.g., GPT-4) and protein structure prediction and engineering (e.g., AlphaFold2). Therefore, we propose a deep reinforcement-learning (RL)-based generative algorithm, RLpMIEC, to quantitatively design peptides targeting MHC-I systems. Specifically, RLpMIEC combines the energetic spectrum (namely, the molecular interaction energy component, MIEC) based on the peptide-MHC interaction and the sequence information to generate peptides with strong binding affinity and precise MIEC spectra to accelerate the discovery of candidate peptide vaccines. RLpMIEC performs well in all the generative capability evaluations and can generate peptides with strong binding affinities and precise MIECs with high interpretability, demonstrating its potential for accelerating peptide-based vaccine development.
Affiliation(s)
- Qirui Deng
- Department of Medicinal Chemistry, China Pharmaceutical University, Nanjing 210009 Jiangsu, P. R. China
- Zhe Wang
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058 Zhejiang, P. R. China
- School of Pharmacy, Hangzhou Normal University, Hangzhou 311121, P. R. China
- Sutong Xiang
- Department of Medicinal Chemistry, China Pharmaceutical University, Nanjing 210009 Jiangsu, P. R. China
- Qinghua Wang
- Department of Medicinal Chemistry, China Pharmaceutical University, Nanjing 210009 Jiangsu, P. R. China
- Yifei Liu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058 Zhejiang, P. R. China
- Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058 Zhejiang, P. R. China
- Huiyong Sun
- Department of Medicinal Chemistry, China Pharmaceutical University, Nanjing 210009 Jiangsu, P. R. China
7
Machaca V, Goyzueta V, Cruz MG, Sejje E, Pilco LM, López J, Túpac Y. Transformers meets neoantigen detection: a systematic literature review. J Integr Bioinform 2024; 21:jib-2023-0043. [PMID: 38960869 PMCID: PMC11377031 DOI: 10.1515/jib-2023-0043] [Received: 10/24/2023] [Accepted: 03/20/2024] [Indexed: 07/05/2024]
Abstract
Cancer immunology offers a new alternative to traditional cancer treatments, such as radiotherapy and chemotherapy. One notable alternative is the development of personalized vaccines based on cancer neoantigens. Moreover, Transformers are considered a revolutionary development in artificial intelligence with a significant impact on natural language processing (NLP) tasks and have been utilized in proteomics studies in recent years. In this context, we conducted a systematic literature review to investigate how Transformers are applied in each stage of the neoantigen detection process. Additionally, we mapped current pipelines and examined the results of clinical trials involving cancer vaccines.
Affiliation(s)
- Erika Sejje
- Universidad Nacional de San Agustín, Arequipa, Perú
- Yván Túpac
- Universidad Católica San Pablo, Arequipa, Perú
8
Mösch A, Grazioli F, Machart P, Malone B. NeoAgDT: optimization of personal neoantigen vaccine composition by digital twin simulation of a cancer cell population. Bioinformatics 2024; 40:btae205. [PMID: 38614133 PMCID: PMC11076149 DOI: 10.1093/bioinformatics/btae205] [Received: 07/07/2023] [Revised: 03/28/2024] [Accepted: 04/11/2024] [Indexed: 04/15/2024]
Abstract
Motivation: Neoantigen vaccines make use of tumor-specific mutations to enable the patient's immune system to recognize and eliminate cancer. Selecting vaccine elements, however, is a complex task which needs to take into account not only the underlying antigen presentation pathway but also tumor heterogeneity.
Results: Here, we present NeoAgDT, a two-step approach consisting of: (i) simulating individual cancer cells to create a digital twin of the patient's tumor cell population and (ii) optimizing the vaccine composition by integer linear programming based on this digital twin. NeoAgDT shows improved selection of experimentally validated neoantigens over ranking-based approaches in a study of seven patients.
Availability and implementation: The NeoAgDT code is published on GitHub: https://github.com/nec-research/neoagdt.
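To illustrate why optimizing over a simulated cell population can differ from ranking candidates individually, here is a toy sketch. It is not the NeoAgDT formulation: instead of integer linear programming it uses exhaustive search over all subsets of a fixed size, and the objective (number of simulated cells presenting at least one selected neoantigen), the cell data, and the function name `best_selection` are assumptions made for the example.

```python
from itertools import combinations
from typing import List, Sequence, Set, Tuple


def best_selection(cells: List[Set[str]],
                   candidates: Sequence[str],
                   k: int) -> Tuple[Tuple[str, ...], int]:
    """Pick k candidates that cover the most simulated cells (brute force, not ILP)."""
    best: Tuple[str, ...] = ()
    best_cov = -1
    for selection in combinations(sorted(candidates), k):
        chosen = set(selection)
        # A cell counts as covered if it presents at least one chosen neoantigen.
        covered = sum(1 for presented in cells if presented & chosen)
        if covered > best_cov:
            best, best_cov = selection, covered
    return best, best_cov


# Hypothetical digital twin: three cells present A and B, two cells present only C.
cells = [{"A", "B"}, {"A", "B"}, {"A", "B"}, {"C"}, {"C"}]
selection, covered = best_selection(cells, ["A", "B", "C"], k=2)
print(selection, covered)
```

In this toy population, ranking by individual frequency would favor A and B (three cells each), which together cover only three cells, whereas the joint search returns a pair that covers all five. Exhaustive search scales poorly with the number of candidates, which is one reason an ILP solver is a natural fit for the real problem.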
Affiliation(s)
- Anja Mösch
- Biomedical AI Group, NEC Laboratories Europe GmbH, Heidelberg 69115, Germany
- Filippo Grazioli
- Biomedical AI Group, NEC Laboratories Europe GmbH, Heidelberg 69115, Germany
- Pierre Machart
- Biomedical AI Group, NEC Laboratories Europe GmbH, Heidelberg 69115, Germany
- Brandon Malone
- Biomedical AI Group, NEC Laboratories Europe GmbH, Heidelberg 69115, Germany
9
Bravi B. Development and use of machine learning algorithms in vaccine target selection. NPJ Vaccines 2024; 9:15. [PMID: 38242890 PMCID: PMC10798987 DOI: 10.1038/s41541-023-00795-8] [Received: 08/04/2023] [Accepted: 12/07/2023] [Indexed: 01/21/2024]
Abstract
Computer-aided discovery of vaccine targets has become a cornerstone of rational vaccine design. In this article, I discuss how Machine Learning (ML) can inform and guide key computational steps in rational vaccine design concerned with the identification of B and T cell epitopes and correlates of protection. I provide examples of ML models, as well as types of data and predictions for which they are built. I argue that interpretable ML has the potential to improve the identification of immunogens also as a tool for scientific discovery, by helping elucidate the molecular processes underlying vaccine-induced immune responses. I outline the limitations and challenges in terms of data availability and method development that need to be addressed to bridge the gap between advances in ML predictions and their translational application to vaccine design.
Affiliation(s)
- Barbara Bravi
- Department of Mathematics, Imperial College London, London, SW7 2AZ, UK.