1
Ünlü A, Ulusoy E, Yiğit MG, Darcan M, Doğan T. Protein language models for predicting drug-target interactions: Novel approaches, emerging methods, and future directions. Curr Opin Struct Biol 2025; 91:103017. [PMID: 39985946 DOI: 10.1016/j.sbi.2025.103017] [Received: 10/23/2024] [Revised: 01/28/2025] [Accepted: 01/29/2025] [Indexed: 02/24/2025]
Abstract
Identifying new drug candidates remains a critical and complex challenge in drug development. Recent advances in deep learning have demonstrated significant potential to accelerate this process, particularly through the use of protein language models (pLMs). These models aim to effectively capture the structural and functional properties of proteins by embedding them in high-dimensional spaces, thereby providing powerful tools for predictive tasks. This review examines the application of pLMs in drug-target interaction (DTI) prediction, addressing both small-molecule and protein-based therapeutics. We explore diverse methodologies, including end-to-end learning models and those that leverage pre-trained foundational pLMs. Furthermore, we highlight the role of heterogeneous data integration, ranging from protein structures to knowledge graphs, to improve the accuracy of DTI predictions. Despite notable progress, challenges persist in accurately identifying DTIs, mainly due to data-related limitations and algorithmic constraints. Future research directions include utilising multimodal learning approaches, incorporating temporal/dynamic interaction data into training, and employing novel deep learning architectures to refine protein representations, gain a deeper understanding of biological context regarding molecular interactions, and, thus, advance the DTI prediction field.
Affiliation(s)
- Atabey Ünlü
  - Biological Data Science Lab, Dept. of Computer Engineering, Hacettepe University, 06800, Ankara, Türkiye; Dept. of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, 06800, Ankara, Türkiye
- Erva Ulusoy
  - Biological Data Science Lab, Dept. of Computer Engineering, Hacettepe University, 06800, Ankara, Türkiye; Dept. of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, 06800, Ankara, Türkiye
- Melih Gökay Yiğit
  - Biological Data Science Lab, Dept. of Computer Engineering, Hacettepe University, 06800, Ankara, Türkiye; Dept. of Computer Engineering, Middle East Technical University, 06800, Ankara, Türkiye
- Melih Darcan
  - Biological Data Science Lab, Dept. of Computer Engineering, Hacettepe University, 06800, Ankara, Türkiye
- Tunca Doğan
  - Biological Data Science Lab, Dept. of Computer Engineering, Hacettepe University, 06800, Ankara, Türkiye; Dept. of Bioinformatics, Graduate School of Health Sciences, Hacettepe University, 06800, Ankara, Türkiye; Dept. of Health Informatics, Institute of Informatics, Hacettepe University, 06800, Ankara, Türkiye

2
Dewaker V, Morya VK, Kim YH, Park ST, Kim HS, Koh YH. Revolutionizing oncology: the role of Artificial Intelligence (AI) as an antibody design, and optimization tools. Biomark Res 2025; 13:52. [PMID: 40155973 PMCID: PMC11954232 DOI: 10.1186/s40364-025-00764-4] [Received: 02/25/2025] [Accepted: 03/13/2025] [Indexed: 04/01/2025] [Open Access]
Abstract
Antibodies play a crucial role in defending the human body against diseases, including life-threatening conditions like cancer. They mediate immune responses against foreign antigens and, in some cases, self-antigens. Over time, antibody-based technologies have evolved from monoclonal antibodies (mAbs) to chimeric antigen receptor T cells (CAR-T cells), significantly impacting biotechnology, diagnostics, and therapeutics. Although these advancements have enhanced therapeutic interventions, the integration of artificial intelligence (AI) is revolutionizing antibody design and optimization. This review explores recent AI advancements, including large language models (LLMs), diffusion models, and generative AI-based applications, which have transformed antibody discovery by accelerating de novo generation, enhancing immune response precision, and optimizing therapeutic efficacy. Through advanced data analysis, AI enables the prediction and design of antibody sequences, 3D structures, complementarity-determining regions (CDRs), paratopes, epitopes, and antigen-antibody interactions. These AI-powered innovations address longstanding challenges in antibody development, significantly improving speed, specificity, and accuracy in therapeutic design. By integrating computational advancements with biomedical applications, AI is driving next-generation cancer therapies, transforming precision medicine, and enhancing patient outcomes.
Affiliation(s)
- Varun Dewaker
  - Institute of New Frontier Research Team, Hallym University, Chuncheon-Si, Gangwon-Do, 24252, Republic of Korea
- Vivek Kumar Morya
  - Department of Orthopedic Surgery, Hallym University Dongtan Sacred Hospital, Hwaseong-Si, 18450, Republic of Korea
- Yoo Hee Kim
  - Department of Biomedical Gerontology, Ilsong Institute of Life Science, Hallym University, Seoul, 07247, Republic of Korea
- Sung Taek Park
  - Institute of New Frontier Research Team, Hallym University, Chuncheon-Si, Gangwon-Do, 24252, Republic of Korea
  - Department of Obstetrics and Gynecology, Kangnam Sacred-Heart Hospital, Hallym University Medical Center, Hallym University College of Medicine, Seoul, 07441, Republic of Korea
  - EIONCELL Inc, Chuncheon-Si, 24252, Republic of Korea
- Hyeong Su Kim
  - Institute of New Frontier Research Team, Hallym University, Chuncheon-Si, Gangwon-Do, 24252, Republic of Korea
  - Department of Internal Medicine, Division of Hemato-Oncology, Kangnam Sacred-Heart Hospital, Hallym University Medical Center, Hallym University College of Medicine, Seoul, 07441, Republic of Korea
  - EIONCELL Inc, Chuncheon-Si, 24252, Republic of Korea
- Young Ho Koh
  - Department of Biomedical Gerontology, Ilsong Institute of Life Science, Hallym University, Seoul, 07247, Republic of Korea

3
Wang M, Kluger Y, Kleinstein SH, Gabernet G. AMULETY: A Python package to embed adaptive immune receptor sequences. bioRxiv 2025:2025.03.21.644583. [PMID: 40196678 PMCID: PMC11974677 DOI: 10.1101/2025.03.21.644583] [Indexed: 04/09/2025]
Abstract
Large language models have been developed to capture relevant features of adaptive immune receptors, each with unique potential applications. However, the diversity in available models presents challenges in accessibility and usability for downstream applications. Here we present AMULETY (Adaptive imMUne receptor Language model Embedding Tool), a Python-based software package to generate language model embeddings for adaptive immune receptor sequences, enabling users to leverage the strengths of different models without the need for complex configuration. AMULETY offers functions for embedding adaptive immune receptor amino acid sequences using pre-trained protein or antibody language models for paired heavy and light chain or single chain sequences. We showcase the variability of the embedding space across several embeddings on a dataset of antibody binders to several SARS-CoV-2 epitopes and show that different models may be effective at capturing different aspects of the distinctions between epitope groups. AMULETY is available under the GPL v3 license from https://github.com/immcantation/amulety or via pip from the Python Package Index (PyPI) at https://pypi.org/project/amulety/.
Affiliation(s)
- Meng Wang
  - Program in Computational Biology and Biomedical Informatics, Yale University, New Haven, CT, USA
- Yuval Kluger
  - Program in Computational Biology and Biomedical Informatics, Yale University, New Haven, CT, USA
  - Department of Pathology, Yale School of Medicine, New Haven, CT, USA
  - Applied Mathematics Program, Yale University, New Haven, CT, USA
- Steven H Kleinstein
  - Program in Computational Biology and Biomedical Informatics, Yale University, New Haven, CT, USA
  - Department of Pathology, Yale School of Medicine, New Haven, CT, USA
  - Department of Immunobiology, Yale School of Medicine, New Haven, CT, USA
- Gisela Gabernet
  - Department of Pathology, Yale School of Medicine, New Haven, CT, USA

4
Parkinson J, Hard R, Ko YS, Wang W. RESP2: An uncertainty aware multi-target multi-property optimization AI pipeline for antibody discovery. bioRxiv 2025:2024.07.30.605700. [PMID: 39131296 PMCID: PMC11312550 DOI: 10.1101/2024.07.30.605700] [Indexed: 08/13/2024]
Abstract
Discovery of therapeutic antibodies against infectious disease pathogens presents distinct challenges. Ideal candidates must possess not only the properties required for any therapeutic antibody (e.g. specificity, low immunogenicity) but also high affinity to many mutants of the target antigen. Here we present RESP2, an enhanced version of our RESP pipeline, designed for the discovery of antibodies against one or multiple antigens with simultaneously optimized developability properties. We first evaluate this pipeline in silico using the Absolut! database of scores for antibodies docked to target antigens. We show that RESP2 consistently identifies sequences that bind more tightly to a group of target antigens than any sequence present in the training set with success rates >= 85%. Popular generative AI techniques evaluated on the same datasets achieve success rates of 1.5% or less by comparison. Next we use the receptor binding domain (RBD) of the COVID-19 spike protein as a case study, and discover a highly human antibody with broad (mid to high-affinity) binding to at least 8 different variants of the RBD. These results illustrate the advantages of this pipeline for antibody discovery against a challenging target. A Python package that enables users to utilize the RESP pipeline on their own targets is available at https://github.com/Wang-lab-UCSD/RESP2, together with code needed to reproduce the experiments in this paper.
Affiliation(s)
- Jonathan Parkinson
  - Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359
  - MAP Bioscience, La Jolla, CA 92093
- Ryan Hard
  - Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359
- Young Su Ko
  - Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359
- Wei Wang
  - Department of Chemistry and Biochemistry, University of California, San Diego, La Jolla, CA 92093-0359
  - Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA 92093-0359

5
Wang M, Patsenker J, Li H, Kluger Y, Kleinstein SH. Supervised fine-tuning of pre-trained antibody language models improves antigen specificity prediction. PLoS Comput Biol 2025; 21:e1012153. [PMID: 40163503 PMCID: PMC12013870 DOI: 10.1371/journal.pcbi.1012153] [Received: 05/12/2024] [Revised: 04/22/2025] [Accepted: 03/04/2025] [Indexed: 04/02/2025] [Open Access]
Abstract
Antibodies play a crucial role in the adaptive immune response, with their specificity to antigens being a fundamental determinant of immune function. Accurate prediction of antibody-antigen specificity is vital for understanding immune responses, guiding vaccine design, and developing antibody-based therapeutics. In this study, we present a method of supervised fine-tuning for antibody language models, which improves on pre-trained antibody language model embeddings in binding specificity prediction to SARS-CoV-2 spike protein and influenza hemagglutinin. We perform supervised fine-tuning on four pre-trained antibody language models to predict specificity to these antigens and demonstrate that fine-tuned language model classifiers exhibit enhanced predictive accuracy compared to classifiers trained on pre-trained model embeddings. Additionally, we investigate the change of model attention activations after supervised fine-tuning to gain insights into the molecular basis of antigen recognition by antibodies. Furthermore, we apply the supervised fine-tuned models to BCR repertoire data related to influenza and SARS-CoV-2 vaccination, demonstrating their ability to capture changes in repertoire following vaccination. Overall, our study highlights the effect of supervised fine-tuning on pre-trained antibody language models as valuable tools to improve antigen specificity prediction.
Affiliation(s)
- Meng Wang
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
- Jonathan Patsenker
  - Program in Applied Mathematics, Yale University, New Haven, Connecticut, United States of America
- Henry Li
  - Program in Applied Mathematics, Yale University, New Haven, Connecticut, United States of America
- Yuval Kluger
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
  - Program in Applied Mathematics, Yale University, New Haven, Connecticut, United States of America
  - Department of Pathology, Yale School of Medicine, New Haven, Connecticut, United States of America
- Steven H. Kleinstein
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
  - Department of Pathology, Yale School of Medicine, New Haven, Connecticut, United States of America
  - Department of Immunobiology, Yale School of Medicine, New Haven, Connecticut, United States of America

6
Zaslavsky ME, Craig E, Michuda JK, Sehgal N, Ram-Mohan N, Lee JY, Nguyen KD, Hoh RA, Pham TD, Röltgen K, Lam B, Parsons ES, Macwana SR, DeJager W, Drapeau EM, Roskin KM, Cunningham-Rundles C, Moody MA, Haynes BF, Goldman JD, Heath JR, Chinthrajah RS, Nadeau KC, Pinsky BA, Blish CA, Hensley SE, Jensen K, Meyer E, Balboni I, Utz PJ, Merrill JT, Guthridge JM, James JA, Yang S, Tibshirani R, Kundaje A, Boyd SD. Disease diagnostics using machine learning of B cell and T cell receptor sequences. Science 2025; 387:eadp2407. [PMID: 39977494 DOI: 10.1126/science.adp2407] [Received: 04/03/2024] [Accepted: 11/29/2024] [Indexed: 02/22/2025]
Abstract
Clinical diagnosis typically incorporates physical examination, patient history, various laboratory tests, and imaging studies but makes limited use of the human immune system's own record of antigen exposures encoded by receptors on B cells and T cells. We analyzed immune receptor datasets from 593 individuals to develop MAchine Learning for Immunological Diagnosis, an interpretive framework to screen for multiple illnesses simultaneously or precisely test for one condition. This approach detects specific infections, autoimmune disorders, vaccine responses, and disease severity differences. Human-interpretable features of the model recapitulate known immune responses to severe acute respiratory syndrome coronavirus 2, influenza, and human immunodeficiency virus, highlight antigen-specific receptors, and reveal distinct characteristics of systemic lupus erythematosus and type-1 diabetes autoreactivity. This analysis framework has broad potential for scientific and clinical interpretation of immune responses.
MeSH Headings
- Humans
- Machine Learning
- Receptors, Antigen, T-Cell/immunology
- Receptors, Antigen, B-Cell/immunology
- Receptors, Antigen, B-Cell/metabolism
- Diabetes Mellitus, Type 1/immunology
- Diabetes Mellitus, Type 1/diagnosis
- Lupus Erythematosus, Systemic/diagnosis
- Lupus Erythematosus, Systemic/immunology
- COVID-19/diagnosis
- COVID-19/immunology
- B-Lymphocytes/immunology
Affiliation(s)
- Maxim E Zaslavsky
  - Department of Computer Science, Stanford University, Stanford, CA, USA
- Erin Craig
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Jackson K Michuda
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Nidhi Sehgal
  - Department of Genetics, Stanford University, Stanford, CA, USA
  - Department of Pathology, Stanford University, Stanford, CA, USA
- Nikhil Ram-Mohan
  - Department of Emergency Medicine, Stanford University, Stanford, CA, USA
- Ji-Yeun Lee
  - Department of Pathology, Stanford University, Stanford, CA, USA
- Khoa D Nguyen
  - Department of Pathology, Stanford University, Stanford, CA, USA
- Ramona A Hoh
  - Department of Pathology, Stanford University, Stanford, CA, USA
- Tho D Pham
  - Department of Pathology, Stanford University, Stanford, CA, USA
  - Stanford Blood Center, Stanford, CA, USA
- Katharina Röltgen
  - Department of Medical Parasitology and Infection Biology, Swiss Tropical and Public Health Institute, Allschwil, Switzerland
  - University of Basel, Basel, Switzerland
- Brandon Lam
  - Department of Pathology, Stanford University, Stanford, CA, USA
- Ella S Parsons
  - Sean N. Parker Center for Allergy and Asthma Research, Stanford University, Stanford, CA, USA
- Susan R Macwana
  - Department of Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation, Oklahoma City, OK, USA
- Wade DeJager
  - Department of Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation, Oklahoma City, OK, USA
- Elizabeth M Drapeau
  - Department of Microbiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Krishna M Roskin
  - Department of Pediatrics, University of Cincinnati, College of Medicine, Cincinnati, OH, USA
  - Divisions of Biomedical Informatics and Immunobiology, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, USA
- M Anthony Moody
  - Department of Pediatrics, Duke University, Durham, NC, USA
  - Duke Human Vaccine Institute, Duke University, Durham, NC, USA
  - Department of Immunology, Duke University, Durham, NC, USA
- Barton F Haynes
  - Duke Human Vaccine Institute, Duke University, Durham, NC, USA
  - Department of Immunology, Duke University, Durham, NC, USA
  - Department of Medicine, Duke University, Durham, NC, USA
- Jason D Goldman
  - Swedish Center for Research and Innovation, Swedish Medical Center, Seattle, WA, USA
  - Division of Allergy and Infectious Diseases, University of Washington, Seattle, WA, USA
- James R Heath
  - Institute for Systems Biology, Seattle, WA, USA
  - Department of Bioengineering, University of Washington, Seattle, WA, USA
- R Sharon Chinthrajah
  - Sean N. Parker Center for Allergy and Asthma Research, Stanford University, Stanford, CA, USA
- Kari C Nadeau
  - Department of Environmental Health, Harvard T.H. Chan School of Public Health, Boston, MA, USA
  - Division of Allergy and Inflammation, Beth Israel Deaconess Medical Center, Boston, MA, USA
- Benjamin A Pinsky
  - Department of Pathology, Stanford University, Stanford, CA, USA
  - Department of Medicine, Stanford University, Stanford, CA, USA
- Scott E Hensley
  - Department of Microbiology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Kent Jensen
  - Department of Medicine, Stanford University, Stanford, CA, USA
- Everett Meyer
  - Department of Medicine, Stanford University, Stanford, CA, USA
- Imelda Balboni
  - Department of Pediatrics, Stanford University, Stanford, CA, USA
- Paul J Utz
  - Department of Medicine, Stanford University, Stanford, CA, USA
- Joan T Merrill
  - Department of Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation, Oklahoma City, OK, USA
  - Department of Medicine, Grossman School of Medicine, New York University, New York, NY, USA
  - Lupus Foundation of America, Washington, DC, USA
- Joel M Guthridge
  - Department of Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation, Oklahoma City, OK, USA
- Judith A James
  - Department of Arthritis and Clinical Immunology, Oklahoma Medical Research Foundation, Oklahoma City, OK, USA
- Samuel Yang
  - Department of Emergency Medicine, Stanford University, Stanford, CA, USA
- Robert Tibshirani
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
  - Department of Statistics, Stanford University, Stanford, CA, USA
- Anshul Kundaje
  - Department of Computer Science, Stanford University, Stanford, CA, USA
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Scott D Boyd
  - Department of Pathology, Stanford University, Stanford, CA, USA
  - Sean N. Parker Center for Allergy and Asthma Research, Stanford University, Stanford, CA, USA

7
Gallo E. Revolutionizing Synthetic Antibody Design: Harnessing Artificial Intelligence and Deep Sequencing Big Data for Unprecedented Advances. Mol Biotechnol 2025; 67:410-424. [PMID: 38308755 DOI: 10.1007/s12033-024-01064-2] [Received: 11/03/2023] [Accepted: 01/02/2024] [Indexed: 02/05/2024]
Abstract
Synthetic antibodies (Abs) represent a category of engineered proteins meticulously crafted to replicate the functions of their natural counterparts. Such Abs are generated in vitro, enabling advanced molecular alterations associated with antigen recognition, paratope site engineering, and biochemical refinements. In a parallel realm, deep sequencing has brought about a paradigm shift in molecular biology. It facilitates the prompt and cost-effective high-throughput sequencing of DNA and RNA molecules, enabling the comprehensive big data analysis of Ab transcriptomes, including specific regions of interest. Significantly, the integration of artificial intelligence (AI), based on machine- and deep-learning approaches, has fundamentally transformed our capacity to discern patterns hidden within deep sequencing big data, including distinctive Ab features and protein folding free energy landscapes. Ultimately, current AI advances can generate approximations of the most stable Ab structural configurations, enabling the prediction of de novo synthetic Abs. As a result, this manuscript comprehensively examines the latest and relevant literature concerning the intersection of deep sequencing big data and AI methodologies for the design and development of synthetic Abs. Together, these advancements have accelerated the exploration of antibody repertoires, contributing to the refinement of synthetic Ab engineering and optimization, and facilitating advancements in the lead identification process.
Affiliation(s)
- Eugenio Gallo
  - Avance Biologicals, Department of Medicinal Chemistry, 950 Dupont Street, Toronto, ON, M6H 1Z2, Canada
  - RevivAb, Department of Protein Engineering, Av. Ipiranga, 6681, Partenon, Porto Alegre, RS, 90619-900, Brazil

8
Liu J, Yang M, Yu Y, Xu H, Wang T, Li K, Zhou X. Advancing bioinformatics with large language models: components, applications and perspectives. arXiv 2025:arXiv:2401.04155v2. [PMID: 38259343 PMCID: PMC10802675] [Indexed: 01/24/2024]
Abstract
Large language models (LLMs) are a class of artificial intelligence models based on deep learning, which have great performance in various tasks, especially in natural language processing (NLP). Large language models typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we will provide a comprehensive overview of the essential components of large language models (LLMs) in bioinformatics, spanning genomics, transcriptomics, proteomics, drug discovery, and single-cell analysis. Key aspects covered include tokenization methods for diverse data types, the architecture of transformer models, the core attention mechanism, and the pre-training processes underlying these models. Additionally, we will introduce currently available foundation models and highlight their downstream applications across various bioinformatics domains. Finally, drawing from our experience, we will offer practical guidance for both LLM users and developers, emphasizing strategies to optimize their use and foster further innovation in the field.
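The "core attention mechanism" named in this abstract is usually the scaled dot-product attention of the transformer. The following is a minimal sketch in plain Python, not code from the review: the function names and the two-dimensional toy token vectors are illustrative assumptions, and real models add learned query, key, and value projections plus multiple heads.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Single-head attention on lists of lists.

    queries: n_q x d, keys: n_k x d, values: n_k x d_v.
    Returns (outputs, weights), where weights[i][j] is how strongly
    query i attends to key j.
    """
    d = len(keys[0])
    scale = 1.0 / math.sqrt(d)  # keeps dot products from growing with d
    weights = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights.append(softmax(scores))
    d_v = len(values[0])
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(d_v)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical embeddings for three tokens (e.g. residues or k-mers).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Self-attention: the same vectors act as queries, keys, and values.
outputs, weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `weights` sums to one, so every output vector is a weighted average of the value vectors; this is what lets a token representation draw on context from the whole sequence.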
Affiliation(s)
- Jiajia Liu
  - Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
- Mengyuan Yang
  - Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, Xi’an, China
- Yankai Yu
  - School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
- Haixia Xu
  - Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
- Tiangang Wang
  - Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
- Kang Li
  - West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
- Xiaobo Zhou
  - Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
  - McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
  - School of Dentistry, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA

9
Singh R, Im C, Qiu Y, Mackness B, Gupta A, Joren T, Sledzieski S, Erlach L, Wendt M, Fomekong Nanfack Y, Bryson B, Berger B. Learning the language of antibody hypervariability. Proc Natl Acad Sci U S A 2025; 122:e2418918121. [PMID: 39793083 PMCID: PMC11725859 DOI: 10.1073/pnas.2418918121] [Received: 09/15/2024] [Accepted: 11/19/2024] [Indexed: 01/12/2025] [Open Access]
Abstract
Protein language models (PLMs) have demonstrated impressive success in modeling proteins. However, general-purpose "foundational" PLMs have limited performance in modeling antibodies due to the latter's hypervariable regions, which do not conform to the evolutionary conservation principles that such models rely on. In this study, we propose a transfer learning framework called Antibody Mutagenesis-Augmented Processing (AbMAP), which fine-tunes foundational models for antibody-sequence inputs by supervising on antibody structure and binding specificity examples. Our learned feature representations accurately predict mutational effects on antigen binding, paratope identification, and other key antibody properties. We experimentally validate AbMAP for antibody optimization by applying it to refine a set of antibodies that bind to a SARS-CoV-2 peptide, and obtain an 82% hit-rate and up to 22-fold increase in binding affinity. AbMAP also unlocks large-scale analyses of immune repertoires, revealing that B-cell receptor repertoires of individuals, while remarkably different in sequence, converge toward similar structural and functional coverage. Importantly, AbMAP's transfer learning approach can be readily adapted to advances in foundational PLMs. We anticipate AbMAP will accelerate the efficient design and modeling of antibodies, expedite the discovery of antibody-based therapeutics, and deepen our understanding of humoral immunity.
Affiliation(s)
- Rohit Singh
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
- Chiho Im
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
- Yu Qiu
  - Sanofi R&D Large Molecule Research, Cambridge, MA 02141
- Abhinav Gupta
  - Sanofi R&D Large Molecule Research, Cambridge, MA 02141
- Taylor Joren
  - Sanofi R&D Data and Data Science, Artificial Intelligence and Deep Analytics, Cambridge, MA 02141
- Samuel Sledzieski
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
- Lena Erlach
  - Department of Biosystems Science and Engineering, ETH Zürich, 8092, Switzerland
- Maria Wendt
  - Sanofi R&D Large Molecule Research, Cambridge, MA 02141
- Bryan Bryson
  - Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139

10
Yin M, Zhou H, Zhu Y, Lin M, Wu Y, Wu J, Xu H, Hsieh CY, Hou T, Chen J, Wu J. Multi-Modal CLIP-Informed Protein Editing. Health Data Science 2024; 4:0211. [PMID: 39703565 PMCID: PMC11658819 DOI: 10.34133/hds.0211] [Received: 07/24/2024] [Revised: 10/17/2024] [Accepted: 11/12/2024] [Indexed: 12/21/2024]
Abstract
Background: Proteins govern most biological functions essential for life, and achieving controllable protein editing has made great advances in probing natural systems, creating therapeutic conjugates, and generating novel protein constructs. Recently, machine learning-assisted protein editing (MLPE) has shown promise in accelerating optimization cycles and reducing experimental workloads. However, current methods struggle with the vast combinatorial space of potential protein edits and cannot explicitly conduct protein editing using biotext instructions, limiting their interactivity with human feedback. Methods: To fill these gaps, we propose a novel method called ProtET for efficient CLIP-informed protein editing through multi-modality learning. Our approach comprises 2 stages: In the pretraining stage, contrastive learning aligns protein-biotext representations encoded by 2 large language models (LLMs). Subsequently, during the protein editing stage, the fused features from editing instruction texts and original protein sequences serve as the final editing condition for generating target protein sequences. Results: Comprehensive experiments demonstrated the superiority of ProtET in editing proteins to enhance human-expected functionality across multiple attribute domains, including enzyme catalytic activity, protein stability, and antibody-specific binding ability. ProtET improves the state-of-the-art results by a large margin, leading to substantial stability improvements of 16.67% and 16.90%. Conclusions: This capability positions ProtET to advance real-world artificial protein editing, potentially addressing unmet academic, industrial, and clinical needs.
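The pretraining stage described here aligns protein and biotext representations by contrastive learning. The sketch below shows the generic symmetric CLIP-style objective in plain Python; it is not ProtET's implementation, and the function names, the toy two-dimensional embeddings, and the temperature of 0.07 are all illustrative assumptions.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def log_softmax(row):
    """Numerically stable log-softmax over a list of floats."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_style_loss(protein_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched pairs.

    protein_emb[i] and text_emb[i] are assumed to describe the same
    protein; every other combination in the batch acts as a negative.
    """
    p = [l2_normalize(v) for v in protein_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(p)
    # Cosine similarity between every protein and every text, sharpened
    # by the temperature.
    logits = [
        [sum(a * b for a, b in zip(pi, tj)) / temperature for tj in t]
        for pi in p
    ]
    # Protein-to-text direction: the matching text sits on the diagonal.
    loss_p2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-protein direction: same idea over the columns.
    loss_t2p = -sum(
        log_softmax([logits[i][j] for i in range(n)])[j] for j in range(n)
    ) / n
    return 0.5 * (loss_p2t + loss_t2p)


# Hypothetical encoder outputs for two proteins and their descriptions.
proteins = [[1.0, 0.0], [0.0, 1.0]]
aligned = clip_style_loss(proteins, [[2.0, 0.1], [0.1, 3.0]])
shuffled = clip_style_loss(proteins, [[0.1, 3.0], [2.0, 0.1]])
```

With these toy inputs, `aligned` comes out much smaller than `shuffled`, which is the behaviour training exploits: minimising the loss pulls each protein embedding towards its own description and away from the others in the batch.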
Affiliation(s)
- Mingze Yin
- School of Medicine, Zhejiang University, Hangzhou, China
- Hanjing Zhou
- College of Computer Science and Technology, Zhejiang University, Hangzhou, China
- Yiheng Zhu
- College of Computer Science and Technology, Zhejiang University, Hangzhou, China
- Miao Lin
- Medical Big Data Center, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Yixuan Wu
- School of Medicine, Zhejiang University, Hangzhou, China
- Jialu Wu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Hongxia Xu
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Chang-Yu Hsieh
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Tingjun Hou
- Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Jintai Chen
- AI Thrust, Information Hub, HKUST (Guangzhou), Guangzhou, China
- Jian Wu
- Second Affiliated Hospital School of Medicine, Hangzhou, China
- School of Public Health, Zhejiang University, Hangzhou, China
- Institute of Wenzhou, Wenzhou, China
11
Michalewicz K, Barahona M, Bravi B. ANTIPASTI: Interpretable prediction of antibody binding affinity exploiting normal modes and deep learning. Structure 2024; 32:2422-2434.e5. [PMID: 39461331 DOI: 10.1016/j.str.2024.10.001] [Received: 12/22/2023] [Revised: 05/30/2024] [Accepted: 10/01/2024] [Indexed: 10/29/2024]
Abstract
The high binding affinity of antibodies toward their cognate targets is key to eliciting effective immune responses, as well as to the use of antibodies as research and therapeutic tools. Here, we propose ANTIPASTI, a convolutional neural network model that achieves state-of-the-art performance in the prediction of antibody binding affinity using as input a representation of antibody-antigen structures in terms of normal mode correlation maps derived from elastic network models. This representation captures not only structural features but energetic patterns of local and global residue fluctuations. The learnt representations are interpretable: they reveal similarities of binding patterns among antibodies targeting the same antigen type, and can be used to quantify the importance of antibody regions contributing to binding affinity. Our results show the importance of the antigen imprint in the normal mode landscape, and the dominance of cooperative effects and long-range correlations between antibody regions to determine binding affinity.
Affiliation(s)
- Kevin Michalewicz
- Department of Mathematics, Imperial College London, London SW7 2AZ, UK
- Mauricio Barahona
- Department of Mathematics, Imperial College London, London SW7 2AZ, UK
- Barbara Bravi
- Department of Mathematics, Imperial College London, London SW7 2AZ, UK
12
Kenlay H, Dreyer FA, Kovaltsuk A, Miketa D, Pires D, Deane CM. Large scale paired antibody language models. PLoS Comput Biol 2024; 20:e1012646. [PMID: 39642174 DOI: 10.1371/journal.pcbi.1012646] [Received: 06/30/2024] [Revised: 12/18/2024] [Accepted: 11/18/2024] [Indexed: 12/08/2024]
Abstract
Antibodies are proteins produced by the immune system that can identify and neutralise a wide variety of antigens with high specificity and affinity, and constitute the most successful class of biotherapeutics. With the advent of next-generation sequencing, billions of antibody sequences have been collected in recent years, though their application in the design of better therapeutics has been constrained by the sheer volume and complexity of the data. To address this challenge, we present IgBert and IgT5, the best performing antibody-specific language models developed to date which can consistently handle both paired and unpaired variable region sequences as input. These models are trained comprehensively using the more than two billion unpaired sequences and two million paired sequences of light and heavy chains present in the Observed Antibody Space dataset. We show that our models outperform existing antibody and protein language models on a diverse range of design and regression tasks relevant to antibody engineering. This advancement marks a significant leap forward in leveraging machine learning, large scale data sets and high-performance computing for enhancing antibody design for therapeutic development.
Affiliation(s)
- Henry Kenlay
- Exscientia, Oxford Science Park, Oxford, United Kingdom
- Dom Miketa
- Exscientia, Oxford Science Park, Oxford, United Kingdom
- Douglas Pires
- Exscientia, Oxford Science Park, Oxford, United Kingdom
- Charlotte M Deane
- Exscientia, Oxford Science Park, Oxford, United Kingdom
- Department of Statistics, University of Oxford, Oxford, United Kingdom
13
Liang F, Sun M, Xie L, Zhao X, Liu D, Zhao K, Zhang G. Recent advances and challenges in protein complex model accuracy estimation. Comput Struct Biotechnol J 2024; 23:1824-1832. [PMID: 38707538 PMCID: PMC11066466 DOI: 10.1016/j.csbj.2024.04.049] [Received: 01/27/2024] [Revised: 04/18/2024] [Accepted: 04/18/2024] [Indexed: 05/07/2024]
Abstract
Estimation of model accuracy (EMA) plays a crucial role in protein structure prediction, aiming to evaluate the quality of predicted protein structure models accurately and objectively. This process is not only key to screening candidate models that are close to the real structure, but also provides guidance for further optimization of protein structures. With the significant advancements made by AlphaFold2 in monomer structure prediction, the problem of single-domain protein structure prediction has been largely solved. Correspondingly, the importance of assessing the quality of single-domain protein models has decreased, and the research focus has shifted to estimating the model accuracy of protein complexes. In this review, our goal is to provide a comprehensive overview of the reference and statistical metrics, representative methods, and current challenges within four distinct facets of complex EMA (Topology Global Score, Interface Total Score, Interface Residue-Wise Score, and Tertiary Residue-Wise Score).
Affiliation(s)
- Lei Xie
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Xuanfeng Zhao
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Dong Liu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Kailong Zhao
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
- Guijun Zhang
- College of Information Engineering, Zhejiang University of Technology, Hangzhou 310023, China
14
Meng F, Zhou N, Hu G, Liu R, Zhang Y, Jing M, Hou Q. A comprehensive overview of recent advances in generative models for antibodies. Comput Struct Biotechnol J 2024; 23:2648-2660. [PMID: 39027650 PMCID: PMC11254834 DOI: 10.1016/j.csbj.2024.06.016] [Received: 03/31/2024] [Revised: 06/15/2024] [Accepted: 06/18/2024] [Indexed: 07/20/2024]
Abstract
Therapeutic antibodies are an important class of biopharmaceuticals. With the rapid development of deep learning methods and the increasing amount of antibody data, antibody generative models have made great progress recently. They aim to solve the antibody space searching problems and are widely incorporated into the antibody development process. Therefore, a comprehensive introduction to the development methods in this field is imperative. Here, we collected 34 representative antibody generative models published recently and all generative models can be divided into three categories: sequence-generating models, structure-generating models, and hybrid models, based on their principles and algorithms. We further studied their performance and contributions to antibody sequence prediction, structure optimization, and affinity enhancement. Our manuscript will provide a comprehensive overview of the status of antibody generative models and also offer guidance for selecting different approaches.
Affiliation(s)
- Fanxu Meng
- College of Chemical Engineering, Qingdao University of Science and Technology, Qingdao 266042, China
- Na Zhou
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250100, China
- National Institute of Health Data Science of China, Shandong University, Jinan 250100, China
- Guangchun Hu
- School of Information Science and Engineering, University of Jinan, Jinan 250022, China
- Ruotong Liu
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250100, China
- National Institute of Health Data Science of China, Shandong University, Jinan 250100, China
- Yuanyuan Zhang
- College of Chemical Engineering, Qingdao University of Science and Technology, Qingdao 266042, China
- Ming Jing
- Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), Jinan 250353, China
- Shandong Provincial Key Laboratory of Computer Networks, Shandong Fundamental Research Center for Computer Science, Jinan 250000, China
- Qingzhen Hou
- Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan 250100, China
- National Institute of Health Data Science of China, Shandong University, Jinan 250100, China
15
Olsen TH, Moal IH, Deane CM. Addressing the antibody germline bias and its effect on language models for improved antibody design. Bioinformatics 2024; 40:btae618. [PMID: 39460949 PMCID: PMC11543624 DOI: 10.1093/bioinformatics/btae618] [Received: 02/06/2024] [Revised: 09/03/2024] [Accepted: 10/24/2024] [Indexed: 10/28/2024]
Abstract
MOTIVATION: The versatile binding properties of antibodies have made them an extremely important class of biotherapeutics. However, therapeutic antibody development is a complex, expensive, and time-consuming task, with the final antibody needing to not only have strong and specific binding but also be minimally impacted by developability issues. The success of transformer-based language models in protein sequence space and the availability of vast amounts of antibody sequences have led to the development of many antibody-specific language models to help guide antibody design. Antibody diversity primarily arises from V(D)J recombination, mutations within the CDRs, and/or from a few nongermline mutations outside the CDRs. Consequently, a significant portion of the variable domain of all natural antibody sequences remains germline. This affects the pre-training of antibody-specific language models, where this facet of the sequence data introduces a prevailing bias toward germline residues. This poses a challenge, as mutations away from the germline are often vital for generating specific and potent binding to a target, meaning that language models need to be able to suggest key mutations away from germline. RESULTS: In this study, we explore the implications of the germline bias, examining its impact on both general-protein and antibody-specific language models. We develop and train a series of new antibody-specific language models optimized for predicting nongermline residues. We then compare our final model, AbLang-2, with current models and show how it suggests a diverse set of valid mutations with high cumulative probability. AVAILABILITY AND IMPLEMENTATION: AbLang-2 is trained on both unpaired and paired data, and is freely available at https://github.com/oxpig/AbLang2.git.
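One way to make the germline bias described above visible is to score a model's residue predictions separately at germline and nongermline positions. The sketch below is an illustrative assumption rather than the evaluation used for AbLang-2: the function name, the toy sequences, and the premise that the three sequences are already aligned position by position are all hypothetical.

```python
def split_accuracy(true_seq, predicted_seq, germline_seq):
    """Per-residue accuracy reported separately for germline and nongermline positions.

    A position counts as germline when the observed residue equals the germline residue.
    All three sequences are assumed to be pre-aligned and of equal length.
    """
    if not (len(true_seq) == len(predicted_seq) == len(germline_seq)):
        raise ValueError("sequences must be aligned to the same length")
    hits = {"germline": 0, "nongermline": 0}
    totals = {"germline": 0, "nongermline": 0}
    for observed, predicted, germline in zip(true_seq, predicted_seq, germline_seq):
        key = "germline" if observed == germline else "nongermline"
        totals[key] += 1
        hits[key] += int(observed == predicted)
    # None marks a category with no positions, to avoid dividing by zero.
    return {k: (hits[k] / totals[k] if totals[k] else None) for k in totals}


# Toy case: a model that always predicts the germline residue.
scores = split_accuracy(true_seq="ACDEF", predicted_seq="ACDKF", germline_seq="ACDKF")
```

In the toy case the model is perfect on germline positions and wrong on the single mutated position, so an overall accuracy of 80% would hide a nongermline accuracy of zero.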
Affiliation(s)
- Tobias H Olsen
- Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
- GSK Medicines Research Centre, GSK, Stevenage SG1 2NY, United Kingdom
- Iain H Moal
- GSK Medicines Research Centre, GSK, Stevenage SG1 2NY, United Kingdom
- Charlotte M Deane
- Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
16
Karenna N, Bryan B. Focused learning by antibody language models using preferential masking of non-templated regions. bioRxiv 2024:2024.10.23.619908. [PMID: 39553994 PMCID: PMC11565838 DOI: 10.1101/2024.10.23.619908] [Indexed: 11/19/2024]
Abstract
Existing antibody language models (LMs) are pre-trained using a masked language modeling (MLM) objective with uniform masking probabilities. While these models excel at predicting germline residues, they often struggle with mutated and non-templated residues, which are crucial for antigen-binding specificity and concentrate in the complementarity-determining regions (CDRs). Here, we demonstrate that preferential masking of the non-templated CDR3 is a compute-efficient strategy to enhance model performance. We pre-trained two antibody LMs (AbLMs) using either uniform or preferential masking and observed that the latter improves residue prediction accuracy in the highly variable CDR3. Preferential masking also improves antibody classification by native chain pairing and binding specificity, suggesting improved CDR3 understanding and indicating that non-random, learnable patterns help govern antibody chain pairing. We further show that specificity classification is largely informed by residues in the CDRs, demonstrating that AbLMs learn meaningful patterns that align with immunological understanding.
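The difference between uniform and preferential masking can be sketched as a per-position masking probability that is raised inside the CDR3. This is a minimal illustration only: the abstract does not report masking rates or an implementation, so the function names, the `base_rate` and `cdr3_rate` values, and the `<mask>` token string are hypothetical.

```python
import random


def sample_mask_positions(seq_len, cdr3_span, base_rate=0.15, cdr3_rate=0.5, rng=None):
    """Pick positions to mask, using a higher masking probability inside the CDR3.

    `cdr3_span` is a half-open (start, end) index range. Both rates are
    illustrative assumptions, not values taken from the cited work.
    """
    rng = rng or random.Random()
    start, end = cdr3_span
    masked = []
    for i in range(seq_len):
        rate = cdr3_rate if start <= i < end else base_rate
        if rng.random() < rate:
            masked.append(i)
    return masked


def apply_mask(sequence, positions, mask_token="<mask>"):
    """Return the sequence as a token list with the chosen positions replaced."""
    tokens = list(sequence)
    for i in positions:
        tokens[i] = mask_token
    return tokens


# Extreme setting for illustration: mask every CDR3 position and nothing else.
positions = sample_mask_positions(10, (4, 7), base_rate=0.0, cdr3_rate=1.0)
tokens = apply_mask("ABCDEFGHIJ", positions)
```

Setting `base_rate` equal to `cdr3_rate` recovers uniform masking, so the two pretraining regimes compared in the abstract differ in this sketch only by those two parameters.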
Affiliation(s)
- Ng Karenna
- Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Briney Bryan
- Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Center for Viral Systems Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
- Multi-Omics Vaccine Evaluation Consortium, The Scripps Research Institute, La Jolla, CA 92037, USA
- Scripps Consortium for HIV/AIDS Vaccine Development, The Scripps Research Institute, La Jolla, CA 92037, USA
- San Diego Center for AIDS Research, The Scripps Research Institute, La Jolla, CA 92037, USA
17
Jagota M, Hsu C, Mazumder T, Sung K, DeWitt WS, Listgarten J, Matsen FA, Ye CJ, Song YS. Learning antibody sequence constraints from allelic inclusion. bioRxiv 2024:2024.10.22.619760. [PMID: 39484623 PMCID: PMC11526943 DOI: 10.1101/2024.10.22.619760] [Indexed: 11/03/2024]
Abstract
Antibodies and B-cell receptors (BCRs) are produced by B cells, and are built of a heavy chain and a light chain. Although each B cell could express two different heavy chains and four different light chains, usually only a unique pair of heavy chain and light chain is expressed, a phenomenon known as allelic exclusion. However, a small fraction of naive B cells violate allelic exclusion by expressing two productive light chains, one of which has impaired function; this has been called allelic inclusion. We demonstrate that these B cells can be used to learn constraints on antibody sequence. Using large-scale single-cell sequencing data from humans, we find examples of light chain allelic inclusion in thousands of naive B cells, which is an order of magnitude larger than existing datasets. We train machine learning models to identify the abnormal sequences in these cells. The resulting models correlate with antibody properties that they were not trained on, including polyreactivity, surface expression, and mutation usage in affinity maturation. These correlations are larger than what is achieved by existing antibody modeling approaches, indicating that allelic inclusion data contain useful new information. We also investigate the impact of similar selection forces on the heavy chain in mouse, and observe that pairing with the surrogate light chain significantly restricts heavy chain diversity.
Affiliation(s)
- Milind Jagota
- Computer Science Division, UC Berkeley, Berkeley, CA, USA
- Chloe Hsu
- Computer Science Division, UC Berkeley, Berkeley, CA, USA
- Thomas Mazumder
- Division of Rheumatology, Department of Medicine, UCSF, San Francisco, CA, USA
- Kevin Sung
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Frederick A. Matsen
- Public Health Sciences Division, Fred Hutchinson Cancer Research Center, Seattle, Washington, USA
- Chun Jimmie Ye
- Division of Rheumatology, Department of Medicine, UCSF, San Francisco, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
- Parker Institute for Cancer Immunotherapy, UCSF, San Francisco, CA, USA
- Institute for Human Genetics, UCSF, San Francisco, CA, USA
- Bakar Computational Health Sciences Institute, UCSF, San Francisco, California, USA
- Department of Epidemiology and Biostatistics, UCSF, San Francisco, CA, USA
- Yun S. Song
- Computer Science Division, UC Berkeley, Berkeley, CA, USA
- Department of Statistics, UC Berkeley, Berkeley, CA, USA
18
Gao X, Cao C, He C, Lai L. Pre-training with a rational approach for antibody sequence representation. Front Immunol 2024; 15:1468599. [PMID: 39507535 PMCID: PMC11537868 DOI: 10.3389/fimmu.2024.1468599] [Received: 07/22/2024] [Accepted: 09/30/2024] [Indexed: 11/08/2024]
Abstract
Introduction: Antibodies represent a specific class of proteins produced by the adaptive immune system in response to pathogens. Mining the information embedded in antibody amino acid sequences can benefit both antibody property prediction and novel therapeutic development. However, antibodies possess unique features that should be incorporated using specifically designed training methods, leaving room for improvement in pre-training models for antibody sequences. Methods: In this study, we present a Pre-trained model of Antibody sequences trained with a Rational Approach for antibodies (PARA). PARA employs a strategy conforming to antibody sequence patterns and an advanced natural language processing self-encoding model structure. This approach addresses the limitations of existing protein pre-training models, which primarily utilize language models without fully considering the differences between protein sequences and language sequences. Results: We demonstrate PARA's performance on several tasks by comparing it to various published pre-training models of antibodies. The results show that PARA significantly outperforms existing models on these tasks, suggesting that PARA has an advantage in capturing antibody sequence information. Discussion: The antibody latent representation provided by PARA can substantially facilitate studies in relevant areas. We believe that PARA's superior performance in capturing antibody sequence information offers significant potential for both antibody property prediction and the development of novel therapeutics. PARA is available at https://github.com/xtalpi-xic.
Affiliation(s)
- Xiangrui Gao
- XtalPi Innovation Center, XtalPi Inc., Beijing, China
- Changling Cao
- XtalPi Innovation Center, XtalPi Inc., Beijing, China
- School of Medical Technology, Beijing Institute of Technology, Beijing, China
- Chenfeng He
- XtalPi Innovation Center, XtalPi Inc., Beijing, China
- Lipeng Lai
- XtalPi Innovation Center, XtalPi Inc., Beijing, China
19
Chen HT, Zhang Y, Huang J, Sawant M, Smith MD, Rajagopal N, Desai AA, Makowski E, Licari G, Xie Y, Marlow MS, Kumar S, Tessier PM. Human antibody polyreactivity is governed primarily by the heavy-chain complementarity-determining regions. Cell Rep 2024; 43:114801. [PMID: 39392756 PMCID: PMC11564698 DOI: 10.1016/j.celrep.2024.114801] [Received: 04/26/2023] [Revised: 07/09/2024] [Accepted: 09/11/2024] [Indexed: 10/13/2024]
Abstract
Although antibody variable regions mediate antigen-specific binding, they can also mediate non-specific interactions with non-cognate antigens, impacting diverse immunological processes and the efficacy, safety, and half-life of antibody therapeutics. To understand the molecular basis of antibody non-specificity, we sorted two dissimilar human naïve antibody libraries against multiple reagents to enrich for variants with different levels of polyreactivity. Sequence analysis of >300,000 paired antibody variable regions revealed that the heavy chain primarily mediates human antibody polyreactivity, and this is due to the high positive charge, high hydrophobicity, and combinations thereof in the corresponding complementarity-determining regions, which can be predicted using a machine learning model developed in this work. Notably, a subset of the most important features governing antibody non-specific interactions, namely those that contain tyrosine, also govern specific antigen recognition. Our findings are broadly relevant for understanding fundamental aspects of antibody molecular recognition and the applied aspects of antibody-drug design.
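The two sequence properties named above, positive charge and hydrophobicity of the complementarity-determining regions, can be approximated from sequence with simple descriptors. The sketch below is not the machine learning model from the cited work, whose features are not given in the abstract. It assumes a crude net charge (K and R counted as +1, D and E as -1, histidine ignored) and the mean Kyte-Doolittle hydropathy as stand-in features, and the example sequences are made up.

```python
# Kyte-Doolittle hydropathy index for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def cdr_features(cdr_sequence):
    """Crude charge and hydrophobicity descriptors for one CDR sequence.

    Raises KeyError for non-standard residue codes and ValueError for an
    empty sequence. The feature choice is an illustrative assumption.
    """
    seq = cdr_sequence.upper()
    if not seq:
        raise ValueError("CDR sequence must not be empty")
    positive = sum(seq.count(aa) for aa in "KR")
    negative = sum(seq.count(aa) for aa in "DE")
    mean_hydropathy = sum(KYTE_DOOLITTLE[aa] for aa in seq) / len(seq)
    return {"net_charge": positive - negative, "mean_hydropathy": mean_hydropathy}


# Hypothetical CDR fragments, used only to exercise the function.
charged = cdr_features("RRK")
neutral = cdr_features("KRDE")
greasy = cdr_features("AI")
```

Descriptors like these could serve as inputs to a downstream classifier; how the published model actually combines charge and hydrophobicity is not stated in the abstract.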
Affiliation(s)
- Hsin-Ting Chen
- Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
- Yulei Zhang
- Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
- Jie Huang
- Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
- Manali Sawant
- Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
- Matthew D Smith
- Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
- Nandhini Rajagopal
- Biotherapeutics Discovery, Boehringer Ingelheim Pharmaceuticals Inc., 900 Ridgebury Road, Ridgefield, CT 06877, USA
- Alec A Desai
- Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
- Emily Makowski
- Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
- Giuseppe Licari
- Biotherapeutics Discovery, Boehringer Ingelheim Pharmaceuticals Inc., 900 Ridgebury Road, Ridgefield, CT 06877, USA
- Yunxuan Xie
- Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
- Michael S Marlow
- Biotherapeutics Discovery, Boehringer Ingelheim Pharmaceuticals Inc., 900 Ridgebury Road, Ridgefield, CT 06877, USA
- Sandeep Kumar
- Biotherapeutics Discovery, Boehringer Ingelheim Pharmaceuticals Inc., 900 Ridgebury Road, Ridgefield, CT 06877, USA
- Peter M Tessier
- Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
20
Yagimoto K, Hosoda S, Sato M, Hamada M. Prediction of antibiotic resistance mechanisms using a protein language model. Bioinformatics 2024; 40:btae550. [PMID: 39254573 PMCID: PMC11464418 DOI: 10.1093/bioinformatics/btae550] [Received: 05/02/2024] [Revised: 08/13/2024] [Accepted: 09/07/2024] [Indexed: 09/11/2024]
Abstract
MOTIVATION: Antibiotic resistance has emerged as a major global health threat, with an increasing number of bacterial infections becoming difficult to treat. Predicting the underlying resistance mechanisms of antibiotic resistance genes (ARGs) is crucial for understanding and combating this problem. However, existing methods struggle to accurately predict resistance mechanisms for ARGs with low similarity to known sequences and lack sufficient interpretability of the prediction models. RESULTS: In this study, we present a novel approach for predicting ARG resistance mechanisms using ProteinBERT, a protein language model (pLM) based on deep learning. Our method outperforms state-of-the-art techniques on diverse ARG datasets, including those with low homology to the training data, highlighting its potential for predicting the resistance mechanisms of unknown ARGs. Attention analysis of the model reveals that it considers biologically relevant features, such as conserved amino acid residues and antibiotic target binding sites, when making predictions. These findings provide valuable insights into the molecular basis of antibiotic resistance and demonstrate the interpretability of pLMs, offering a new perspective on their application in bioinformatics. AVAILABILITY AND IMPLEMENTATION: The source code is available for free at https://github.com/hmdlab/ARG-BERT. The output results of the model are published at https://waseda.box.com/v/ARG-BERT-suppl.
Affiliation(s)
- Kanami Yagimoto
- Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Shion Hosoda
- Center for Exploratory Research, Research and Development Group, Hitachi, Ltd, Tokyo 185-8601, Japan
- Miwa Sato
- Center for Exploratory Research, Research and Development Group, Hitachi, Ltd, Tokyo 185-8601, Japan
- Michiaki Hamada
- Department of Electrical Engineering and Bioscience, Graduate School of Advanced Science and Engineering, Waseda University, Tokyo 169-8555, Japan
- Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology, Tokyo 169-8555, Japan
- Graduate School of Medicine, Nippon Medical School, Tokyo 113-8602, Japan
21
Schmirler R, Heinzinger M, Rost B. Fine-tuning protein language models boosts predictions across diverse tasks. Nat Commun 2024; 15:7407. [PMID: 39198457 PMCID: PMC11358375 DOI: 10.1038/s41467-024-51844-2] [Received: 01/25/2024] [Accepted: 08/15/2024] [Indexed: 09/01/2024]
Abstract
Prediction methods that take embeddings from protein language models as input have reached or even surpassed state-of-the-art performance on many protein prediction tasks. In natural language processing, fine-tuning large language models has become the de facto standard. In contrast, most protein language model-based protein predictions do not back-propagate to the language model. Here, we compare the fine-tuning of three state-of-the-art models (ESM2, ProtT5, Ankh) on eight different tasks. Two results stand out. Firstly, task-specific supervised fine-tuning almost always improves downstream predictions. Secondly, parameter-efficient fine-tuning can reach similar improvements while consuming substantially fewer resources, with up to 4.5-fold acceleration of training over fine-tuning full models. Our results suggest always trying fine-tuning, in particular for problems with small datasets, such as fitness landscape predictions of a single protein. For ease of adaptability, we provide easy-to-use notebooks to fine-tune all models used during this work for per-protein (pooling) and per-residue prediction tasks.
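To show why parameter-efficient fine-tuning needs fewer resources than full fine-tuning, the sketch below uses low-rank adaptation (LoRA) as one common example of such a scheme. This is an assumption for illustration: the abstract does not name the scheme used, and the function names, the `alpha` scaling value, and the layer sizes are hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def lora_forward(W, A, B, x, alpha=16.0):
    """Output of a frozen linear layer W plus a trainable low-rank update.

    Shapes: W is d_out x d_in (frozen), A is r x d_in and B is d_out x r
    (both trainable). Computes W x + (alpha / r) * B (A x).
    """
    rank = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]


def lora_param_counts(d_in, d_out, rank):
    """Return (parameters in the full weight matrix, trainable LoRA parameters)."""
    return d_in * d_out, rank * (d_in + d_out)


# With B initialised to zeros, the adapted layer reproduces the frozen layer.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
y_init = lora_forward(W, A, [[0.0], [0.0]], [2.0, 3.0])
y_tuned = lora_forward(W, A, [[1.0], [0.0]], [2.0, 3.0])
full, trainable = lora_param_counts(1024, 1024, 8)
```

For a hypothetical 1024 x 1024 layer at rank 8, the trainable update has 16,384 parameters against 1,048,576 in the full matrix, which is the kind of reduction that makes training cheaper.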
Affiliation(s)
- Robert Schmirler
- TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Garching/Munich, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching/Munich, Germany
- AbbVie Deutschland GmbH & Co. KG, Innovation Center, BTS IR LU, Ludwigshafen, Germany
- Michael Heinzinger
- TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Garching/Munich, Germany
- Burkhard Rost
- TUM (Technical University of Munich), School of Computation, Information and Technology (CIT), Faculty of Informatics, Chair of Bioinformatics & Computational Biology - i12, Garching/Munich, Germany
- Institute for Advanced Study (TUM-IAS), Garching/Munich, Germany
- TUM School of Life Sciences Weihenstephan (WZW), Freising, Germany
22
Wang Q, Feng Y, Wang Y, Li B, Wen J, Zhou X, Song Q. AntiFormer: graph enhanced large language model for binding affinity prediction. Brief Bioinform 2024; 25:bbae403. [PMID: 39162312 PMCID: PMC11333967 DOI: 10.1093/bib/bbae403] [Received: 04/03/2024] [Revised: 07/24/2024] [Accepted: 07/30/2024] [Indexed: 08/21/2024]
Abstract
Antibodies play a pivotal role in immune defense and serve as key therapeutic agents. The process of affinity maturation, wherein antibodies evolve through somatic mutations to achieve heightened specificity and affinity to target antigens, is crucial for effective immune response. Despite their significance, assessing antibody-antigen binding affinity remains challenging due to limitations in conventional wet lab techniques. To address this, we introduce AntiFormer, a graph-based large language model designed to predict antibody binding affinity. AntiFormer incorporates sequence information into a graph-based framework, allowing for precise prediction of binding affinity. Through extensive evaluations, AntiFormer demonstrates superior performance compared with existing methods, offering accurate predictions with reduced computational time. Application of AntiFormer to severe acute respiratory syndrome coronavirus 2 patient samples reveals antibodies with strong neutralizing capabilities, providing insights for therapeutic development and vaccination strategies. Furthermore, analysis of individual samples following influenza vaccination elucidates differences in antibody response between young and older adults. AntiFormer identifies specific clonotypes with enhanced binding affinity post-vaccination, particularly in young individuals, suggesting age-related variations in immune response dynamics. Moreover, our findings underscore the importance of large clonotype category in driving affinity maturation and immune modulation. Overall, AntiFormer is a promising approach to accelerate antibody-based diagnostics and therapeutics, bridging the gap between traditional methods and complex antibody maturation processes.
Affiliation(s)
- Qing Wang
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, FL 32611, USA
- Yuzhou Feng
  - Department of Laboratory Medicine and West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China
  - Shihezi University School of Medicine, Shihezi University, Shihezi 832003, China
- Yanfei Wang
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, FL 32611, USA
- Bo Li
  - Department of Computer and Information Science, University of Macau, Macau SAR, China
- Jianguo Wen
  - Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Xiaobo Zhou
  - Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Qianqian Song
  - Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, FL 32611, USA
23
Wossnig L, Furtmann N, Buchanan A, Kumar S, Greiff V. Best practices for machine learning in antibody discovery and development. Drug Discov Today 2024; 29:104025. [PMID: 38762089 DOI: 10.1016/j.drudis.2024.104025] [Received: 12/14/2023] [Revised: 04/25/2024] [Accepted: 05/13/2024] [Indexed: 05/20/2024]
Abstract
In the past 40 years, therapeutic antibody discovery and development have advanced considerably, with machine learning (ML) offering a promising way to speed up the process by reducing costs and the number of experiments required. Recent progress in ML-guided antibody design and development (D&D) has been hindered by the diversity of data sets and evaluation methods, which makes it difficult to conduct comparisons and assess utility. Establishing standards and guidelines will be crucial for the wider adoption of ML and the advancement of the field. This perspective critically reviews current practices, highlights common pitfalls and proposes method development and evaluation guidelines for various ML-based techniques in therapeutic antibody D&D. Addressing challenges across the ML process, best practices are recommended for each stage to enhance reproducibility and progress.
Affiliation(s)
- Leonard Wossnig
  - LabGenius Ltd, The Biscuit Factory, 100 Drummond Road, London SE16 4DG, UK; Department of Computer Science, University College London, 66-72 Gower St, London WC1E 6EA, UK
- Norbert Furtmann
  - R&D Large Molecules Research Platform, Sanofi Deutschland GmbH, Industriepark Höchst, Frankfurt Am Main, Germany
- Andrew Buchanan
  - Biologics Engineering, R&D, AstraZeneca, Cambridge CB2 0AA, UK
- Sandeep Kumar
  - Computational Protein Design and Modeling Group, Computational Science, Moderna Therapeutics, 200 Technology Square, Cambridge, MA 02139, USA
- Victor Greiff
  - Department of Immunology and Oslo University Hospital, University of Oslo, Oslo, Norway
24
Vu MH, Robert PA, Akbar R, Swiatczak B, Sandve GK, Haug DTT, Greiff V. Linguistics-based formalization of the antibody language as a basis for antibody language models. Nature Computational Science 2024; 4:412-422. [PMID: 38877120 DOI: 10.1038/s43588-024-00642-3] [Received: 09/29/2022] [Accepted: 05/13/2024] [Indexed: 06/16/2024]
Abstract
Apparent parallels between natural language and antibody sequences have led to a surge in deep language models applied to antibody sequences for predicting cognate antigen recognition. However, a linguistic formal definition of antibody language does not exist, and insight into how antibody language models capture antibody-specific binding features remains largely uninterpretable. Here we describe how a linguistic formalization of the antibody language, by characterizing its tokens and grammar, could address current challenges in antibody language model rule mining.
Affiliation(s)
- Mai Ha Vu
  - Department of Linguistics and Scandinavian Studies, University of Oslo, Oslo, Norway
- Philippe A Robert
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Rahmad Akbar
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Bartlomiej Swiatczak
  - Department of History of Science and Scientific Archeology, University of Science and Technology of China, Hefei, China
- Victor Greiff
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
25
Cheng J, Liang T, Xie XQ, Feng Z, Meng L. A new era of antibody discovery: an in-depth review of AI-driven approaches. Drug Discov Today 2024; 29:103984. [PMID: 38642702 DOI: 10.1016/j.drudis.2024.103984] [Received: 10/12/2023] [Revised: 04/02/2024] [Accepted: 04/15/2024] [Indexed: 04/22/2024]
Abstract
Given their high affinity and specificity for a range of macromolecules, antibodies are widely used in the treatment of autoimmune diseases, cancers, inflammatory diseases, and Alzheimer's disease (AD). Traditional experimental methods are time-consuming, expensive, and labor-intensive. Recent advances in artificial intelligence (AI) technologies provide complementary methods that can reduce the time and costs required for antibody design by minimizing failures and increasing the success rate of experimental tests. In this review, we scrutinize the plethora of AI-driven methodologies that have been deployed over the past 4 years for modeling antibody structures, predicting antibody-antigen interactions, optimizing antibody affinity, and generating novel antibody candidates. We also briefly address the challenges faced in integrating AI-based models with traditional antibody discovery pipelines and highlight the potential future directions in this burgeoning field.
Affiliation(s)
- Jin Cheng
  - School of Pharmacy, Jiangsu Vocational College of Medicine, Yancheng, 224005, China
- Tianjian Liang
  - Department of Pharmaceutical Sciences, Computational Chemical Genomics Screening Center, and Pharmacometrics & System Pharmacology PharmacoAnalytics, School of Pharmacy, University of Pittsburgh, Pittsburgh, PA 15261, USA; Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Xiang-Qun Xie
  - Department of Pharmaceutical Sciences, Computational Chemical Genomics Screening Center, and Pharmacometrics & System Pharmacology PharmacoAnalytics, School of Pharmacy, University of Pittsburgh, Pittsburgh, PA 15261, USA; Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, PA 15261, USA; Drug Discovery Institute, University of Pittsburgh, Pittsburgh, PA 15261, USA; Department of Computational Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15261, USA; Department of Structural Biology, School of Medicine, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Zhiwei Feng
  - Department of Pharmaceutical Sciences, Computational Chemical Genomics Screening Center, and Pharmacometrics & System Pharmacology PharmacoAnalytics, School of Pharmacy, University of Pittsburgh, Pittsburgh, PA 15261, USA; Center of Excellence for Computational Drug Abuse Research, University of Pittsburgh, Pittsburgh, PA 15261, USA
- Li Meng
  - School of Pharmacy, Jiangsu Vocational College of Medicine, Yancheng, 224005, China
26
Joubbi S, Micheli A, Milazzo P, Maccari G, Ciano G, Cardamone D, Medini D. Antibody design using deep learning: from sequence and structure design to affinity maturation. Brief Bioinform 2024; 25:bbae307. [PMID: 38960409 PMCID: PMC11221890 DOI: 10.1093/bib/bbae307] [Received: 03/03/2024] [Revised: 05/20/2024] [Accepted: 06/12/2024] [Indexed: 07/05/2024] Open
Abstract
Deep learning has achieved impressive results in various fields such as computer vision and natural language processing, making it a powerful tool in biology. Its applications now encompass cellular image classification, genomic studies and drug discovery. While drug development traditionally focused deep learning applications on small molecules, recent innovations have incorporated it in the discovery and development of biological molecules, particularly antibodies. Researchers have devised novel techniques to streamline antibody development, combining in vitro and in silico methods. In particular, computational power expedites lead candidate generation, scaling and potential antibody development against complex antigens. This survey highlights significant advancements in protein design and optimization, specifically focusing on antibodies. This includes various aspects such as design, folding, antibody-antigen interactions docking and affinity maturation.
Affiliation(s)
- Sara Joubbi
  - Department of Computer Science, University of Pisa, Largo B. Pontecorvo, 3, 56127, Pisa, Italy
  - Data Science for Health (DaScH) Lab, Fondazione Toscana Life Sciences, Via Fiorentina, 1, 53100, Siena, Italy
- Alessio Micheli
  - Department of Computer Science, University of Pisa, Largo B. Pontecorvo, 3, 56127, Pisa, Italy
- Paolo Milazzo
  - Department of Computer Science, University of Pisa, Largo B. Pontecorvo, 3, 56127, Pisa, Italy
- Giuseppe Maccari
  - Data Science for Health (DaScH) Lab, Fondazione Toscana Life Sciences, Via Fiorentina, 1, 53100, Siena, Italy
- Giorgio Ciano
  - Data Science for Health (DaScH) Lab, Fondazione Toscana Life Sciences, Via Fiorentina, 1, 53100, Siena, Italy
- Dario Cardamone
  - Data Science for Health (DaScH) Lab, Fondazione Toscana Life Sciences, Via Fiorentina, 1, 53100, Siena, Italy
- Duccio Medini
  - Data Science for Health (DaScH) Lab, Fondazione Toscana Life Sciences, Via Fiorentina, 1, 53100, Siena, Italy
27
Jing H, Gao Z, Xu S, Shen T, Peng Z, He S, You T, Ye S, Lin W, Sun S. Accurate prediction of antibody function and structure using bio-inspired antibody language model. Brief Bioinform 2024; 25:bbae245. [PMID: 38797969 PMCID: PMC11128484 DOI: 10.1093/bib/bbae245] [Received: 10/12/2023] [Revised: 04/08/2024] [Accepted: 05/07/2024] [Indexed: 05/29/2024] Open
Abstract
In recent decades, antibodies have emerged as indispensable therapeutics for combating diseases, particularly viral infections. However, their development has been hindered by limited structural information and labor-intensive engineering processes. Fortunately, significant advancements in deep learning methods have facilitated the precise prediction of protein structure and function by leveraging co-evolution information from homologous proteins. Despite these advances, predicting the conformation of antibodies remains challenging due to their unique evolution and the high flexibility of their antigen-binding regions. Here, to address this challenge, we present the Bio-inspired Antibody Language Model (BALM). This model is trained on a vast dataset comprising 336 million 40% nonredundant unlabeled antibody sequences, capturing both unique and conserved properties specific to antibodies. Notably, BALM showcases exceptional performance across four antigen-binding prediction tasks. Moreover, we introduce BALMFold, an end-to-end method derived from BALM, capable of swiftly predicting full atomic antibody structures from individual sequences. Remarkably, BALMFold outperforms those well-established methods like AlphaFold2, IgFold, ESMFold and OmegaFold in the antibody benchmark, demonstrating significant potential to advance innovative engineering and streamline therapeutic antibody development by reducing the need for unnecessary trials. The BALMFold structure prediction server is freely available at https://beamlab-sh.com/models/BALMFold.
Affiliation(s)
- Hongtai Jing
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200032, China
- Zhengtao Gao
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Sheng Xu
  - Shanghai AI Laboratory, Shanghai 200232, China
- Tao Shen
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
  - Zelixir Biotech, Shanghai 201206, China
- Zhangzhi Peng
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Shwai He
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Tao You
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Shuang Ye
  - Department of Gynecologic Oncology, Fudan University Shanghai Cancer Center, Shanghai 200032, China
  - Department of Oncology, Shanghai Medical College, Fudan University, Shanghai 200032, China
- Wei Lin
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
  - Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
  - MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200032, China
  - Shanghai AI Laboratory, Shanghai 200232, China
  - School of Mathematical Sciences and Shanghai Center for Mathematical Sciences, Fudan University, Shanghai 200433, China
- Siqi Sun
  - Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
  - Shanghai AI Laboratory, Shanghai 200232, China
28
Wang M, Patsenker J, Li H, Kluger Y, Kleinstein SH. Supervised fine-tuning of pre-trained antibody language models improves antigen specificity prediction. bioRxiv 2024:2024.05.13.593807. [PMID: 38798340 PMCID: PMC11118465 DOI: 10.1101/2024.05.13.593807] [Indexed: 05/29/2024]
Abstract
Antibodies play a crucial role in adaptive immune responses by determining B cell specificity to antigens and focusing immune function on target pathogens. Accurate prediction of antibody-antigen specificity directly from antibody sequencing data would be a great aid in understanding immune responses, guiding vaccine design, and developing antibody-based therapeutics. In this study, we present a method of supervised fine-tuning for antibody language models, which improves on previous results in binding specificity prediction to SARS-CoV-2 spike protein and influenza hemagglutinin. We perform supervised fine-tuning on four pre-trained antibody language models to predict specificity to these antigens and demonstrate that fine-tuned language model classifiers exhibit enhanced predictive accuracy compared to classifiers trained on pre-trained model embeddings. The change of model attention activations after supervised fine-tuning suggested that this performance was driven by an increased model focus on the complementarity determining regions (CDRs). Application of the supervised fine-tuned models to BCR repertoire data demonstrated that these models could recognize the specific responses elicited by influenza and SARS-CoV-2 vaccination. Overall, our study highlights the benefits of supervised fine-tuning on pre-trained antibody language models as a mechanism to improve antigen specificity prediction.
Affiliation(s)
- Meng Wang
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
- Jonathan Patsenker
  - Program in Applied Mathematics, Yale University, New Haven, Connecticut, United States of America
- Henry Li
  - Program in Applied Mathematics, Yale University, New Haven, Connecticut, United States of America
- Yuval Kluger
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
  - Program in Applied Mathematics, Yale University, New Haven, Connecticut, United States of America
  - Department of Pathology, Yale School of Medicine, New Haven, Connecticut, United States of America
- Steven H Kleinstein
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut, United States of America
  - Department of Pathology, Yale School of Medicine, New Haven, Connecticut, United States of America
  - Department of Immunobiology, Yale School of Medicine, New Haven, Connecticut, United States of America
29
Burbach SM, Briney B. Improving antibody language models with native pairing. Patterns (New York, N.Y.) 2024; 5:100967. [PMID: 38800360 PMCID: PMC11117052 DOI: 10.1016/j.patter.2024.100967] [Received: 11/09/2023] [Revised: 01/25/2024] [Accepted: 03/08/2024] [Indexed: 05/29/2024]
Abstract
Existing antibody language models are limited by their use of unpaired antibody sequence data. A recently published dataset of ∼1.6 × 10^6 natively paired human antibody sequences offers a unique opportunity to evaluate how antibody language models are improved by training with native pairs. We trained three baseline antibody language models (BALM), using natively paired (BALM-paired), randomly-paired (BALM-shuffled), or unpaired (BALM-unpaired) sequences from this dataset. To address the paucity of paired sequences, we additionally fine-tuned ESM (evolutionary scale modeling)-2 with natively paired antibody sequences (ft-ESM). We provide evidence that training with native pairs allows the model to learn immunologically relevant features that span the light and heavy chains, which cannot be simulated by training with random pairs. We additionally show that training with native pairs improves model performance on a variety of metrics, including the ability of the model to classify antibodies by pathogen specificity.
Affiliation(s)
- Sarah M. Burbach
  - Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA 92037, USA
  - Center for Viral Systems Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
  - Multi-Omics Vaccine Evaluation Consortium, The Scripps Research Institute, La Jolla, CA 92037, USA
- Bryan Briney
  - Department of Immunology and Microbiology, The Scripps Research Institute, La Jolla, CA 92037, USA
  - Center for Viral Systems Biology, The Scripps Research Institute, La Jolla, CA 92037, USA
  - Multi-Omics Vaccine Evaluation Consortium, The Scripps Research Institute, La Jolla, CA 92037, USA
  - Scripps Consortium for HIV/AIDS Vaccine Development, The Scripps Research Institute, La Jolla, CA 92037, USA
  - San Diego Center for AIDS Research, The Scripps Research Institute, La Jolla, CA 92037, USA
30
Townsend DR, Towers DM, Lavinder JJ, Ippolito GC. Innovations and trends in antibody repertoire analysis. Curr Opin Biotechnol 2024; 86:103082. [PMID: 38428225 DOI: 10.1016/j.copbio.2024.103082] [Received: 10/06/2023] [Revised: 12/07/2023] [Accepted: 01/28/2024] [Indexed: 03/03/2024]
Abstract
Monoclonal antibodies have revolutionized the treatment of human diseases, which has made them the fastest-growing class of therapeutics, with global sales expected to reach $346.6 billion USD by 2028. Advances in antibody engineering and development have led to the creation of increasingly sophisticated antibody-based therapeutics (e.g. bispecific antibodies and chimeric antigen receptor T cells). However, approaches for antibody discovery have remained comparatively grounded in conventional yet reliable in vitro assays. Breakthrough developments in high-throughput single B-cell sequencing and immunoglobulin proteomic serology, however, have enabled the identification of high-affinity antibodies directly from endogenous B cells or circulating immunoglobulin produced in vivo. Moreover, advances in artificial intelligence offer vast potential for antibody discovery and design with large-scale repertoire datasets positioned as the optimal source of training data for such applications. We highlight advances and recent trends in how these technologies are being applied to antibody repertoire analysis.
Affiliation(s)
- Douglas R Townsend
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
- Dalton M Towers
  - Department of Chemical Engineering, The University of Texas at Austin, Austin, TX, USA
- Jason J Lavinder
  - Department of Chemical Engineering, The University of Texas at Austin, Austin, TX, USA
- Gregory C Ippolito
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
31
Chomicz D, Kończak J, Wróbel S, Satława T, Dudzic P, Janusz B, Tarkowski M, Deszyński P, Gawłowski T, Kostyn A, Orłowski M, Klaus T, Schulte L, Martin K, Comeau SR, Krawczyk K. Benchmarking antibody clustering methods using sequence, structural, and machine learning similarity measures for antibody discovery applications. Front Mol Biosci 2024; 11:1352508. [PMID: 38606289 PMCID: PMC11008471 DOI: 10.3389/fmolb.2024.1352508] [Received: 12/08/2023] [Accepted: 02/09/2024] [Indexed: 04/13/2024] Open
Abstract
Antibodies are proteins produced by our immune system that have been harnessed as biotherapeutics. The discovery of antibody-based therapeutics relies on analyzing large volumes of diverse sequences coming from phage display or animal immunizations. Identification of suitable therapeutic candidates is achieved by grouping the sequences by their similarity and subsequent selection of a diverse set of antibodies for further tests. Such groupings are typically created using sequence-similarity measures alone. Maximizing diversity in selected candidates is crucial to reducing the number of tests of molecules with near-identical properties. With the advances in structural modeling and machine learning, antibodies can now be grouped across other diversity dimensions, such as predicted paratopes or three-dimensional structures. Here we benchmarked antibody grouping methods using clonotype, sequence, paratope prediction, structure prediction, and embedding information. The results were benchmarked on two tasks: binder detection and epitope mapping. We demonstrate that on binder detection no method appears to outperform the others, while on epitope mapping, clonotype, paratope, and embedding clusterings are top performers. Most importantly, all the methods propose orthogonal groupings, offering more diverse pools of candidates when using multiple methods than any single method alone. To facilitate exploring the diversity of antibodies using different methods, we have created an online tool, CLAP, available at clap.naturalantibody.com, that allows users to group, contrast, and visualize antibodies using the different grouping methods.
Affiliation(s)
- Sonia Wróbel
  - NaturalAntibody, Szczecin, West Pomeranian, Poland
- Paweł Dudzic
  - NaturalAntibody, Szczecin, West Pomeranian, Poland
- Marek Orłowski
  - Pure Biologics, Wrocław, Poland
  - Department of Biochemistry, Molecular Biology and Biotechnology, Faculty of Chemistry, Wrocław University of Science and Technology, Wrocław, Poland
- Lukas Schulte
  - Global Computational Biology & Digital Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach, Germany
- Kyle Martin
  - Biotherapeutics Discovery, Boehringer Ingelheim, Biberach, Germany
32
Li S, Meng X, Li R, Huang B, Wang X. NanoBERTa-ASP: predicting nanobody paratope based on a pretrained RoBERTa model. BMC Bioinformatics 2024; 25:122. [PMID: 38515052 PMCID: PMC10956323 DOI: 10.1186/s12859-024-05750-5] [Received: 11/01/2023] [Accepted: 03/18/2024] [Indexed: 03/23/2024] Open
Abstract
BACKGROUND: Nanobodies, also known as VHH or single-domain antibodies, are unique antibody fragments derived solely from heavy chains. They offer advantages of small molecules and conventional antibodies, making them promising therapeutics. The paratope is the specific region on an antibody that binds to an antigen. Paratope prediction involves the identification and characterization of the antigen-binding site on an antibody. This process is crucial for understanding the specificity and affinity of antibody-antigen interactions. Various computational methods and experimental approaches have been developed to predict and analyze paratopes, contributing to advancements in antibody engineering, drug development, and immunotherapy. However, existing predictive models trained on traditional antibodies may not be suitable for nanobodies. Additionally, the limited availability of nanobody datasets poses challenges in constructing accurate models.
METHODS: To address these challenges, we have developed a novel nanobody prediction model, named NanoBERTa-ASP (Antibody Specificity Prediction), which is specifically designed for predicting nanobody-antigen binding sites. The model adopts a training strategy more suitable for nanobodies, based on an advanced natural language processing (NLP) model called BERT (Bidirectional Encoder Representations from Transformers). To be more specific, the model utilizes a masked language modeling approach named RoBERTa (Robustly Optimized BERT Pretraining Approach) to learn the contextual information of the nanobody sequence and predict its binding site.
RESULTS: NanoBERTa-ASP achieved exceptional performance in predicting nanobody binding sites, outperforming existing methods, indicating its proficiency in capturing sequence information specific to nanobodies and accurately identifying their binding sites. Furthermore, NanoBERTa-ASP provides insights into the interaction mechanisms between nanobodies and antigens, contributing to a better understanding of nanobodies and facilitating the design and development of nanobodies with therapeutic potential.
CONCLUSION: NanoBERTa-ASP represents a significant advancement in nanobody paratope prediction. Its superior performance highlights the potential of deep learning approaches in nanobody research. By leveraging the increasing volume of nanobody data, NanoBERTa-ASP can further refine its predictions, enhance its performance, and contribute to the development of novel nanobody-based therapeutics. GitHub repository: https://github.com/WangLabforComputationalBiology/NanoBERTa-ASP.
Affiliation(s)
- Shangru Li
  - College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China
- Xiangpeng Meng
  - College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China
- Rui Li
  - College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China
- Bingding Huang
  - College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China
- Xin Wang
  - College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China
33
Lee H, Shin K, Lee Y, Lee S, Lee S, Lee E, Kim SW, Shin HY, Kim JH, Chung J, Kwon S. Identification of B cell subsets based on antigen receptor sequences using deep learning. Front Immunol 2024; 15:1342285. [PMID: 38576618 PMCID: PMC10991714 DOI: 10.3389/fimmu.2024.1342285] [Received: 11/21/2023] [Accepted: 03/07/2024] [Indexed: 04/06/2024] Open
Abstract
B cell receptors (BCRs) denote antigen specificity, while corresponding cell subsets indicate B cell functionality. Since each B cell uniquely encodes this combination, physical isolation and subsequent processing of individual B cells become indispensable to identify both attributes. However, this approach accompanies high costs and inevitable information loss, hindering high-throughput investigation of B cell populations. Here, we present BCR-SORT, a deep learning model that predicts cell subsets from their corresponding BCR sequences by leveraging B cell activation and maturation signatures encoded within BCR sequences. Subsequently, BCR-SORT is demonstrated to improve reconstruction of BCR phylogenetic trees, and reproduce results consistent with those verified using physical isolation-based methods or prior knowledge. Notably, when applied to BCR sequences from COVID-19 vaccine recipients, it revealed inter-individual heterogeneity of evolutionary trajectories towards Omicron-binding memory B cells. Overall, BCR-SORT offers great potential to improve our understanding of B cell responses.
Affiliation(s)
- Hyunho Lee
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea
- Kyoungseob Shin
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea
- Yongju Lee
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea
- Soobin Lee
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea
- Seungyoun Lee
  - Department of Biochemistry and Molecular Biology, Seoul National University College of Medicine, Seoul, Republic of Korea
  - Department of Biomedical Science, Seoul National University College of Medicine, Seoul, Republic of Korea
- Eunjae Lee
  - Department of Biochemistry and Molecular Biology, Seoul National University College of Medicine, Seoul, Republic of Korea
  - Department of Biomedical Science, Seoul National University College of Medicine, Seoul, Republic of Korea
- Seung Woo Kim
  - Department of Neurology, Yonsei University College of Medicine, Seoul, Republic of Korea
- Ha Young Shin
  - Department of Neurology, Yonsei University College of Medicine, Seoul, Republic of Korea
- Jong Hoon Kim
  - Department of Dermatology and Cutaneous Biology Research Institute, Gangnam Severance Hospital, Yonsei University College of Medicine, Seoul, Republic of Korea
- Junho Chung
  - Department of Biochemistry and Molecular Biology, Seoul National University College of Medicine, Seoul, Republic of Korea
  - Department of Biomedical Science, Seoul National University College of Medicine, Seoul, Republic of Korea
  - Cancer Research Institute, Seoul National University College of Medicine, Seoul, Republic of Korea
- Sunghoon Kwon
  - Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea
  - Interdisciplinary Program in Bioengineering, Seoul National University, Seoul, Republic of Korea
  - Bio-MAX Institute, Seoul National University, Seoul, Republic of Korea
  - Inter-University Semiconductor Research Center, Seoul National University, Seoul, Republic of Korea
| |
|
34
|
Hadsund JT, Satława T, Janusz B, Shan L, Zhou L, Röttger R, Krawczyk K. nanoBERT: a deep learning model for gene agnostic navigation of the nanobody mutational space. Bioinform Adv 2024; 4:vbae033. [PMID: 38560554 PMCID: PMC10978573 DOI: 10.1093/bioadv/vbae033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 02/05/2024] [Accepted: 03/04/2024] [Indexed: 04/04/2024]
Abstract
Motivation: Nanobodies are a subclass of immunoglobulins whose binding site consists of only one peptide chain, bestowing favorable biophysical properties. Recently, the first nanobody therapy was approved, paving the way for further clinical applications of this antibody format. Further development of nanobody-based therapeutics could be streamlined by computational methods. One such method is infilling: positional prediction of biologically feasible mutations in nanobodies. Being able to identify possible positional substitutions based on sequence context facilitates functional design of such molecules. Results: Here we present nanoBERT, a nanobody-specific transformer to predict amino acids in a given position in a query sequence. We demonstrate the need to develop such a machine-learning-based protocol, as opposed to gene-specific positional statistics, since an appropriate genetic reference is not available. We benchmark nanoBERT with respect to human-based language models and ESM-2, demonstrating the benefit of domain-specific language models. We also demonstrate the benefit of employing nanobody-specific predictions for fine-tuning on an experimentally measured thermostability dataset. We hope that nanoBERT will help engineers in a range of predictive tasks for designing therapeutic nanobodies. Availability and implementation: https://huggingface.co/NaturalAntibody/.
Affiliation(s)
| | | | | | - Lu Shan
- Alector Therapeutics, San Francisco, CA, 94080, United States
| | - Li Zhou
- Alector Therapeutics, San Francisco, CA, 94080, United States
| | - Richard Röttger
- Department of Mathematics and Computer Science, University of Southern Denmark, Odense, 5230, Denmark
| | | |
|
35
|
Barton J, Gaspariunas A, Galson JD, Leem J. Building Representation Learning Models for Antibody Comprehension. Cold Spring Harb Perspect Biol 2024; 16:a041462. [PMID: 38012013 PMCID: PMC10910360 DOI: 10.1101/cshperspect.a041462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Antibodies are versatile proteins with both the capacity to bind a broad range of targets and a proven track record as some of the most successful therapeutics. However, the development of novel antibody therapeutics is a lengthy and costly process. It is challenging to predict the functional and biophysical properties of antibodies from their amino acid sequence alone, requiring numerous experiments for full characterization. Machine learning, specifically deep representation learning, has emerged as a family of methods that can complement wet lab approaches and accelerate the overall discovery and engineering process. Here, we review advances in antibody sequence representation learning, and how this has improved antibody structure prediction and facilitated antibody optimization. We discuss challenges in the development and implementation of such models, such as the lack of publicly available, well-curated antibody function data and highlight opportunities for improvement. These and future advances in machine learning for antibody sequences have the potential to increase the success rate in developing new therapeutics, resulting in broader access to transformative medicines and improved patient outcomes.
Affiliation(s)
- Justin Barton
- Alchemab Therapeutics Ltd, London N1C 4AX, United Kingdom
| | | | - Jacob D Galson
- Alchemab Therapeutics Ltd, London N1C 4AX, United Kingdom
| | - Jinwoo Leem
- Alchemab Therapeutics Ltd, London N1C 4AX, United Kingdom
| |
|
36
|
Wang H, Hao X, He Y, Fan L. AbImmPred: An immunogenicity prediction method for therapeutic antibodies using AntiBERTy-based sequence features. PLoS One 2024; 19:e0296737. [PMID: 38394128 PMCID: PMC10889861 DOI: 10.1371/journal.pone.0296737] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2023] [Accepted: 12/18/2023] [Indexed: 02/25/2024] Open
Abstract
Because therapeutic antibodies can induce unwanted immune responses in clinical applications, immunogenicity is an important factor to be considered in the development of antibody therapeutics. To a certain extent, there is a lag in using wet-lab experiments to test immunogenicity during the development of antibody therapeutics. A computational method that predicts immunogenicity as soon as the antibody sequence is designed is of great significance for early-stage screening and for reducing the risk of antibody therapeutics development. In this study, a computational immunogenicity prediction method was proposed on the basis of AntiBERTy-based features of amino acid sequences in the antibody variable region. The AntiBERTy-based sequence features were first calculated using the AntiBERTy pre-trained model. Principal component analysis (PCA) was then applied to reduce the extracted features to two dimensions to obtain the final features. AutoGluon was then used to train multiple machine learning models, and the best one, the weighted ensemble model, was obtained through 5-fold cross-validation on the collected data. The data contain 199 commercial therapeutic antibodies, of which 177 samples were used for model training and 5-fold cross-validation, and the remaining 22 samples were used as an independent test dataset to evaluate the performance of the constructed model and compare it with other prediction methods. Test results show that the proposed method outperforms the comparison method with 0.7273 accuracy on the independent test dataset, which is 9.09% higher than the comparison method. The corresponding web server is available through the official website of GenScript Co., Ltd., https://www.genscript.com/tools/antibody-immunogenicity.
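The test-set figures in this abstract can be sanity-checked with a little arithmetic. The sketch below is not from the paper; it assumes that "9.09% higher" means 9.09 percentage points of accuracy on the 22-sample test set, which the abstract does not state explicitly. The names `nearest_count`, `proposed_correct`, `margin`, and `baseline_correct` are illustrative only.

```python
# Figures quoted in the abstract.
TOTAL = 199             # commercial therapeutic antibodies collected
TRAIN = 177             # used for training and 5-fold cross-validation
TEST = TOTAL - TRAIN    # held out as the independent test set


def nearest_count(fraction: float, n: int) -> int:
    """Whole number of samples closest to fraction * n."""
    return round(fraction * n)


proposed_correct = nearest_count(0.7273, TEST)
# Assumption: "9.09% higher" means 9.09 percentage points of accuracy.
margin = nearest_count(0.0909, TEST)
baseline_correct = proposed_correct - margin

print(f"test set size: {TEST}")
print(f"proposed method: {proposed_correct}/{TEST} correct")
print(f"comparison method: {baseline_correct}/{TEST} correct")
```

Under that reading, the reported margin corresponds to two test antibodies, which is worth keeping in mind when comparing methods on a test set this small.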
Affiliation(s)
- Hong Wang
- Production and R&D Center I of Life Science Services, GenScript Biotech Corporation, Nanjing, China
| | - Xiaohu Hao
- Production and R&D Center I of Life Science Services, GenScript Biotech Corporation, Nanjing, China
| | - Yuzhuo He
- Production and R&D Center I of Life Science Services, GenScript Biotech Corporation, Nanjing, China
| | - Long Fan
- Production and R&D Center I of Life Science Services, GenScript Biotech Corporation, Nanjing, China
- Production and R&D Center I of Life Science Services, GenScript (Shanghai) Biotech Co., Ltd., Shanghai, China
| |
|
37
|
Wang M, Patsenker J, Li H, Kluger Y, Kleinstein S. Language model-based B cell receptor sequence embeddings can effectively encode receptor specificity. Nucleic Acids Res 2024; 52:548-557. [PMID: 38109302 PMCID: PMC10810273 DOI: 10.1093/nar/gkad1128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Revised: 10/18/2023] [Accepted: 11/11/2023] [Indexed: 12/20/2023] Open
Abstract
High throughput sequencing of B cell receptors (BCRs) is increasingly applied to study the immense diversity of antibodies. Learning biologically meaningful embeddings of BCR sequences is beneficial for predictive modeling. Several embedding methods have been developed for BCRs, but no direct performance benchmarking exists. Moreover, the impact of the input sequence length and paired-chain information on the prediction remains to be explored. We evaluated the performance of multiple embedding models to predict BCR sequence properties and receptor specificity. Despite the differences in model architectures, most embeddings effectively capture BCR sequence properties and specificity. BCR-specific embeddings slightly outperform general protein language models in predicting specificity. In addition, incorporating full-length heavy chains and paired light chain sequences improves the prediction performance of all embeddings. This study provides insights into the properties of BCR embeddings to improve downstream prediction applications for antibody analysis and discovery.
Affiliation(s)
- Meng Wang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
| | | | - Henry Li
- Program in Applied Mathematics, Yale University, New Haven, CT, USA
| | - Yuval Kluger
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Program in Applied Mathematics, Yale University, New Haven, CT, USA
- Department of Pathology, Yale School of Medicine, New Haven, CT, USA
| | - Steven H Kleinstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Department of Pathology, Yale School of Medicine, New Haven, CT, USA
- Department of Immunobiology, Yale School of Medicine, New Haven, CT, USA
| |
|
38
|
Bravi B. Development and use of machine learning algorithms in vaccine target selection. NPJ Vaccines 2024; 9:15. [PMID: 38242890 PMCID: PMC10798987 DOI: 10.1038/s41541-023-00795-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Accepted: 12/07/2023] [Indexed: 01/21/2024] Open
Abstract
Computer-aided discovery of vaccine targets has become a cornerstone of rational vaccine design. In this article, I discuss how Machine Learning (ML) can inform and guide key computational steps in rational vaccine design concerned with the identification of B and T cell epitopes and correlates of protection. I provide examples of ML models, as well as types of data and predictions for which they are built. I argue that interpretable ML has the potential to improve the identification of immunogens also as a tool for scientific discovery, by helping elucidate the molecular processes underlying vaccine-induced immune responses. I outline the limitations and challenges in terms of data availability and method development that need to be addressed to bridge the gap between advances in ML predictions and their translational application to vaccine design.
Affiliation(s)
- Barbara Bravi
- Department of Mathematics, Imperial College London, London, SW7 2AZ, UK.
| |
|
39
|
Hutchinson M, Ruffolo JA, Haskins N, Iannotti M, Vozza G, Pham T, Mehzabeen N, Shandilya H, Rickert K, Croasdale-Wood R, Damschroder M, Fu Y, Dippel A, Gray JJ, Kaplan G. Toward enhancement of antibody thermostability and affinity by computational design in the absence of antigen. MAbs 2024; 16:2362775. [PMID: 38899735 PMCID: PMC11195458 DOI: 10.1080/19420862.2024.2362775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Accepted: 05/29/2024] [Indexed: 06/21/2024] Open
Abstract
Over the past two decades, therapeutic antibodies have emerged as a rapidly expanding domain within the field of biologics. In silico tools that can streamline the process of antibody discovery and optimization are critical to support a pipeline that is growing more numerous and complex every year. High-quality structural information remains critical for the antibody optimization process, but antibody-antigen complex structures are often unavailable and in silico antibody docking methods are still unreliable. In this study, DeepAb, a deep learning model for predicting antibody Fv structure directly from sequence, was used in conjunction with single-point experimental deep mutational scanning (DMS) enrichment data to design 200 potentially optimized variants of an anti-hen egg lysozyme (HEL) antibody. We sought to determine whether DeepAb-designed variants containing combinations of beneficial mutations from the DMS exhibit enhanced thermostability and whether this optimization affected their developability profile. The 200 variants were produced through a robust high-throughput method and tested for thermal and colloidal stability (Tonset, Tm, Tagg), affinity (KD) relative to the parental antibody, and for developability parameters (nonspecific binding, aggregation propensity, self-association). Of the designed clones, 91% and 94% exhibited increased thermal and colloidal stability and affinity, respectively. Of these, 10% showed a significantly increased affinity for HEL (5- to 21-fold increase) and thermostability (>2.5 °C increase in Tm1), with most clones retaining the favorable developability profile of the parental antibody. Additional in silico tests suggest that these methods would enrich for binding affinity even without first collecting experimental DMS measurements. These data open the possibility of in silico antibody optimization without the need to predict the antibody-antigen interface, which is notoriously difficult in the absence of crystal structures.
Affiliation(s)
- Mark Hutchinson
- Biologics Engineering, R&D, AstraZeneca, Gaithersburg, MD, USA
| | - Jeffrey A. Ruffolo
- Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, USA
- Profluent Bio, Machine Learning, Berkeley, CA, USA
| | - Nantaporn Haskins
- Biologics Engineering, R&D, AstraZeneca, Gaithersburg, MD, USA
- Currently at Protein Engineering, R&D, Amgen Inc, Rockville, MD, USA
| | - Michael Iannotti
- Biologics Engineering, R&D, AstraZeneca, Gaithersburg, MD, USA
- Honigman LLP, Intellectual Property, Washington, DC, United States
| | - Giuliana Vozza
- Biopharmaceuticals Development, R&D, AstraZeneca, Cambridge, UK
| | - Tony Pham
- Biologics Engineering, R&D, AstraZeneca, Gaithersburg, MD, USA
| | | | | | - Keith Rickert
- Biologics Engineering, R&D, AstraZeneca, Gaithersburg, MD, USA
| | | | | | - Ying Fu
- Biologics Engineering, R&D, AstraZeneca, Gaithersburg, MD, USA
| | - Andrew Dippel
- Biologics Engineering, R&D, AstraZeneca, Gaithersburg, MD, USA
| | - Jeffrey J. Gray
- Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, USA
- Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, USA
| | - Gilad Kaplan
- Biologics Engineering, R&D, AstraZeneca, Gaithersburg, MD, USA
| |
|
40
|
Dudzic P, Chomicz D, Kończak J, Satława T, Janusz B, Wrobel S, Gawłowski T, Jaszczyszyn I, Bielska W, Demharter S, Spreafico R, Schulte L, Martin K, Comeau SR, Krawczyk K. Large-scale data mining of four billion human antibody variable regions reveals convergence between therapeutic and natural antibodies that constrains search space for biologics drug discovery. MAbs 2024; 16:2361928. [PMID: 38844871 PMCID: PMC11164219 DOI: 10.1080/19420862.2024.2361928] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Accepted: 05/27/2024] [Indexed: 06/12/2024] Open
Abstract
The naïve human antibody repertoire has theoretical access to an estimated >10^15 antibodies. Identifying subsets of this prohibitively large space where therapeutically relevant antibodies may be found is useful for development of these agents. It was previously demonstrated that, despite the immense sequence space, different individuals can produce the same antibodies. It was also shown that therapeutic antibodies, which typically follow seemingly unnatural development processes, can arise independently naturally. To check for biases in how the sequence space is explored, we data mined public repositories to identify 220 bioprojects with a combined seven billion reads. Of these, we created a subset of human bioprojects that we make available as the AbNGS database (https://naturalantibody.com/ngs/). AbNGS contains 135 bioprojects with four billion productive human heavy variable region sequences and 385 million unique complementarity-determining region (CDR)-H3s. We find that 270,000 (0.07% of 385 million) unique CDR-H3s are highly public in that they occur in at least five of 135 bioprojects. Of 700 unique therapeutic CDR-H3s, 6% have direct matches in the small set of 270,000. This observation extends to a match between CDR-H3 and V-gene call as well. Thus, the subspace of shared ('public') CDR-H3s shows utility for serving as a starting point for therapeutic antibody design.
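The "highly public" criterion described here (a CDR-H3 occurring in at least five bioprojects) amounts to a simple counting rule. The sketch below illustrates that rule on made-up toy data; the function name `public_cdrh3s`, the project labels, and the sequences are hypothetical and are not taken from AbNGS or its tooling. It also recomputes the 0.07% share quoted in the abstract.

```python
from collections import Counter


def public_cdrh3s(projects: dict[str, set[str]], min_projects: int = 5) -> set[str]:
    """Return CDR-H3s that occur in at least `min_projects` bioprojects."""
    counts: Counter[str] = Counter()
    for cdrs in projects.values():
        # Each project holds a set, so a sequence counts once per project.
        counts.update(cdrs)
    return {seq for seq, n in counts.items() if n >= min_projects}


# Toy data (hypothetical): six bioprojects with a few CDR-H3 strings each.
toy_projects = {
    "P1": {"ARDYW", "AKDLW", "ARGGY"},
    "P2": {"ARDYW", "AKDLW"},
    "P3": {"ARDYW", "AKDLW", "ARGGY"},
    "P4": {"ARDYW", "AKDLW"},
    "P5": {"ARDYW", "AKDLW"},
    "P6": {"AKDLW"},
}

print(sorted(public_cdrh3s(toy_projects)))

# Share of highly public CDR-H3s quoted in the abstract.
share_percent = 270_000 / 385_000_000 * 100
print(f"{share_percent:.2f}% of unique CDR-H3s are highly public")
```

In the toy data, `ARGGY` appears in only two projects and is therefore excluded, while the other two sequences meet the five-project threshold.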
Affiliation(s)
| | | | | | | | | | | | | | | | | | | | | | - Lukas Schulte
- Global Computational Biology & Digital Sciences, Boehringer Ingelheim Pharma GmbH & Co. KG, Biberach an der Riß, Germany
| | - Kyle Martin
- Biotherapeutics Discovery, Boehringer Ingelheim, Ridgefield, CT, USA
| | - Stephen R. Comeau
- Biotherapeutics Discovery, Boehringer Ingelheim, Ridgefield, CT, USA
| | | |
|
41
|
Chungyoun M, Gray JJ. AI Models for Protein Design are Driving Antibody Engineering. Curr Opin Biomed Eng 2023; 28:100473. [PMID: 37484815 PMCID: PMC10361400 DOI: 10.1016/j.cobme.2023.100473] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Indexed: 07/25/2023]
Abstract
Therapeutic antibody engineering seeks to identify antibody sequences with specific binding to a target and optimized drug-like properties. When guided by deep learning, antibody generation methods can draw on prior knowledge and experimental efforts to improve this process. By leveraging the increasing quantity and quality of predicted structures of antibodies and target antigens, powerful structure-based generative models are emerging. In this review, we tie the advancements in deep learning-based protein structure prediction and design to the study of antibody therapeutics.
Affiliation(s)
- Michael Chungyoun
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, 21287, USA
| | - Jeffrey J Gray
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD, 21287, USA
- Program in Molecular Biophysics, Institute for Nanobiotechnology, and Center for Computational Biology, Johns Hopkins University, Baltimore, MD, 21287, USA
| |
|
42
|
Robbins M. Therapies for Tau-associated neurodegenerative disorders: targeting molecules, synapses, and cells. Neural Regen Res 2023; 18:2633-2637. [PMID: 37449601 PMCID: PMC10358644 DOI: 10.4103/1673-5374.373670] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/05/2023] [Revised: 02/14/2023] [Accepted: 03/15/2023] [Indexed: 07/18/2023] Open
Abstract
Advances in experimental and computational technologies continue to grow rapidly to provide novel avenues for the treatment of neurodegenerative disorders. Despite this, there remain only a handful of drugs that have shown success in late-stage clinical trials for Tau-associated neurodegenerative disorders. The most commonly prescribed treatments are symptomatic treatments such as cholinesterase inhibitors and N-methyl-D-aspartate receptor blockers that were approved for use in Alzheimer's disease. As diagnostic screening can detect disorders at earlier time points, the field needs pre-symptomatic treatments that can prevent, or significantly delay the progression of these disorders (Koychev et al., 2019). These approaches may be different from late-stage treatments that may help to ameliorate symptoms and slow progression once symptoms have become more advanced should early diagnostic screening fail. This mini-review will highlight five key avenues of academic and industrial research for identifying therapeutic strategies to treat Tau-associated neurodegenerative disorders. These avenues include investigating (1) the broad class of chemicals termed "small molecules"; (2) adaptive immunity through both passive and active antibody treatments; (3) innate immunity with an emphasis on microglial modulation; (4) synaptic compartments with the view that Tau-associated neurodegenerative disorders are synaptopathies. Although this mini-review will focus on Alzheimer's disease due to its prevalence, it will also argue the need to target other tauopathies, as through understanding Alzheimer's disease as a Tau-associated neurodegenerative disorder, we may be able to generalize treatment options. For this reason, added detail linking back specifically to Tau protein as a direct therapeutic target will be added to each topic.
Affiliation(s)
- Miranda Robbins
- MRC Laboratory of Molecular Biology, Cambridge Biomedical Campus, Francis Crick Ave, Trumpington, Cambridge, UK; University of Cambridge, Department of Zoology, Cambridge, UK
| |
|
43
|
Shuai RW, Ruffolo JA, Gray JJ. IgLM: Infilling language modeling for antibody sequence design. Cell Syst 2023; 14:979-989.e4. [PMID: 37909045 PMCID: PMC11018345 DOI: 10.1016/j.cels.2023.10.001] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 01/09/2023] [Revised: 06/14/2023] [Accepted: 10/02/2023] [Indexed: 11/02/2023]
Abstract
Discovery and optimization of monoclonal antibodies for therapeutic applications relies on large sequence libraries but is hindered by developability issues such as low solubility, high aggregation, and high immunogenicity. Generative language models, trained on millions of protein sequences, are a powerful tool for the on-demand generation of realistic, diverse sequences. We present the Immunoglobulin Language Model (IgLM), a deep generative language model for creating synthetic antibody libraries. Compared with prior methods that leverage unidirectional context for sequence generation, IgLM formulates antibody design based on text-infilling in natural language, allowing it to re-design variable-length spans within antibody sequences using bidirectional context. We trained IgLM on 558 million (M) antibody heavy- and light-chain variable sequences, conditioning on each sequence's chain type and species of origin. We demonstrate that IgLM can generate full-length antibody sequences from a variety of species and its infilling formulation allows it to generate infilled complementarity-determining region (CDR) loop libraries with improved in silico developability profiles. A record of this paper's transparent peer review process is included in the supplemental information.
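The text-infilling idea mentioned in this abstract can be illustrated with a toy example: a span is cut out of the sequence, replaced by a mask token, and appended after a separator, so that a left-to-right model has already seen the residues on both sides of the gap before it generates the span. The sketch below only builds such a training sequence; the token strings (`[HEAVY]`, `[HUMAN]`, `[MASK]`, `[SEP]`, `[ANS]`), the function name, and the example residues are assumptions for illustration and should not be read as IgLM's actual tokenizer or API.

```python
def make_infilling_example(
    residues: str,
    start: int,
    end: int,
    chain: str = "[HEAVY]",     # hypothetical chain-type conditioning tag
    species: str = "[HUMAN]",   # hypothetical species conditioning tag
) -> list[str]:
    """Build one infilling sequence: tags, masked sequence, separator, span, end marker."""
    if not 0 <= start < end <= len(residues):
        raise ValueError("span must be a non-empty slice of the sequence")
    span = list(residues[start:end])
    # Replace the whole span with a single mask token, whatever its length.
    masked = list(residues[:start]) + ["[MASK]"] + list(residues[end:])
    return [chain, species] + masked + ["[SEP]"] + span + ["[ANS]"]


example = make_infilling_example("EVQLVESGG", start=3, end=6)
print(" ".join(example))
# prints: [HEAVY] [HUMAN] E V Q [MASK] S G G [SEP] L V E [ANS]
```

Because one mask token stands in for a span of any length, the same construction covers the variable-length redesign the abstract describes.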
Affiliation(s)
- Richard W Shuai
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
| | - Jeffrey A Ruffolo
- Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, USA
| | - Jeffrey J Gray
- Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, USA; Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, USA.
| |
|
44
|
Nijkamp E, Ruffolo JA, Weinstein EN, Naik N, Madani A. ProGen2: Exploring the boundaries of protein language models. Cell Syst 2023; 14:968-978.e3. [PMID: 37909046 DOI: 10.1016/j.cels.2023.10.002] [Citation(s) in RCA: 61] [Impact Index Per Article: 30.5] [Received: 01/12/2023] [Revised: 05/01/2023] [Accepted: 10/02/2023] [Indexed: 11/02/2023]
Abstract
Attention-based models trained on protein sequences have demonstrated incredible success at classification and generation tasks relevant for artificial-intelligence-driven protein design. However, we lack a sufficient understanding of how very large-scale models and data play a role in effective protein model development. We introduce a suite of protein language models, named ProGen2, that are scaled up to 6.4B parameters and trained on different sequence datasets drawn from over a billion proteins from genomic, metagenomic, and immune repertoire databases. ProGen2 models show state-of-the-art performance in capturing the distribution of observed evolutionary sequences, generating novel viable sequences, and predicting protein fitness without additional fine-tuning. As large model sizes and raw numbers of protein sequences continue to become more widely accessible, our results suggest that a growing emphasis needs to be placed on the data distribution provided to a protein sequence model. Our models and code are open sourced for widespread adoption in protein engineering. A record of this paper's Transparent Peer Review process is included in the supplemental information.
Affiliation(s)
| | - Jeffrey A Ruffolo
- Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, USA; Profluent Bio, Berkeley, CA, USA
| | - Eli N Weinstein
- Data Science Institute, Columbia University, New York, NY, USA
| | | | - Ali Madani
- Salesforce Research, Palo Alto, CA, USA; Profluent Bio, Berkeley, CA, USA.
| |
|
45
|
Erasmus MF, Ferrara F, D'Angelo S, Spector L, Leal-Lopes C, Teixeira AA, Sørensen J, Nagpal S, Perea-Schmittle K, Choudhary A, Honnen W, Calianese D, Antonio Rodriguez Carnero L, Cocklin S, Greiff V, Pinter A, Bradbury ARM. Insights into next generation sequencing guided antibody selection strategies. Sci Rep 2023; 13:18370. [PMID: 37884618 PMCID: PMC10603065 DOI: 10.1038/s41598-023-45538-w] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 08/09/2023] [Accepted: 10/20/2023] [Indexed: 10/28/2023] Open
Abstract
Therapeutic antibody discovery often relies on in-vitro display methods to identify lead candidates. Assessing selected output diversity traditionally involves random colony picking and Sanger sequencing, which has limitations. Next-generation sequencing (NGS) offers a cost-effective solution with increased read depth, allowing a comprehensive understanding of diversity. Our study establishes NGS guidelines for antibody drug discovery, demonstrating its advantages in expanding the number of unique HCDR3 clusters, broadening the number of high affinity antibodies, expanding the total number of antibodies recognizing different epitopes, and improving lead prioritization. Surprisingly, our investigation into the correlation between NGS-derived frequencies of CDRs and affinity revealed a lack of association, although this limitation could be moderately mitigated by leveraging NGS clustering, enrichment and/or relative abundance across different regions to enhance lead prioritization. This study highlights NGS benefits, offering insights, recommendations, and the most effective approach to leverage NGS in therapeutic antibody discovery.
Affiliation(s)
| | | | - Sara D'Angelo
- Specifica LLC, a Q2 Solutions Company, Santa Fe, USA
| | - Laura Spector
- Specifica LLC, a Q2 Solutions Company, Santa Fe, USA
| | | | | | | | | | | | - Alok Choudhary
- Public Health Research Institute, New Jersey Medical School, Rutgers, The State University of New Jersey, Newark, NJ, 07103, USA
| | - William Honnen
- Public Health Research Institute, New Jersey Medical School, Rutgers, The State University of New Jersey, Newark, NJ, 07103, USA
| | - David Calianese
- Public Health Research Institute, New Jersey Medical School, Rutgers, The State University of New Jersey, Newark, NJ, 07103, USA
| | | | - Simon Cocklin
- Specifica LLC, a Q2 Solutions Company, Santa Fe, USA
| | | | - Abraham Pinter
- Public Health Research Institute, New Jersey Medical School, Rutgers, The State University of New Jersey, Newark, NJ, 07103, USA
| | | |
|
46
|
Zhou Y, Huang Z, Li W, Wei J, Jiang Q, Yang W, Huang J. Deep learning in preclinical antibody drug discovery and development. Methods 2023; 218:57-71. [PMID: 37454742 DOI: 10.1016/j.ymeth.2023.07.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 11/22/2022] [Revised: 03/20/2023] [Accepted: 07/10/2023] [Indexed: 07/18/2023] Open
Abstract
Antibody drugs have become a key part of biotherapeutics. Patients suffering from various diseases have benefited from antibody therapies. However, their development process is rather long, expensive and risky. To speed up the process, reduce cost and improve success rate, artificial intelligence, especially deep learning, has been widely used in all aspects of preclinical antibody drug development, from library generation to hit identification, developability screening, lead selection and optimization. In this review, we systematically summarize antibody encodings, deep learning architectures and models used in preclinical antibody drug discovery and development. We also critically discuss challenges and opportunities, problems and possible solutions, current applications and future directions of deep learning in antibody drug development.
Affiliation(s)
- Yuwei Zhou
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Ziru Huang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Wenzhen Li
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Jinyi Wei
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Qianhu Jiang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Wei Yang
- School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu 611731, China
| | - Jian Huang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu 611731, China.
| |
|
47
|
Bravi B, Di Gioacchino A, Fernandez-de-Cossio-Diaz J, Walczak AM, Mora T, Cocco S, Monasson R. A transfer-learning approach to predict antigen immunogenicity and T-cell receptor specificity. eLife 2023; 12:e85126. [PMID: 37681658 PMCID: PMC10522340 DOI: 10.7554/elife.85126] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/23/2022] [Accepted: 09/07/2023] [Indexed: 09/09/2023] Open
Abstract
Antigen immunogenicity and the specificity of binding of T-cell receptors to antigens are key properties underlying effective immune responses. Here we propose diffRBM, an approach based on transfer learning and Restricted Boltzmann Machines, to build sequence-based predictive models of these properties. DiffRBM is designed to learn the distinctive patterns in amino-acid composition that, on the one hand, underlie the antigen's probability of triggering a response, and on the other hand the T-cell receptor's ability to bind to a given antigen. We show that the patterns learnt by diffRBM allow us to predict putative contact sites of the antigen-receptor complex. We also discriminate immunogenic and non-immunogenic antigens, antigen-specific and generic receptors, reaching performances that compare favorably to existing sequence-based predictors of antigen immunogenicity and T-cell receptor specificity.
Affiliation(s)
- Barbara Bravi
- Department of Mathematics, Imperial College London, London, United Kingdom
- Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université Paris-Cité, Paris, France
| | - Andrea Di Gioacchino
- Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université Paris-Cité, Paris, France
| | - Jorge Fernandez-de-Cossio-Diaz
- Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université Paris-Cité, Paris, France
| | - Aleksandra M Walczak
- Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université Paris-Cité, Paris, France
| | - Thierry Mora
- Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université Paris-Cité, Paris, France
| | - Simona Cocco
- Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université Paris-Cité, Paris, France
| | - Rémi Monasson
- Laboratoire de Physique de l’Ecole Normale Supérieure, ENS, Université PSL, CNRS, Sorbonne Université, Université Paris-Cité, Paris, France
|
48
|
Makowski EK, Chen HT, Tessier PM. Simplifying complex antibody engineering using machine learning. Cell Syst 2023; 14:667-675. [PMID: 37591204 PMCID: PMC10733906 DOI: 10.1016/j.cels.2023.04.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 03/06/2023] [Accepted: 04/26/2023] [Indexed: 08/19/2023]
Abstract
Machine learning is transforming antibody engineering by enabling the generation of drug-like monoclonal antibodies with unprecedented efficiency. Unsupervised algorithms trained on massive and diverse protein sequence datasets facilitate the prediction of panels of antibody variants with native-like intrinsic properties (e.g., high stability), greatly reducing the amount of subsequent experimentation needed to identify specific candidates that also possess desired extrinsic properties (e.g., high affinity). Additionally, supervised algorithms, which are trained on deep sequencing datasets obtained after enrichment of in vitro antibody libraries for one or more specific extrinsic properties, enable the prediction of antibody variants with desired combinations of extrinsic properties without the need for additional screening. Here we review recent advances using both machine learning approaches and how they are impacting the field of antibody engineering as well as key outstanding challenges and opportunities for these paradigm-changing methods.
Affiliation(s)
- Emily K Makowski
- Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
| | - Hsin-Ting Chen
- Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA
| | - Peter M Tessier
- Department of Pharmaceutical Sciences, University of Michigan, Ann Arbor, MI 48109, USA; Department of Chemical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Department of Biomedical Engineering, University of Michigan, Ann Arbor, MI 48109, USA; Biointerfaces Institute, University of Michigan, Ann Arbor, MI 48109, USA.
|
49
|
Ruffolo JA, Chu LS, Mahajan SP, Gray JJ. Fast, accurate antibody structure prediction from deep learning on massive set of natural antibodies. Nat Commun 2023; 14:2389. [PMID: 37185622 PMCID: PMC10129313 DOI: 10.1038/s41467-023-38063-x] [Citation(s) in RCA: 86] [Impact Index Per Article: 43.0] [Received: 04/21/2022] [Accepted: 04/14/2023] [Indexed: 05/17/2023] Open
Abstract
Antibodies have the capacity to bind a diverse set of antigens, and they have become critical therapeutics and diagnostic molecules. The binding of antibodies is facilitated by a set of six hypervariable loops that are diversified through genetic recombination and mutation. Even with recent advances, accurate structural prediction of these loops remains a challenge. Here, we present IgFold, a fast deep learning method for antibody structure prediction. IgFold consists of a pre-trained language model trained on 558 million natural antibody sequences followed by graph networks that directly predict backbone atom coordinates. IgFold predicts structures of similar or better quality than alternative methods (including AlphaFold) in significantly less time (under 25 s). Accurate structure prediction on this timescale makes possible avenues of investigation that were previously infeasible. As a demonstration of IgFold's capabilities, we predicted structures for 1.4 million paired antibody sequences, providing structural insights to 500-fold more antibodies than have experimentally determined structures.
Affiliation(s)
- Jeffrey A Ruffolo
- Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Lee-Shin Chu
- Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Sai Pooja Mahajan
- Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, 21218, USA
| | - Jeffrey J Gray
- Program in Molecular Biophysics, The Johns Hopkins University, Baltimore, MD, 21218, USA.
- Department of Chemical and Biomolecular Engineering, The Johns Hopkins University, Baltimore, MD, 21218, USA.
|
50
|
Rosace A, Bennett A, Oeller M, Mortensen MM, Sakhnini L, Lorenzen N, Poulsen C, Sormanni P. Automated optimisation of solubility and conformational stability of antibodies and proteins. Nat Commun 2023; 14:1937. [PMID: 37024501 PMCID: PMC10079162 DOI: 10.1038/s41467-023-37668-6] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 10/22/2022] [Accepted: 03/24/2023] [Indexed: 04/08/2023] Open
Abstract
Biologics, such as antibodies and enzymes, are crucial in research, biotechnology, diagnostics, and therapeutics. Often, biologics with suitable functionality are discovered, but their development is impeded by developability issues. Stability and solubility are key biophysical traits underpinning developability potential, as they determine aggregation, correlate with production yield and poly-specificity, and are essential to access parenteral and oral delivery. While advances in the optimisation of individual traits have been made, the co-optimisation of multiple traits remains highly problematic and time-consuming, as mutations that improve one property often negatively impact others. In this work, we introduce a fully automated computational strategy for the simultaneous optimisation of conformational stability and solubility, which we experimentally validate on six antibodies, including two approved therapeutics. Our results on 42 designs demonstrate that the computational procedure is highly effective at improving developability potential, while not affecting antigen-binding. We make the method available as a webserver at www-cohsoftware.ch.cam.ac.uk.
Affiliation(s)
- Angelo Rosace
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield road, CB2 1EW, Cambridge, UK
- Master in Bioinformatics for Health Sciences, Universitat Pompeu Fabra, Barcelona, Catalonia, Spain
- Institute for Research in Biomedicine (IRB), The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
| | - Anja Bennett
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield road, CB2 1EW, Cambridge, UK
- Department of Mammalian Expression, Global Research Technologies, Novo Nordisk A/S, Novo Nordisk Park 1, 2760, Måløv, Denmark
- BRIC, Faculty of Health and Medical Sciences, University of Copenhagen, Ole Maaløes Vej 5, 2200, Copenhagen, Denmark
| | - Marc Oeller
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield road, CB2 1EW, Cambridge, UK
| | - Mie M Mortensen
- Department of Purification Technologies, Global Research Technologies, Novo Nordisk A/S, Novo Nordisk Park 1, 2760, Måløv, Denmark
- Faculty of Engineering and Science, Department of Biotechnology, Chemistry and Environmental Engineering, University of Aalborg, Fredrik Bajers Vej 7H, 9220, Aalborg, Denmark
| | - Laila Sakhnini
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield road, CB2 1EW, Cambridge, UK
- Department of Biophysics and Injectable Formulation 2, Global Research Technologies, Novo Nordisk A/S, Måløv, 2760, Denmark
| | - Nikolai Lorenzen
- Department of Biophysics and Injectable Formulation 2, Global Research Technologies, Novo Nordisk A/S, Måløv, 2760, Denmark
| | - Christian Poulsen
- Department of Mammalian Expression, Global Research Technologies, Novo Nordisk A/S, Novo Nordisk Park 1, 2760, Måløv, Denmark
| | - Pietro Sormanni
- Centre for Misfolding Diseases, Yusuf Hamied Department of Chemistry, University of Cambridge, Lensfield road, CB2 1EW, Cambridge, UK.
|