1. Zhang M, Yang Q, Lou J, Hu Y, Shi Y. A new strategy to HER2-specific antibody discovery through artificial intelligence-powered phage display screening based on the Trastuzumab framework. Biochim Biophys Acta Mol Basis Dis 2025; 1871:167772. [PMID: 40056877 DOI: 10.1016/j.bbadis.2025.167772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2024] [Revised: 02/23/2025] [Accepted: 02/28/2025] [Indexed: 03/10/2025]
Abstract
Human epidermal growth factor receptor 2 (HER2) is a recognized drug target, and it serves as a critical target for various cancer treatments, necessitating the discovery of more antibodies for therapeutic and detection purposes. Here, we have developed an innovative workflow for antibody generation through Artificial Intelligence-powered Phage Display Screening (AIPDS). This workflow integrates artificial intelligence-driven antibody CDRH3 sequence design, high-throughput DNA synthesis and phage display screening. We applied AIPDS workflow to generate promising antibodies against the human epidermal growth factor receptor 2 (HER2), offering a template for streamlined antibody generation. Seven novel antibodies stood out, demonstrating promising efficacy in various functional assays. Notably, DYHER2-02 demonstrates strong performance across all experimental tests. In summary, our study introduces a novel methodology to generate new antibody variants of an existing antibody using an AI-assisted phage display approach. These new antibody variants hold potential applications in research, diagnosis, and therapeutic applications.
Affiliation(s)
- Mancang Zhang
  - Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200030, People's Republic of China
- Qiangzhen Yang
  - Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200030, People's Republic of China
- Jiangrong Lou
  - Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200030, People's Republic of China
- Yang Hu
  - United Research Center for Next Generation DNA Synthesis of SJTU-Dynegene, Shanghai 201108, People's Republic of China
- Yongyong Shi
  - Bio-X Institutes, Key Laboratory for the Genetics of Developmental and Neuropsychiatric Disorders, Ministry of Education, Shanghai Jiao Tong University, Shanghai 200030, People's Republic of China
  - Institute of Neuroscience, Center for Excellence in Brain Science and Intelligence Technology, Chinese Academy of Sciences, Shanghai 200031, People's Republic of China
2. Ahmed FS, Aly S, El-Tabakh MAM, Liu X. NABP-LSTM-Att: Nanobody-Antigen binding prediction using bidirectional LSTM and soft attention mechanism. Comput Biol Chem 2025; 118:108490. [PMID: 40347542 DOI: 10.1016/j.compbiolchem.2025.108490] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2025] [Revised: 04/16/2025] [Accepted: 04/21/2025] [Indexed: 05/14/2025]
Abstract
In vertebrates, antibody-mediated immunity is a vital component of the immune system, and antibodies have become a rapidly expanding class of therapeutic agents. Nanobodies, a distinct type of antibody, have recently emerged as a stable and cost-effective alternative to traditional antibodies. Their small size, high target specificity, notable solubility, and stability make nanobodies promising candidates for developing high-quality drugs. However, the lack of available nanobodies for most antigens remains a key challenge. Advancing the development of nanobodies requires a better understanding of their interactions with antigens to enhance binding affinity and specificity. Experimental methods for identifying these interactions are essential but often costly and time-consuming, posing challenges for developing nanobody therapies. Although several computational approaches have been designed to screen potential nanobodies, their dependency on 3D structures limits their broad application. This research introduces NABP-LSTM-Att, a deep learning model designed to predict nanobody-antigen binding solely from sequence information. NABP-LSTM-Att leverages bidirectional long short-term memory (biLSTM) to capture both long- and short-term dependencies within nanobody and antigen sequences, combined with a soft attention mechanism to focus on key features. When evaluated on nanobody-antigen sequence pairs from the SAbDab-nano database, NABP-LSTM-Att achieved an AUROC of 0.926 and an AUPR of 0.952. Considering the significance of nanobody-based treatments and their prospective uses in immunotherapy and diagnostics, we believe that the proposed model will serve as an effective tool for predicting nanobody-antigen binding.
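The soft attention step mentioned in this abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `soft_attention_pool`, the dot-product scoring, and the `query` vector are assumptions chosen for illustration, and the hidden states are toy numbers standing in for biLSTM outputs.

```python
import math
from typing import List, Sequence, Tuple


def soft_attention_pool(
    hidden_states: Sequence[Sequence[float]],
    query: Sequence[float],
) -> Tuple[List[float], List[float]]:
    """Pool per-position hidden states into one vector with soft attention.

    hidden_states: one vector per sequence position (e.g. biLSTM outputs).
    query: a scoring vector of the same width (hypothetical; learned in practice).
    Returns (attention_weights, context_vector).
    """
    if not hidden_states:
        raise ValueError("hidden_states must not be empty")
    # Score each position by its dot product with the query vector.
    scores = [sum(h_j * q_j for h_j, q_j in zip(h, query)) for h in hidden_states]
    # Softmax with max-subtraction for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # The context vector is the attention-weighted sum of the hidden states.
    width = len(hidden_states[0])
    context = [
        sum(w * h[j] for w, h in zip(weights, hidden_states)) for j in range(width)
    ]
    return weights, context


# Toy input: three positions with 2-dimensional hidden states.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = soft_attention_pool(states, query=[2.0, 0.0])
```

The weights always sum to 1, so positions that score higher against the query contribute more to the pooled vector while no position is discarded outright.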
Affiliation(s)
- Fatma S Ahmed
  - Department of Computer Science and Technology, Xiamen University, Xiamen, 361005, China
  - Department of Electrical Engineering, Aswan University, Aswan, 81542, Egypt
- Saleh Aly
  - Department of Information Technology, Majmaah University, Majmaah, 11952, Saudi Arabia
- Xiangrong Liu
  - Department of Computer Science and Technology, Xiamen University, Xiamen, 361005, China
  - National Institute for Data Science in Health and Medicine, State Key Laboratory of Vaccines for Infectious Diseases, XiangAn Biomedicine Laboratory, Xiamen University, Xiamen, 361005, China
3. Tan Y, Zhou B, Zheng L, Fan G, Hong L. Semantical and geometrical protein encoding toward enhanced bioactivity and thermostability. eLife 2025; 13:RP98033. [PMID: 40314227 PMCID: PMC12048155 DOI: 10.7554/elife.98033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2025] [Open access]
Abstract
Protein engineering is a pivotal aspect of synthetic biology, involving the modification of amino acids within existing protein sequences to achieve novel or enhanced functionalities and physical properties. Accurate prediction of protein variant effects requires a thorough understanding of protein sequence, structure, and function. Deep learning methods have demonstrated remarkable performance in guiding protein modification for improved functionality. However, existing approaches predominantly rely on protein sequences, which face challenges in efficiently encoding the geometric aspects of amino acids' local environment and often fall short in capturing crucial details related to protein folding stability, internal molecular interactions, and bio-functions. Furthermore, there lacks a fundamental evaluation for developed methods in predicting protein thermostability, although it is a key physical property that is frequently investigated in practice. To address these challenges, this article introduces a novel pre-training framework that integrates sequential and geometric encoders for protein primary and tertiary structures. This framework guides mutation directions toward desired traits by simulating natural selection on wild-type proteins and evaluates variant effects based on their fitness to perform specific functions. We assess the proposed approach using three benchmarks comprising over 300 deep mutational scanning assays. The prediction results showcase exceptional performance across extensive experiments compared to other zero-shot learning methods, all while maintaining a minimal cost in terms of trainable parameters. This study not only proposes an effective framework for more accurate and comprehensive predictions to facilitate efficient protein engineering, but also enhances the in silico assessment system for future deep learning models to better align with empirical requirements. 
The PyTorch implementation is available at https://github.com/ai4protein/ProtSSN.
Affiliation(s)
- Yang Tan
  - Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, China
- Bingxin Zhou
  - Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Jiao Tong University, Institute of Natural Sciences, Shanghai, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai Jiao Tong University, Shanghai, China
- Lirong Zheng
  - Shanghai Jiao Tong University, Institute of Natural Sciences, Shanghai, China
- Guisheng Fan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
- Liang Hong
  - Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, China
  - Shanghai Jiao Tong University, Institute of Natural Sciences, Shanghai, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai Jiao Tong University, Shanghai, China
4. Livesey BJ, Marsh JA. Variant effect predictor correlation with functional assays is reflective of clinical classification performance. Genome Biol 2025; 26:104. [PMID: 40264194 PMCID: PMC12016141 DOI: 10.1186/s13059-025-03575-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2024] [Accepted: 04/11/2025] [Indexed: 04/24/2025] [Open access]
Abstract
BACKGROUND: Understanding the relationship between protein sequence and function is crucial for accurate classification of missense variants. Variant effect predictors (VEPs) play a vital role in deciphering this complex relationship, yet evaluating their performance remains challenging for several reasons, including data circularity, where the same or related data is used for training and assessment. High-throughput experimental strategies like deep mutational scanning (DMS) offer a promising solution.
RESULTS: In this study, we extend upon our previous benchmarking approach, assessing the performance of 97 VEPs using missense DMS measurements from 36 different human proteins. In addition, a new pairwise, VEP-centric approach mitigates the impact of missing predictions on overall performance comparison. We observe a strong correspondence between VEP performance in DMS-based benchmarks and clinical variant classification, especially for predictors that have not been directly trained on human clinical variants.
CONCLUSIONS: Our results suggest that comparing VEP performance against diverse functional assays represents a reliable strategy for assessing their relative performance in clinical variant classification. However, major challenges in clinical interpretation of VEP scores persist, highlighting the need for further research to fully leverage computational predictors for genetic diagnosis. We also address practical considerations for end users in terms of choice of methodology.
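One way to score a predictor against a functional assay is a rank correlation between its scores and the measured values. The sketch below assumes Spearman correlation as the statistic; the abstract does not name the statistic used, and the scores and fitness values are made-up numbers for illustration only.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next value ties with the current one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(x), average_ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up numbers: predictor scores and DMS fitness for five variants.
vep_scores = [0.91, 0.15, 0.48, 0.77, 0.30]
dms_fitness = [0.10, 0.95, 0.60, 0.35, 0.80]
rho = spearman(vep_scores, dms_fitness)
```

The sign of the result depends on whether a predictor scores damaging variants high or low, so taking the absolute value is one plausible way to compare predictors with opposite score directions.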
Affiliation(s)
- Benjamin J Livesey
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Joseph A Marsh
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
5. Sun Y, Tan W, Gu Z, He R, Chen S, Pang M, Yan B. A data-efficient strategy for building high-performing medical foundation models. Nat Biomed Eng 2025; 9:539-551. [PMID: 40044818 DOI: 10.1038/s41551-025-01365-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2024] [Accepted: 02/04/2025] [Indexed: 04/04/2025]
Abstract
Foundation models are pretrained on massive datasets. However, collecting medical datasets is expensive and time-consuming, and raises privacy concerns. Here we show that synthetic data generated via conditioning with disease labels can be leveraged for building high-performing medical foundation models. We pretrained a retinal foundation model, first with approximately one million synthetic retinal images with physiological structures and feature distribution consistent with real counterparts, and then with only 16.7% of the 904,170 real-world colour fundus photography images required in a recently reported retinal foundation model (RETFound). The data-efficient model performed as well or better than RETFound across nine public datasets and four diagnostic tasks; and for diabetic-retinopathy grading, it used only 40% of the expert-annotated training data used by RETFound. We also support the generalizability of the data-efficient strategy by building a classifier for the detection of tuberculosis on chest X-ray images. The text-conditioned generation of synthetic data may enhance the performance and generalization of medical foundation models.
Affiliation(s)
- Yuqi Sun
  - Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
- Weimin Tan
  - Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
- Zhuoyao Gu
  - Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
- Ruian He
  - Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
- Siyuan Chen
  - Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
- Miao Pang
  - Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
- Bo Yan
  - Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
6. Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic language models: opportunities and challenges. Trends Genet 2025; 41:286-302. [PMID: 39753409 DOI: 10.1016/j.tig.2024.11.013] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/16/2024] [Revised: 11/21/2024] [Accepted: 11/21/2024] [Indexed: 04/10/2025]
Abstract
Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of natural language processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic language models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
Affiliation(s)
- Gonzalo Benegas
  - Computer Science Division, University of California, Berkeley, CA, USA
- Chengzhong Ye
  - Department of Statistics, University of California, Berkeley, CA, USA
- Carlos Albors
  - Computer Science Division, University of California, Berkeley, CA, USA
- Jianan Canal Li
  - Computer Science Division, University of California, Berkeley, CA, USA
- Yun S Song
  - Computer Science Division, University of California, Berkeley, CA, USA
  - Department of Statistics, University of California, Berkeley, CA, USA
  - Center for Computational Biology, University of California, Berkeley, CA, USA
7. Dewaker V, Morya VK, Kim YH, Park ST, Kim HS, Koh YH. Revolutionizing oncology: the role of Artificial Intelligence (AI) as an antibody design, and optimization tools. Biomark Res 2025; 13:52. [PMID: 40155973 PMCID: PMC11954232 DOI: 10.1186/s40364-025-00764-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2025] [Accepted: 03/13/2025] [Indexed: 04/01/2025] [Open access]
Abstract
Antibodies play a crucial role in defending the human body against diseases, including life-threatening conditions like cancer. They mediate immune responses against foreign antigens and, in some cases, self-antigens. Over time, antibody-based technologies have evolved from monoclonal antibodies (mAbs) to chimeric antigen receptor T cells (CAR-T cells), significantly impacting biotechnology, diagnostics, and therapeutics. Although these advancements have enhanced therapeutic interventions, the integration of artificial intelligence (AI) is revolutionizing antibody design and optimization. This review explores recent AI advancements, including large language models (LLMs), diffusion models, and generative AI-based applications, which have transformed antibody discovery by accelerating de novo generation, enhancing immune response precision, and optimizing therapeutic efficacy. Through advanced data analysis, AI enables the prediction and design of antibody sequences, 3D structures, complementarity-determining regions (CDRs), paratopes, epitopes, and antigen-antibody interactions. These AI-powered innovations address longstanding challenges in antibody development, significantly improving speed, specificity, and accuracy in therapeutic design. By integrating computational advancements with biomedical applications, AI is driving next-generation cancer therapies, transforming precision medicine, and enhancing patient outcomes.
Affiliation(s)
- Varun Dewaker
  - Institute of New Frontier Research Team, Hallym University, Chuncheon-Si, Gangwon-Do, 24252, Republic of Korea
- Vivek Kumar Morya
  - Department of Orthopedic Surgery, Hallym University Dongtan Sacred Hospital, Hwaseong-Si, 18450, Republic of Korea
- Yoo Hee Kim
  - Department of Biomedical Gerontology, Ilsong Institute of Life Science, Hallym University, Seoul, 07247, Republic of Korea
- Sung Taek Park
  - Institute of New Frontier Research Team, Hallym University, Chuncheon-Si, Gangwon-Do, 24252, Republic of Korea
  - Department of Obstetrics and Gynecology, Kangnam Sacred-Heart Hospital, Hallym University Medical Center, Hallym University College of Medicine, Seoul, 07441, Republic of Korea
  - EIONCELL Inc, Chuncheon-Si, 24252, Republic of Korea
- Hyeong Su Kim
  - Institute of New Frontier Research Team, Hallym University, Chuncheon-Si, Gangwon-Do, 24252, Republic of Korea
  - Department of Internal Medicine, Division of Hemato-Oncology, Kangnam Sacred-Heart Hospital, Hallym University Medical Center, Hallym University College of Medicine, Seoul, 07441, Republic of Korea
  - EIONCELL Inc, Chuncheon-Si, 24252, Republic of Korea
- Young Ho Koh
  - Department of Biomedical Gerontology, Ilsong Institute of Life Science, Hallym University, Seoul, 07247, Republic of Korea
8. Orenbuch R, Shearer CA, Kollasch AW, Spinner HD, Hopf TA, van Niekerk L, Franceschi D, Dias M, Frazer J, Marks DS. Proteome-wide model for human disease genetics. medRxiv 2025:2023.11.27.23299062. [PMID: 38076790 PMCID: PMC10705666 DOI: 10.1101/2023.11.27.23299062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/19/2023]
Abstract
Identifying variants driving disease accelerates both genetic diagnosis and therapeutic development, but missense variants still present a bottleneck as their effects are less straightforward than truncations or nonsense mutations. While computational prediction methods are sufficiently accurate to be of clinical value for variants in known disease genes, they do not generalize well to other genes as the scores are not calibrated across the proteome [1-6]. To address this, we developed a deep generative model, popEVE, that combines evolutionary information with population sequence data [7] and achieves state-of-the-art performance on a suite of proteome-wide prediction tasks, without overestimating the prevalence of deleterious variants in the population. popEVE identifies 442 genes in a developmental disorder cohort [8], including evidence of 123 novel candidates, many without the need for cohort-wide enrichment. Candidate genes are functionally similar to known developmental disorder genes and case variants tend to fall in functionally important regions of these genes. Finally, we show that these findings can be reproduced from analysis of the patient exomes alone, demonstrating that popEVE provides a new avenue for genetic analysis in situations where traditional methods fail, including genetic diagnosis of rare-as-one diseases, even in the absence of parent sequencing.
9. Kilgore HR, Chinn I, Mikhael PG, Mitnikov I, Van Dongen C, Zylberberg G, Afeyan L, Banani S, Wilson-Hawken S, Lee TI, Barzilay R, Young RA. Protein codes promote selective subcellular compartmentalization. Science 2025; 387:1095-1101. [PMID: 39913643 PMCID: PMC12034300 DOI: 10.1126/science.adq2634] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/05/2024] [Revised: 11/07/2024] [Accepted: 01/28/2025] [Indexed: 02/12/2025]
Abstract
Cells have evolved mechanisms to distribute ~10 billion protein molecules to subcellular compartments where diverse proteins involved in shared functions must assemble. In this study, we demonstrate that proteins with shared functions share amino acid sequence codes that guide them to compartment destinations. We developed a protein language model, ProtGPS, that predicts with high performance the compartment localization of human proteins excluded from the training set. ProtGPS successfully guided generation of novel protein sequences that selectively assemble in the nucleolus. ProtGPS identified pathological mutations that change this code and lead to altered subcellular localization of proteins. Our results indicate that protein sequences contain not only a folding code but also a previously unrecognized code governing their distribution to diverse subcellular compartments.
Affiliation(s)
- Henry R. Kilgore
  - Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
- Itamar Chinn
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Peter G. Mikhael
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Ilan Mitnikov
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Guy Zylberberg
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Lena Afeyan
  - Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Salman Banani
  - Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
  - Department of Pathology, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
- Susana Wilson-Hawken
  - Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
  - Program of Computational & Systems Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Tong Ihn Lee
  - Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
- Regina Barzilay
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Richard A. Young
  - Whitehead Institute for Biomedical Research, Cambridge, MA 02142, USA
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
10. Sun Y, Shen Y. Structure-informed protein language models are robust predictors for variant effects. Hum Genet 2025; 144:209-225. [PMID: 39117802 PMCID: PMC12068927 DOI: 10.1007/s00439-024-02695-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2023] [Accepted: 07/20/2024] [Indexed: 08/10/2024]
Abstract
Emerging variant effect predictors, protein language models (pLMs) learn evolutionary distribution of functional sequences to capture fitness landscape. Considering that variant effects are manifested through biological contexts beyond sequence (such as structure), we first assess how much structure context is learned in sequence-only pLMs and affecting variant effect prediction. And we establish a need to inject into pLMs protein structural context purposely and controllably. We thus introduce a framework of structure-informed pLMs (SI-pLMs), by extending masked sequence denoising to cross-modality denoising for both sequence and structure. Numerical results over deep mutagenesis scanning benchmarks show that our SI-pLMs, even when using smaller models and less data, are robustly top performers against competing methods including other pLMs, which shows that introducing biological context can be more effective at capturing fitness landscape than simply using larger models or bigger data. Case studies reveal that, compared to sequence-only pLMs, SI-pLMs can be better at capturing fitness landscape because (a) learned embeddings of low/high-fitness sequences can be more separable and (b) learned amino-acid distributions of functionally and evolutionarily conserved residues can be of much lower entropy, thus much more conserved, than other residues. Our SI-pLMs are applicable to revising any sequence-only pLMs through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and only use structures as context provider and model regularizer during training.
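The statement that conserved residues have lower-entropy amino-acid distributions can be made concrete with a short sketch. The two distributions below are hypothetical values chosen for illustration, not outputs of the SI-pLMs described in the abstract, and `shannon_entropy` is a helper name introduced here.

```python
import math
from typing import Sequence


def shannon_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in bits of one probability distribution."""
    if any(p < 0 for p in probs) or abs(sum(probs) - 1.0) > 1e-6:
        raise ValueError("probs must be non-negative and sum to 1")
    # Terms with p == 0 contribute nothing (the limit of p*log p is 0).
    return -sum(p * math.log2(p) for p in probs if p > 0)


# Hypothetical predicted distributions over the 20 amino acids at two positions.
variable_site = [1 / 20] * 20               # no preference: maximal entropy
conserved_site = [0.97] + [0.03 / 19] * 19  # one residue dominates

h_variable = shannon_entropy(variable_site)
h_conserved = shannon_entropy(conserved_site)
```

A uniform distribution over 20 amino acids gives the maximum of log2(20) bits, while a distribution concentrated on one residue falls well below that, which is the sense in which a low-entropy position is "more conserved".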
Affiliation(s)
- Yuanfei Sun
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, 77843, Texas, USA
- Yang Shen
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, 77843, Texas, USA
  - Department of Computer Science and Engineering, Texas A&M University, College Station, 77843, Texas, USA
  - Institute of Biosciences and Technology and Department of Translational Medical Sciences, Texas A&M University, Houston, 77030, Texas, USA
11. Zhao Y, Tang Y, Zhang P. Nonequilibrium statistical mechanics revealed by Doob h transform and variational autoregressive networks. Phys Rev E 2025; 111:034120. [PMID: 40247533 DOI: 10.1103/physreve.111.034120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2024] [Accepted: 02/24/2025] [Indexed: 04/19/2025]
Abstract
The nonequilibrium dynamics of stochastic systems are typically characterized by the joint probability distribution and its time evolution within configuration space or through trajectory ensembles. Within the framework of trajectory ensembles, processes occurring on exponentially rare spatio-temporal scales are of particular interest. Recent research has demonstrated that Doob dynamics is a highly effective method for sampling these rare trajectories. Most existing methods directly sample specific transition trajectories within the trajectory space. However, they fundamentally lack the capacity to resolve the time evolution of configuration-space probability distributions associated with these rare trajectories. In this study, we demonstrate how to construct Doob dynamics by approximating the leading eigenstate of the tilted generator using the variational autoregressive networks (VAN) ansatz. This approach enables us to sample the time-evolution information of the probability distribution associated with rare trajectories and to compute the large deviation statistics. We apply our methodology to two typical lattice models: the East and the Fredrickson-Andersen models, to sample large deviation statistics and probability distributions in one-dimensional and two-dimensional scenarios, respectively. Finally, we discuss the limitations of our method and propose potential solutions.
Affiliation(s)
- Yixin Zhao
  - School of Fundamental Physics and Mathematical Sciences, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China
  - Institute of Theoretical Physics, CAS Key Laboratory for Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
  - University of Chinese Academy of Sciences, Beijing 100190, China
- Ying Tang
  - University of Electronic Science and Technology of China, Institute of Fundamental and Frontier Sciences, Chengdu 611731, China
  - University of Electronic Science and Technology of China, Key Laboratory of Quantum Physics and Photonic Quantum Information, Ministry of Education, Chengdu 611731, China
- Pan Zhang
  - School of Fundamental Physics and Mathematical Sciences, Hangzhou Institute for Advanced Study, UCAS, Hangzhou 310024, China
  - Institute of Theoretical Physics, CAS Key Laboratory for Theoretical Physics, Chinese Academy of Sciences, Beijing 100190, China
12. Johnson SR, Fu X, Viknander S, Goldin C, Monaco S, Zelezniak A, Yang KK. Computational scoring and experimental evaluation of enzymes generated by neural networks. Nat Biotechnol 2025; 43:396-405. [PMID: 38653796 PMCID: PMC11919684 DOI: 10.1038/s41587-024-02214-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 03/20/2024] [Indexed: 04/25/2024]
Abstract
In recent years, generative protein sequence models have been developed to sample novel sequences. However, predicting whether generated proteins will fold and function remains challenging. We evaluate a set of 20 diverse computational metrics to assess the quality of enzyme sequences produced by three contrasting generative models: ancestral sequence reconstruction, a generative adversarial network and a protein language model. Focusing on two enzyme families, we expressed and purified over 500 natural and generated sequences with 70-90% identity to the most similar natural sequences to benchmark computational metrics for predicting in vitro enzyme activity. Over three rounds of experiments, we developed a computational filter that improved the rate of experimental success by 50-150%. The proposed metrics and models will drive protein engineering research by serving as a benchmark for generative protein sequence models and helping to select active variants for experimental testing.
Affiliation(s)
- Xiaozhi Fu
- Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
- Sandra Viknander
- Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
- Clara Goldin
- Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden
- Aleksej Zelezniak
- Department of Life Sciences, Chalmers University of Technology, Gothenburg, Sweden.
- Institute of Biotechnology, Life Sciences Centre, Vilnius University, Vilnius, Lithuania.
- Randall Centre for Cell & Molecular Biophysics, King's College London, Guy's Campus, London, UK.
|
13
|
Gu J, Mu W, Xu Y, Nie Y. From discovery to application: Enabling technology-based optimizing carbonyl reductases biocatalysis for active pharmaceutical ingredient synthesis. Biotechnol Adv 2025; 79:108496. [PMID: 39647674 DOI: 10.1016/j.biotechadv.2024.108496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2024] [Revised: 10/04/2024] [Accepted: 11/30/2024] [Indexed: 12/10/2024]
Abstract
The catalytic conversion of chiral alcohols and corresponding carbonyl compounds by carbonyl reductases (alcohol dehydrogenases), which are NAD(P) or NAD(P)H-dependent oxidoreductases, has attracted considerable attention. However, existing carbonyl reductases are insufficient to meet the demands of diverse industrial applications; hence, new enzymes with functions that can expand the toolbox of biocatalysts are urgently required. Developing precisely controlled chiral biocatalysts is of great significance for the efficient development of a broad spectrum of active pharmaceutical ingredients via biosynthesis. In this review, we summarized methods for discovering novel natural carbonyl reductases from various perspectives. Furthermore, advances in protein engineering, utilizing known sequence and structural information as well as catalytic dynamics mechanisms to improve potential functions, are also addressed. The exponential growth in data-driven tools over the past decade has made it possible to de novo design carbonyl reductases. Additionally, various applications of these high-performance carbonyl reductases and different strategies for coenzyme regeneration involving photocatalysis during the reaction process were reviewed. These advancements will bring new opportunities and challenges to the fields of green chemistry and biosynthesis in the future.
Affiliation(s)
- Jie Gu
- Lab of Brewing Microbiology and Applied Enzymology, School of Biotechnology and Key laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi 214122, China; School of Food Science and Technology, Jiangnan University, Wuxi 214122, China
- Wanmeng Mu
- School of Food Science and Technology, Jiangnan University, Wuxi 214122, China; State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi 214122, China
- Yan Xu
- Lab of Brewing Microbiology and Applied Enzymology, School of Biotechnology and Key laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi 214122, China; State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi 214122, China
- Yao Nie
- Lab of Brewing Microbiology and Applied Enzymology, School of Biotechnology and Key laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi 214122, China.
|
14
|
He XH, Li JR, Xu J, Shan H, Shen SY, Gao SH, Xu HE. AI-driven antibody design with generative diffusion models: current insights and future directions. Acta Pharmacol Sin 2025; 46:565-574. [PMID: 39349764 PMCID: PMC11845702 DOI: 10.1038/s41401-024-01380-y] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 04/26/2024] [Accepted: 08/15/2024] [Indexed: 02/23/2025]
Abstract
Therapeutic antibodies are at the forefront of biotherapeutics, valued for their high target specificity and binding affinity. Despite their potential, optimizing antibodies for superior efficacy presents significant challenges in both monetary and time costs. Recent strides in computational and artificial intelligence (AI), especially generative diffusion models, have begun to address these challenges, offering novel approaches for antibody design. This review delves into specific diffusion-based generative methodologies tailored for antibody design tasks, de novo antibody design, and optimization of complementarity-determining region (CDR) loops, along with their evaluation metrics. We aim to provide an exhaustive overview of this burgeoning field, making it an essential resource for leveraging diffusion-based generative models in antibody design endeavors.
Affiliation(s)
- Xin-Heng He
- State Key Laboratory of Drug Research and CAS Key Laboratory of Receptor Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
- Jun-Rui Li
- State Key Laboratory of Drug Research and CAS Key Laboratory of Receptor Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
- James Xu
- Cascade Pharma, Shanghai, 201318, China
- Hong Shan
- State Key Laboratory of Drug Research and CAS Key Laboratory of Receptor Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
- Shi-Yi Shen
- State Key Laboratory of Drug Research and CAS Key Laboratory of Receptor Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China
- University of Chinese Academy of Sciences, Beijing, 100049, China
- Si-Han Gao
- School of Pharmacy, Fudan University, Shanghai, 201203, China
- H Eric Xu
- State Key Laboratory of Drug Research and CAS Key Laboratory of Receptor Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, 201203, China.
- University of Chinese Academy of Sciences, Beijing, 100049, China.
- Cascade Pharma, Shanghai, 201318, China.
|
15
|
Carbone A, Decelle A, Rosset L, Seoane B. Fast and Functional Structured Data Generators Rooted in Out-of-Equilibrium Physics. IEEE Trans Pattern Anal Mach Intell 2025; 47:1309-1316. [PMID: 39527442 DOI: 10.1109/tpami.2024.3495999] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2024]
Abstract
In this study, we address the challenge of using energy-based models to produce high-quality, label-specific data in complex structured datasets, such as population genetics, RNA or protein sequences data. Traditional training methods encounter difficulties due to inefficient Markov chain Monte Carlo mixing, which affects the diversity of synthetic data and increases generation times. To address these issues, we use a novel training algorithm that exploits non-equilibrium effects. This approach, applied to the Restricted Boltzmann Machine, improves the model's ability to correctly classify samples and generate high-quality synthetic data in only a few sampling steps. The effectiveness of this method is demonstrated by its successful application to five different types of data: handwritten digits, mutations of human genomes classified by continental origin, functionally characterized sequences of an enzyme protein family, homologous RNA sequences from specific taxonomies and real classical piano pieces classified by their composer.
|
16
|
Elkin ME, Zhu X. Paying attention to the SARS-CoV-2 dialect: a deep neural network approach to predicting novel protein mutations. Commun Biol 2025; 8:98. [PMID: 39838059 PMCID: PMC11751191 DOI: 10.1038/s42003-024-07262-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2024] [Accepted: 11/13/2024] [Indexed: 01/23/2025] Open
Abstract
Predicting novel mutations has long-lasting impacts on life science research. Traditionally, this problem is addressed through wet-lab experiments, which are often expensive and time consuming. The recent advancement in neural language models has provided stunning results in modeling and deciphering sequences. In this paper, we propose a Deep Novel Mutation Search (DNMS) method, using deep neural networks, to model protein sequence for mutation prediction. We use SARS-CoV-2 spike protein as the target and use a protein language model to predict novel mutations. Different from existing research which is often limited to mutating the reference sequence for prediction, we propose a parent-child mutation prediction paradigm where a parent sequence is modeled for mutation prediction. Because mutations introduce changing context to the underlying sequence, DNMS models three aspects of the protein sequences: semantic changes, grammatical changes, and attention changes, each modeling protein sequence aspects from shifting of semantics, grammar coherence, and amino-acid interactions in latent space. A ranking approach is proposed to combine all three aspects to capture mutations demonstrating evolving traits, in accordance with real-world SARS-CoV-2 spike protein sequence evolution. DNMS can be adopted for an early warning variant detection system, creating public health awareness of future SARS-CoV-2 mutations.
Affiliation(s)
- Magdalyn E Elkin
- Dept. Electrical Engineering and Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL, 33431, USA.
- Xingquan Zhu
- Dept. Electrical Engineering and Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL, 33431, USA.
|
17
|
Li Y, Li F, Duan Z, Liu R, Jiao W, Wu H, Zhu F, Xue W. SYNBIP 2.0: epitopes mapping, sequence expansion and scaffolds discovery for synthetic binding protein innovation. Nucleic Acids Res 2025; 53:D595-D603. [PMID: 39413165 PMCID: PMC11701522 DOI: 10.1093/nar/gkae893] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 08/19/2024] [Revised: 09/18/2024] [Accepted: 09/26/2024] [Indexed: 10/18/2024] Open
Abstract
Synthetic binding proteins (SBPs) represent a pivotal class of artificially engineered proteins, meticulously crafted to exhibit targeted binding properties and specific functions. Here, the SYNBIP database, a comprehensive resource for SBPs, has been significantly updated. These enhancements include (i) featuring 3D structures of 899 SBP-target complexes to illustrate the binding epitopes of SBPs, (ii) using the structures of SBPs in the monomer or complex forms with target proteins, their sequence space has been expanded five times to 12 025 by integrating a structure-based protein generation framework and a protein property prediction tool, (iii) offering detailed information on 78 473 newly identified SBP-like scaffolds from the RCSB Protein Data Bank, and an additional 16 401 555 ones from the AlphaFold Protein Structure Database, and (iv) the database is regularly updated, incorporating 153 new SBPs. Furthermore, the structural models of all SBPs have been enhanced through the application of the AlphaFold2, with their clinical statuses concurrently refreshed. Additionally, the design methods employed for each SBP are now prominently featured in the database. In sum, SYNBIP 2.0 is designed to provide researchers with essential SBP data, facilitating their innovation in research, diagnosis and therapy. SYNBIP 2.0 is now freely accessible at https://idrblab.org/synbip/.
Affiliation(s)
- Yanlin Li
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, No. 55 South University Town Road, High-tech Zone, Chongqing 401331, China
- Fengcheng Li
- Children’s Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, 3333 Binsheng Road, Hangzhou, Zhejiang 310052, China
- College of Pharmaceutical Sciences, Zhejiang University, 866 Yuhangtang Road, Hangzhou, Zhejiang 310058, China
- Zixin Duan
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, No. 55 South University Town Road, High-tech Zone, Chongqing 401331, China
- Ruihan Liu
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, No. 55 South University Town Road, High-tech Zone, Chongqing 401331, China
- Wantong Jiao
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, No. 55 South University Town Road, High-tech Zone, Chongqing 401331, China
- Haibo Wu
- School of Life Sciences, Chongqing University, No. 55 South University Town Road, High-tech Zone, Chongqing 401331, China
- Feng Zhu
- College of Pharmaceutical Sciences, Zhejiang University, 866 Yuhangtang Road, Hangzhou, Zhejiang 310058, China
- Weiwei Xue
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, No. 55 South University Town Road, High-tech Zone, Chongqing 401331, China
|
18
|
Wang W, Shuai Y, Zeng M, Fan W, Li M. DPFunc: accurately predicting protein function via deep learning with domain-guided structure information. Nat Commun 2025; 16:70. [PMID: 39746897 PMCID: PMC11697396 DOI: 10.1038/s41467-024-54816-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Accepted: 11/21/2024] [Indexed: 01/04/2025] Open
Abstract
Computational methods for predicting protein function are of great significance in understanding biological mechanisms and treating complex diseases. However, existing computational approaches of protein function prediction lack interpretability, making it difficult to understand the relations between protein structures and functions. In this study, we propose a deep learning-based solution, named DPFunc, for accurate protein function prediction with domain-guided structure information. DPFunc can detect significant regions in protein structures and accurately predict corresponding functions under the guidance of domain information. It outperforms current state-of-the-art methods and achieves a significant improvement over existing structure-based methods. Detailed analyses demonstrate that the guidance of domain information contributes to DPFunc for protein function prediction, enabling our method to detect key residues or regions in protein structures, which are closely related to their functions. In summary, DPFunc serves as an effective tool for large-scale protein function prediction, which pushes the border of protein understanding in biological systems.
Affiliation(s)
- Wenkang Wang
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Yunyan Shuai
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Min Zeng
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Wei Fan
- Nuffield Department of Women's and Reproductive Health, University of Oxford, Oxford, OX3 9DU, UK
- Min Li
- School of Computer Science and Engineering, Central South University, Changsha, 410083, China.
|
19
|
Lobzaev E, Herrera MA, Kasprzyk M, Stracquadanio G. Protein engineering using variational free energy approximation. Nat Commun 2024; 15:10447. [PMID: 39617781 PMCID: PMC11609274 DOI: 10.1038/s41467-024-54814-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2024] [Accepted: 11/20/2024] [Indexed: 05/17/2025] Open
Abstract
Engineering proteins is a challenging task requiring the exploration of a vast design space. Traditionally, this is achieved using Directed Evolution (DE), which is a laborious process. Generative deep learning, instead, can learn biological features of functional proteins from sequence and structural datasets and return novel variants. However, most models do not generate thermodynamically stable proteins, thus leading to many non-functional variants. Here we propose a model called PRotein Engineering by Variational frEe eNergy approximaTion (PREVENT), which generates stable and functional variants by learning the sequence and thermodynamic landscape of a protein. We evaluate PREVENT by designing 40 variants of the conditionally essential E. coli phosphotransferase N-acetyl-L-glutamate kinase (EcNAGK). We find 85% of the variants to be functional, with 55% of them showing similar growth rate compared to the wildtype enzyme, despite harbouring up to 9 mutations. Our results support a new approach that can significantly accelerate protein engineering.
Affiliation(s)
- Evgenii Lobzaev
- School of Biological Sciences, The University of Edinburgh, Edinburgh, United Kingdom
- School of Informatics, The University of Edinburgh, Edinburgh, United Kingdom
- Michael A Herrera
- School of Biological Sciences, The University of Edinburgh, Edinburgh, United Kingdom
- Martyna Kasprzyk
- School of Biological Sciences, The University of Edinburgh, Edinburgh, United Kingdom
|
20
|
Jiang R, Yue Z, Shang L, Wang D, Wei N. PEZy-miner: An artificial intelligence driven approach for the discovery of plastic-degrading enzyme candidates. Metab Eng Commun 2024; 19:e00248. [PMID: 39310048 PMCID: PMC11414552 DOI: 10.1016/j.mec.2024.e00248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2024] [Revised: 07/14/2024] [Accepted: 09/03/2024] [Indexed: 09/25/2024] Open
Abstract
Plastic waste has caused a global environmental crisis. Biocatalytic depolymerization mediated by enzymes has emerged as an efficient and sustainable alternative for plastic treatment and recycling. However, it is challenging and time-consuming to discover novel plastic-degrading enzymes using conventional cultivation-based or omics methods. There is a growing interest in developing effective computational methods to identify new enzymes with desirable plastic degradation functionalities by exploring the ever-increasing databases of protein sequences. In this study, we designed an innovative machine learning-based framework, named PEZy-Miner, to mine for enzymes with high potential in degrading plastics of interest. Two datasets integrating information from experimentally verified enzymes and homologs with unknown plastic-degrading activity were created respectively, covering eleven types of plastic substrates. Protein language models and binary classification models were developed to predict enzymatic degradation of plastics along with confidence and uncertainty estimation. PEZy-Miner exhibited high prediction accuracy and stability when validated on experimentally verified enzymes. Furthermore, by masking the experimentally verified enzymes and blending them into homolog dataset, PEZy-Miner effectively concentrated the experimentally verified entries by 14∼30 times while shortlisting promising plastic-degrading enzyme candidates. We applied PEZy-Miner to 0.1 million putative sequences, out of which 27 new sequences were identified with high confidence. This study provided a new computational tool for mining and recommending promising new plastic-degrading enzymes.
Affiliation(s)
- Renjing Jiang
- Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign, Urbana, IL, 61801, United States
- Zhenrui Yue
- School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, 61820, United States
- Lanyu Shang
- School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, 61820, United States
- Dong Wang
- School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL, 61820, United States
- Na Wei
- Department of Civil and Environmental Engineering, University of Illinois Urbana-Champaign, Urbana, IL, 61801, United States
|
21
|
Chen M, Mei S, Fan J, Wang M. Opportunities and challenges of diffusion models for generative AI. Natl Sci Rev 2024; 11:nwae348. [PMID: 39554240 PMCID: PMC11562846 DOI: 10.1093/nsr/nwae348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2024] [Revised: 07/03/2024] [Accepted: 07/07/2024] [Indexed: 11/19/2024] Open
Abstract
Diffusion models, a powerful and universal generative artificial intelligence technology, have achieved tremendous success and opened up new possibilities in diverse applications. In these applications, diffusion models provide flexible high-dimensional data modeling, and act as a sampler for generating new samples under active control towards task-desired properties. Despite the significant empirical success, theoretical underpinnings of diffusion models are very limited, potentially slowing down principled methodological innovations for further harnessing and improving diffusion models. In this paper, we review emerging applications of diffusion models to highlight their sample generation capabilities under various control goals. At the same time, we dive into the unique working flow of diffusion models through the lens of stochastic processes. We identify theoretical challenges in analyzing diffusion models, owing to their complicated training procedure and interaction with the underlying data distribution. To address these challenges, we overview several promising advances, demonstrating diffusion models as an efficient distribution learner and a sampler. Furthermore, we introduce a new avenue in high-dimensional structured optimization through diffusion models, where searching for solutions is reformulated as a conditional sampling problem and solved by diffusion models. Lastly, we discuss future directions about diffusion models. The purpose of this paper is to provide a well-rounded exposure for stimulating forward-looking theories and methods of diffusion models.
Affiliation(s)
- Minshuo Chen
- Department of Electrical and Computer Engineering, Princeton University, Princeton 08544, USA
- Song Mei
- Department of Statistics, University of California, Berkeley, Berkeley 94720, USA
- Jianqing Fan
- Department of Operations Research and Financial Engineering, Princeton University, Princeton 08544, USA
- Mengdi Wang
- Department of Electrical and Computer Engineering, Princeton University, Princeton 08544, USA
|
22
|
Ahmed FS, Aly S, Liu X. NABP-BERT: NANOBODY®-antigen binding prediction based on bidirectional encoder representations from transformers (BERT) architecture. Brief Bioinform 2024; 26:bbae518. [PMID: 39688476 DOI: 10.1093/bib/bbae518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/24/2024] [Revised: 08/23/2024] [Accepted: 12/10/2024] [Indexed: 12/18/2024] Open
Abstract
Antibody-mediated immunity is crucial in the vertebrate immune system. Nanobodies, also known as VHH or single-domain antibodies (sdAbs), are emerging as promising alternatives to full-length antibodies due to their compact size, precise target selectivity, and stability. However, the limited availability of nanobodies (Nbs) for numerous antigens (Ags) presents a significant obstacle to their widespread application. Understanding the interactions between Nbs and Ags is essential for enhancing their binding affinities and specificities. Experimental identification of these interactions is often costly and time-intensive. To address this issue, we introduce NABP-BERT, a deep-learning model based on the BERT architecture, designed to predict NANOBODY®-Ag binding solely from sequence information. Furthermore, we have developed a general pretrained model with transfer capabilities suitable for protein-related tasks, including protein-protein interaction tasks. NABP-BERT focuses on the surrounding amino acid contexts and outperforms existing methods, achieving an AUROC of 0.986 and an AUPR of 0.985.
Affiliation(s)
- Fatma S Ahmed
- Department of Computer Science and Technology, Xiamen University, Xiamen 361005, China
- Department of Electrical Engineering, Aswan University, Aswan 81542, Egypt
- Saleh Aly
- Department of Information Technology, Majmaah University, Majmaah 11952, Saudi Arabia
- Xiangrong Liu
- Department of Computer Science and Technology, Xiamen University, Xiamen 361005, China
|
23
|
Praljak N, Yeh H, Moore M, Socolich M, Ranganathan R, Ferguson AL. Natural Language Prompts Guide the Design of Novel Functional Protein Sequences. bioRxiv 2024:2024.11.11.622734. [PMID: 39605414 PMCID: PMC11601239 DOI: 10.1101/2024.11.11.622734] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
The advent of natural language interaction with machines has ushered in new innovations in text-guided generation of images, audio, video, and more. In this arena, we introduce the Biological Multi-Modal Model (BioM3), a novel framework for designing functional proteins via natural language prompts. This framework integrates natural language with protein design through a three-stage process: aligning protein and text representations in a joint embedding space learned using contrastive learning, refinement of the text embeddings, and conditional generation of protein sequences via a discrete autoregressive diffusion model. BioM3 synthesizes protein sequences with detailed descriptions of the protein structure, lineage, and function from text annotations to enable the conditional generation of novel sequences with desired attributes through natural language prompts. We present in silico validation of the model predictions for subcellular localization prediction, reaction classification, remote homology detection, scaffold in-painting, and structural plausibility, and in vivo and in vitro experimental tests of natural language prompt-designed synthetic analogs of Src-homology 3 (SH3) domain proteins that mediate signaling in the Sho1 osmotic stress response pathway in baker's yeast. BioM3 possesses state-of-the-art performance in zero-shot prediction and homology detection tasks, and generates proteins with native-like tertiary folds and wild-type levels of experimentally assayed function.
|
24
|
Rix G, Williams RL, Hu VJ, Spinner H, Pisera AO, Marks DS, Liu CC. Continuous evolution of user-defined genes at 1 million times the genomic mutation rate. Science 2024; 386:eadm9073. [PMID: 39509492 PMCID: PMC11750425 DOI: 10.1126/science.adm9073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/12/2023] [Accepted: 09/10/2024] [Indexed: 11/15/2024]
Abstract
When nature evolves a gene over eons at scale, it produces a diversity of homologous sequences with patterns of conservation and change that contain rich structural, functional, and historical information about the gene. However, natural gene diversity accumulates slowly and likely excludes large regions of functional sequence space, limiting the information that is encoded and extractable. We introduce upgraded orthogonal DNA replication (OrthoRep) systems that radically accelerate the evolution of chosen genes under selection in yeast. When applied to a maladapted biosynthetic enzyme, we obtained collections of extensively diverged sequences with patterns that revealed structural and environmental constraints shaping the enzyme's activity. Our upgraded OrthoRep systems should support the discovery of factors influencing gene evolution, uncover previously unknown regions of fitness landscapes, and find broad applications in biomolecular engineering.
Affiliation(s)
- Gordon Rix
- Department of Molecular Biology and Biochemistry, University of California; Irvine, CA, 92617, USA
- Rory L. Williams
- Department of Biomedical Engineering, University of California; Irvine, CA, 92617, USA
- Vincent J. Hu
- Department of Biomedical Engineering, University of California; Irvine, CA, 92617, USA
- Han Spinner
- Department of Systems Biology, Harvard Medical School; Boston, MA, 02115, USA
- Debora S. Marks
- Department of Systems Biology, Harvard Medical School; Boston, MA, 02115, USA
- Broad Institute of Harvard and MIT; Cambridge, MA, 02142, USA
- Chang C. Liu
- Department of Molecular Biology and Biochemistry, University of California; Irvine, CA, 92617, USA
- Department of Biomedical Engineering, University of California; Irvine, CA, 92617, USA
- Department of Chemistry, University of California; Irvine, CA, 92617, USA
- Center for Synthetic Biology, University of California; Irvine, CA, 92617, USA
|
25
|
Bist PS, Tayara H, Chong KT. Generative AI in the Advancement of Viral Therapeutics for Predicting and Targeting Immune-Evasive SARS-CoV-2 Mutations. IEEE J Biomed Health Inform 2024; 28:6974-6982. [PMID: 39042543 DOI: 10.1109/jbhi.2024.3432649] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/25/2024]
Abstract
The emergence of immune-evasive mutations in the SARS-CoV-2 spike protein is consistently challenging existing vaccines and therapies, making precise prediction of their escape potential a critical imperative. Artificial Intelligence(AI) holds great promise for deciphering the intricate language of protein. Here, we employed a Generative Adversarial Network to decipher the hidden escape pathways within the spike protein by generating spikes that closely resemble natural ones. Through comprehensive analysis, we demonstrated that generated sequences capture natural escape characteristics. Moreover, incorporating these sequences into an AI-based escape prediction model significantly enhanced its performance, achieving a 7% increase in detecting natural escape mutations on the experimentally validated Greaney dataset. Similar improvements were observed on other datasets, demonstrating the model's generalizability. Precisely predicting immune-evasive spikes not only enables the design of strategically targeted therapies but also has the potential to expedite future viral therapeutics. This breakthrough carries profound implications for shaping a more resilient future against viral threats.
26
Turnbull OM, Oglic D, Croasdale-Wood R, Deane CM. p-IgGen: a paired antibody generative language model. Bioinformatics 2024; 40:btae659. [PMID: 39520401 DOI: 10.1093/bioinformatics/btae659] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2024] [Revised: 10/04/2024] [Accepted: 11/08/2024] [Indexed: 11/16/2024]
Abstract
SUMMARY: A key challenge in antibody drug discovery is designing novel sequences that are free from developability issues such as aggregation, polyspecificity, poor expression, or low solubility. Here, we present p-IgGen, a protein language model for paired heavy-light chain antibody generation. The model generates diverse, antibody-like sequences with pairing properties found in natural antibodies. We also create a finetuned version of p-IgGen that biases the model to generate antibodies with 3D biophysical properties that fall within distributions seen in clinical-stage therapeutic antibodies. AVAILABILITY AND IMPLEMENTATION: The model and inference code are freely available at www.github.com/oxpig/p-IgGen. Cleaned training data are deposited at doi.org/10.5281/zenodo.13880874.
Affiliation(s)
- Oliver M Turnbull
  - Department of Statistics, University of Oxford, Oxford, OX1 3LB, United Kingdom
- Dino Oglic
  - Centre for AI, Biopharmaceuticals R&D, AstraZeneca, Cambridge, CB2 0AA, United Kingdom
- Charlotte M Deane
  - Department of Statistics, University of Oxford, Oxford, OX1 3LB, United Kingdom
27
Lobzaev E, Stracquadanio G. Dirichlet latent modelling enables effective learning and sampling of the functional protein design space. Nat Commun 2024; 15:9309. [PMID: 39468034 PMCID: PMC11519351 DOI: 10.1038/s41467-024-53622-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2023] [Accepted: 10/14/2024] [Indexed: 10/30/2024] Open
Abstract
Engineering proteins with desired functions and biochemical properties is pivotal for biotechnology and drug discovery. While computational methods based on evolutionary information are reducing the experimental burden by designing targeted libraries of functional variants, they still have a low success rate when the desired protein has few or very remote homologous sequences. Here we propose an autoregressive model, called Temporal Dirichlet Variational Autoencoder (TDVAE), which exploits the mathematical properties of the Dirichlet distribution and temporal convolution to efficiently learn high-order information from a functionally related, possibly remotely similar, set of sequences. TDVAE is highly accurate in predicting the effects of amino acid mutations, while being 90% smaller than the other state-of-the-art models. We then use TDVAE to design variants of the human alpha galactosidase enzymes as potential treatment for Fabry disease. Our model builds a library of diverse variants which retain sequence, biochemical and structural properties of the wildtype protein, suggesting they could be suitable for enzyme replacement therapy. Taken together, our results show the importance of accurate sequence modelling and the potential of autoregressive models as protein engineering and analysis tools.
Affiliation(s)
- Evgenii Lobzaev
  - School of Biological Sciences, The University of Edinburgh, Edinburgh, United Kingdom
  - School of Informatics, The University of Edinburgh, Edinburgh, United Kingdom
28
La Fleur A, Shi Y, Seelig G. Decoding biology with massively parallel reporter assays and machine learning. Genes Dev 2024; 38:843-865. [PMID: 39362779 PMCID: PMC11535156 DOI: 10.1101/gad.351800.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2024]
Abstract
Massively parallel reporter assays (MPRAs) are powerful tools for quantifying the impacts of sequence variation on gene expression. Reading out molecular phenotypes with sequencing enables interrogating the impact of sequence variation beyond genome scale. Machine learning models integrate and codify information learned from MPRAs and enable generalization by predicting sequences outside the training data set. Models can provide a quantitative understanding of cis-regulatory codes controlling gene expression, enable variant stratification, and guide the design of synthetic regulatory elements for applications from synthetic biology to mRNA and gene therapy. This review focuses on cis-regulatory MPRAs, particularly those that interrogate cotranscriptional and post-transcriptional processes: alternative splicing, cleavage and polyadenylation, translation, and mRNA decay.
Affiliation(s)
- Alyssa La Fleur
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
- Yongsheng Shi
  - Department of Microbiology and Molecular Genetics, School of Medicine, University of California, Irvine, Irvine, California 92697, USA
- Georg Seelig
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
  - Department of Electrical & Computer Engineering, University of Washington, Seattle, Washington 98195, USA
29
Mantena S, Pillai PP, Petros BA, Welch NL, Myhrvold C, Sabeti PC, Metsky HC. Model-directed generation of artificial CRISPR-Cas13a guide RNA sequences improves nucleic acid detection. Nat Biotechnol 2024:10.1038/s41587-024-02422-w. [PMID: 39394482 DOI: 10.1038/s41587-024-02422-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2023] [Accepted: 09/04/2024] [Indexed: 10/13/2024]
Abstract
CRISPR guide RNA sequences deriving exactly from natural sequences may not perform optimally in every application. Here we implement and evaluate algorithms for designing maximally fit, artificial CRISPR-Cas13a guides with multiple mismatches to natural sequences that are tailored for diagnostic applications. These guides offer more sensitive detection of diverse pathogens and discrimination of pathogen variants compared with guides derived directly from natural sequences and illuminate design principles that broaden Cas13a targeting.
Affiliation(s)
- Sreekar Mantena
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Statistics, Harvard University, Cambridge, MA, USA
  - Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, USA
- Brittany A Petros
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Division of Health Sciences and Technology, Harvard Medical School and Massachusetts Institute of Technology, Cambridge, MA, USA
  - MD-PhD Program, Harvard/Massachusetts Institute of Technology, Boston, MA, USA
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Cameron Myhrvold
  - Department of Molecular Biology, Princeton University, Princeton, NJ, USA
  - Department of Chemical and Biological Engineering, Princeton University, Princeton, NJ, USA
  - Omenn-Darling Bioengineering Institute, Princeton University, Princeton, NJ, USA
  - Department of Chemistry, Princeton University, Princeton, NJ, USA
- Pardis C Sabeti
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Chevy Chase, MD, USA
  - Department of Organismic and Evolutionary Biology, Harvard University, Cambridge, MA, USA
  - Department of Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA, USA
30
Deichmann M, Hansson FG, Jensen ED. Yeast-based screening platforms to understand and improve human health. Trends Biotechnol 2024; 42:1258-1272. [PMID: 38677901 DOI: 10.1016/j.tibtech.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2023] [Revised: 04/01/2024] [Accepted: 04/03/2024] [Indexed: 04/29/2024]
Abstract
Detailed molecular understanding of the human organism is essential to develop effective therapies. Saccharomyces cerevisiae has been used extensively for acquiring insights into important aspects of human health, such as studying genetics and cell-cell communication, elucidating protein-protein interaction (PPI) networks, and investigating human G protein-coupled receptor (hGPCR) signaling. We highlight recent advances and opportunities of yeast-based technologies for cost-efficient chemical library screening on hGPCRs, accelerated deciphering of PPI networks with mating-based screening and selection, and accurate cell-cell communication with human immune cells. Overall, yeast-based technologies constitute an important platform to support basic understanding and innovative applications towards improving human health.
Affiliation(s)
- Marcus Deichmann
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
- Frederik G Hansson
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
- Emil D Jensen
  - Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, DK-2800 Kongens Lyngby, Denmark
31
Xie X, Gui L, Qiao B, Wang G, Huang S, Zhao Y, Sun S. Deep learning in template-free de novo biosynthetic pathway design of natural products. Brief Bioinform 2024; 25:bbae495. [PMID: 39373052 PMCID: PMC11456888 DOI: 10.1093/bib/bbae495] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2024] [Revised: 09/12/2024] [Accepted: 09/20/2024] [Indexed: 10/08/2024] Open
Abstract
Natural products (NPs) are indispensable in drug development, particularly in combating infections, cancer, and neurodegenerative diseases. However, their limited availability poses significant challenges. Template-free de novo biosynthetic pathway design provides a strategic solution for NP production, with deep learning standing out as a powerful tool in this domain. This review delves into state-of-the-art deep learning algorithms in NP biosynthesis pathway design. It provides an in-depth discussion of databases like Kyoto Encyclopedia of Genes and Genomes (KEGG), Reactome, and UniProt, which are essential for model training, along with chemical databases such as Reaxys, SciFinder, and PubChem for transfer learning to expand models' understanding of the broader chemical space. It evaluates the potential and challenges of sequence-to-sequence and graph-to-graph translation models for accurate single-step prediction. Additionally, it discusses search algorithms for multistep prediction and deep learning algorithms for predicting enzyme function. The review also highlights the pivotal role of deep learning in improving catalytic efficiency through enzyme engineering, which is essential for enhancing NP production. Moreover, it examines the application of large language models in pathway design, enzyme discovery, and enzyme engineering. Finally, it addresses the challenges and prospects associated with template-free approaches, offering insights into potential advancements in NP biosynthesis pathway design.
Affiliation(s)
- Xueying Xie
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), No. 26 Hexing Road, Xiangfang District, Harbin 150001, China
  - College of Life Science, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
- Lin Gui
  - College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
- Baixue Qiao
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), No. 26 Hexing Road, Xiangfang District, Harbin 150001, China
  - College of Life Science, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
- Guohua Wang
  - College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
- Shan Huang
  - Department of Neurology, The Second Affiliated Hospital, Harbin Medical University, No. 246 Xuefu Road, Nangang District, Harbin 150081, China
- Yuming Zhao
  - College of Computer and Control Engineering, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
- Shanwen Sun
  - Key Laboratory of Saline-Alkali Vegetation Ecology Restoration, Ministry of Education (Northeast Forestry University), No. 26 Hexing Road, Xiangfang District, Harbin 150001, China
  - College of Life Science, Northeast Forestry University, No. 26 Hexing Road, Xiangfang District, Harbin 150040, China
32
Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic Language Models: Opportunities and Challenges. arXiv 2024:arXiv:2407.11435v2. [PMID: 39070037 PMCID: PMC11275703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/30/2024]
Abstract
Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of Natural Language Processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic Language Models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
Affiliation(s)
- Gonzalo Benegas
  - Computer Science Division, University of California, Berkeley
- Chengzhong Ye
  - Department of Statistics, University of California, Berkeley
- Carlos Albors
  - Computer Science Division, University of California, Berkeley
- Jianan Canal Li
  - Computer Science Division, University of California, Berkeley
- Yun S. Song
  - Computer Science Division, University of California, Berkeley
  - Department of Statistics, University of California, Berkeley
  - Center for Computational Biology, University of California, Berkeley
33
Draizen EJ, Veretnik S, Mura C, Bourne PE. Deep generative models of protein structure uncover distant relationships across a continuous fold space. Nat Commun 2024; 15:8094. [PMID: 39294145 PMCID: PMC11410806 DOI: 10.1038/s41467-024-52020-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Accepted: 08/23/2024] [Indexed: 09/20/2024] Open
Abstract
Our views of fold space implicitly rest upon many assumptions that impact how we analyze, interpret and understand protein structure, function and evolution. For instance, is there an optimal granularity in viewing protein structural similarities (e.g., architecture, topology or some other level)? Similarly, the discrete/continuous dichotomy of fold space is central, but remains unresolved. Discrete views of fold space bin similar folds into distinct, non-overlapping groups; unfortunately, such binning can miss remote relationships. While hierarchical systems like CATH are indispensable resources, less heuristic and more conceptually flexible approaches could enable more nuanced explorations of fold space. Building upon an Urfold model of protein structure, here we present a deep generative modeling framework, termed DeepUrfold, for analyzing protein relationships at scale. DeepUrfold's learned embeddings occupy high-dimensional latent spaces that can be distilled for a given protein in terms of an amalgamated representation uniting sequence, structure and biophysical properties. This approach is structure-guided, versus being purely structure-based, and DeepUrfold learns representations that, in a sense, define superfamilies. Deploying DeepUrfold with CATH reveals evolutionarily-remote relationships that evade existing methodologies, and suggests a mostly-continuous view of fold space, a view that extends beyond simple geometric similarity, towards the realm of integrated sequence ↔ structure ↔ function properties.
Affiliation(s)
- Eli J Draizen
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
- Stella Veretnik
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
- Cameron Mura
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
- Philip E Bourne
  - School of Data Science, University of Virginia, Charlottesville, VA, USA
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA, USA
34
Liu S, Shi T, Yu J, Li R, Lin H, Deng K. Research on Bitter Peptides in the Field of Bioinformatics: A Comprehensive Review. Int J Mol Sci 2024; 25:9844. [PMID: 39337334 PMCID: PMC11432553 DOI: 10.3390/ijms25189844] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/16/2024] [Revised: 09/06/2024] [Accepted: 09/09/2024] [Indexed: 09/30/2024] Open
Abstract
Bitter peptides are small molecular peptides produced by the hydrolysis of proteins under acidic, alkaline, or enzymatic conditions. These peptides can enhance food flavor and offer various health benefits, with attributes such as antihypertensive, antidiabetic, antioxidant, antibacterial, and immune-regulating properties. They show significant potential in the development of functional foods and the prevention and treatment of diseases. This review introduces the diverse sources of bitter peptides and discusses the mechanisms of bitterness generation and their physiological functions in the taste system. Additionally, it emphasizes the application of bioinformatics in bitter peptide research, including the establishment and improvement of bitter peptide databases, the use of quantitative structure-activity relationship (QSAR) models to predict bitterness thresholds, and the latest advancements in classification prediction models built using machine learning and deep learning algorithms for bitter peptide identification. Future research directions include enhancing databases, diversifying models, and applying generative models to deepen bitter peptide research and discover more practical applications.
Affiliation(s)
- Hao Lin
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; (S.L.); (T.S.); (J.Y.); (R.L.)
- Kejun Deng
  - School of Life Science and Technology, Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 610054, China; (S.L.); (T.S.); (J.Y.); (R.L.)
35
R VS, Choudhuri S, Ghosh B. Hybrid Diffusion Model for Stable, Affinity-Driven, Receptor-Aware Peptide Generation. J Chem Inf Model 2024; 64:6912-6925. [PMID: 39193724 DOI: 10.1021/acs.jcim.4c01020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/29/2024]
Abstract
The convergence of biotechnology and artificial intelligence has the potential to transform drug development, especially in the field of therapeutic peptide design. Peptides are short chains of amino acids with diverse therapeutic applications that offer several advantages over small-molecule drugs, such as targeted therapy and minimal side effects. However, poor oral bioavailability and enzymatic degradation have limited their effectiveness. With advances in deep learning techniques, innovative approaches to peptide design have become possible. In this work, we demonstrate HYDRA, a hybrid deep learning approach that leverages the distribution modeling capabilities of a diffusion model and combines it with a binding affinity maximization algorithm that can be used for de novo design of peptide binders for various target receptors. As an application, we have used our approach to design therapeutic peptides targeting proteins expressed by Plasmodium falciparum erythrocyte membrane protein 1 (PfEMP1) genes. The ability of HYDRA to generate peptides conditioned on the target receptor's binding sites makes it a promising approach for developing effective therapies for malaria and other diseases.
Affiliation(s)
- Vishva Saravanan R
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500032, India
- Soham Choudhuri
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500032, India
- Bhaswar Ghosh
  - Center for Computational Natural Sciences and Bioinformatics, International Institute of Information Technology, Hyderabad 500032, India
36
Xie X, Valiente PA, Lee JS, Kim J, Kim PM. Antibody-SGM, a Score-Based Generative Model for Antibody Heavy-Chain Design. J Chem Inf Model 2024; 64:6745-6757. [PMID: 39189360 DOI: 10.1021/acs.jcim.4c00711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/28/2024]
Abstract
Traditional computational methods for antibody design involved random mutagenesis followed by energy function assessment for candidate selection. Recently, diffusion models have garnered considerable attention as cutting-edge generative models, lauded for their remarkable performance. However, these methods often focus solely on the backbone or sequence, resulting in the incomplete depiction of the overall structure and necessitating additional techniques to predict the missing component. This study presents Antibody-SGM, an innovative joint structure-sequence diffusion model that addresses the limitations of existing protein backbone generation models. Unlike previous models, Antibody-SGM successfully integrates sequence-specific attributes and functional properties into the generation process. Our methodology generates full-atom native-like antibody heavy chains by refining the generation to create valid pairs of sequences and structures, starting with random sequences and structural properties. The versatility of our method is demonstrated through various applications, including the design of full-atom antibodies, antigen-specific CDR design, antibody heavy-chain optimization, validation with AlphaFold3, and the identification of crucial antibody sequences and structural features. Antibody-SGM also optimizes protein function through active inpainting learning, allowing simultaneous sequence and structure optimization. These improvements demonstrate the promise of our strategy for protein engineering and significantly increase the power of protein design models.
Affiliation(s)
- Xuezhi Xie
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3E1, Canada
- Pedro A Valiente
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada
- Jin Sub Lee
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario M5S 3E1, Canada
- Jisun Kim
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada
- Philip M Kim
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario M5S 3E1, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario M5S 3E1, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario M5S 3E1, Canada
37
Capponi S, Wang S. AI in cellular engineering and reprogramming. Biophys J 2024; 123:2658-2670. [PMID: 38576162 PMCID: PMC11393708 DOI: 10.1016/j.bpj.2024.04.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2023] [Revised: 03/19/2024] [Accepted: 04/01/2024] [Indexed: 04/06/2024] Open
Abstract
During the last decade, artificial intelligence (AI) has increasingly been applied in biophysics and related fields, including cellular engineering and reprogramming, offering novel approaches to understand, manipulate, and control cellular function. The potential of AI lies in its ability to analyze complex datasets and generate predictive models. AI algorithms can process large amounts of data from single-cell genomics and multiomic technologies, allowing researchers to gain mechanistic insights into the control of cell identity and function. By integrating and interpreting these complex datasets, AI can help identify key molecular events and regulatory pathways involved in cellular reprogramming. This knowledge can inform the design of precision engineering strategies, such as the development of new transcription factor and signaling molecule cocktails, to manipulate cell identity and drive authentic cell fate across lineage boundaries. Furthermore, when used in combination with computational methods, AI can accelerate and improve the analysis and understanding of the intricate relationships between genes, proteins, and cellular processes. In this review article, we explore the current state of AI applications in biophysics with a specific focus on cellular engineering and reprogramming. Then, we showcase a couple of recent applications where we combined machine learning with experimental and computational techniques. Finally, we briefly discuss the challenges and prospects of AI in cellular engineering and reprogramming, emphasizing the potential of these technologies to revolutionize our ability to engineer cells for a variety of applications, from disease modeling and drug discovery to regenerative medicine and biomanufacturing.
Affiliation(s)
- Sara Capponi
  - IBM Almaden Research Center, San Jose, California; Center for Cellular Construction, San Francisco, California
- Shangying Wang
  - Bay Area Institute of Science, Altos Labs, Redwood City, California
38
He H, He B, Guan L, Zhao Y, Jiang F, Chen G, Zhu Q, Chen CYC, Li T, Yao J. De novo generation of SARS-CoV-2 antibody CDRH3 with a pre-trained generative large language model. Nat Commun 2024; 15:6867. [PMID: 39127753 PMCID: PMC11316817 DOI: 10.1038/s41467-024-50903-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 07/23/2024] [Indexed: 08/12/2024] Open
Abstract
Artificial Intelligence (AI) techniques have made great advances in assisting antibody design. However, antibody design still heavily relies on isolating antigen-specific antibodies from serum, which is a resource-intensive and time-consuming process. To address this issue, we propose a Pre-trained Antibody generative large Language Model (PALM-H3) for the de novo generation of artificial antibody heavy chain complementarity-determining region 3 (CDRH3) sequences with desired antigen-binding specificity, reducing the reliance on natural antibodies. We also build a high-precision model antigen-antibody binder (A2binder) that pairs antigen epitope sequences with antibody sequences to predict binding specificity and affinity. PALM-H3-generated antibodies exhibit binding ability to SARS-CoV-2 antigens, including the emerging XBB variant, as confirmed through in-silico analysis and in-vitro assays. The in-vitro assays validate that PALM-H3-generated antibodies achieve high binding affinity and potent neutralization capability against spike proteins of SARS-CoV-2 wild-type, Alpha, Delta, and the emerging XBB variant. Meanwhile, A2binder demonstrates exceptional predictive performance on binding specificity for various epitopes and variants. Furthermore, by incorporating the attention mechanism inherent in the Roformer architecture into the PALM-H3 model, we improve its interpretability, providing crucial insights into the fundamental principles of antibody design.
Affiliation(s)
- Haohuai He
  - AI Lab, Tencent, Shenzhen, 518052, China
  - Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China
- Bing He
  - AI Lab, Tencent, Shenzhen, 518052, China
- Lei Guan
  - State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Xi'an, China
- Yu Zhao
  - AI Lab, Tencent, Shenzhen, 518052, China
- Feng Jiang
  - AI Lab, Tencent, Shenzhen, 518052, China
- Guanxing Chen
  - Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China
- Qingge Zhu
  - State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Xi'an, China
- Calvin Yu-Chian Chen
  - AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
  - State Key Laboratory of Chemical Oncogenomics, School of Chemical Biology and Biotechnology, Peking University Shenzhen Graduate School, Shenzhen, 518055, China
  - Department of Medical Research, China Medical University Hospital, Taichung, 40447, Taiwan
  - Department of Bioinformatics and Medical Engineering, Asia University, Taichung, 41354, Taiwan
  - Guangdong L-Med Biotechnology Co. Ltd, Meizhou, 514699, Guangdong, China
- Ting Li
  - State Key Laboratory of Holistic Integrative Management of Gastrointestinal Cancers and National Clinical Research Center for Digestive Diseases, Xijing Hospital of Digestive Diseases, Xi'an, China
39
Eid FE, Chen AT, Chan KY, Huang Q, Zheng Q, Tobey IG, Pacouret S, Brauer PP, Keyes C, Powell M, Johnston J, Zhao B, Lage K, Tarantal AF, Chan YA, Deverman BE. Systematic multi-trait AAV capsid engineering for efficient gene delivery. Nat Commun 2024; 15:6602. [PMID: 39097583 PMCID: PMC11297966 DOI: 10.1038/s41467-024-50555-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Accepted: 07/16/2024] [Indexed: 08/05/2024] Open
Abstract
Broadening gene therapy applications requires manufacturable vectors that efficiently transduce target cells in humans and preclinical models. Conventional selections of adeno-associated virus (AAV) capsid libraries are inefficient at searching the vast sequence space for the small fraction of vectors possessing multiple traits essential for clinical translation. Here, we present Fit4Function, a generalizable machine learning (ML) approach for systematically engineering multi-trait AAV capsids. By leveraging a capsid library that uniformly samples the manufacturable sequence space, reproducible screening data are generated to train accurate sequence-to-function models. Combining six models, we designed a multi-trait (liver-targeted, manufacturable) capsid library and validated 88% of library variants on all six predetermined criteria. Furthermore, the models, trained only on mouse in vivo and human in vitro Fit4Function data, accurately predicted AAV capsid variant biodistribution in macaque. Top candidates exhibited production yields comparable to AAV9, efficient murine liver transduction, up to 1000-fold greater human hepatocyte transduction, and increased enrichment relative to AAV9 in a screen for liver transduction in macaques. The Fit4Function strategy ultimately makes it possible to predict cross-species traits of peptide-modified AAV capsids and is a critical step toward assembling an ML atlas that predicts AAV capsid performance across dozens of traits.
Affiliation(s)
- Fatma-Elzahraa Eid
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
  - Department of Systems and Computer Engineering, Faculty of Engineering, Al-Azhar University, Cairo, Egypt.
- Albert T Chen
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Ken Y Chan
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Qin Huang
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Qingxia Zheng
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Isabelle G Tobey
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Simon Pacouret
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Pamela P Brauer
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Casey Keyes
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Megan Powell
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Jencilin Johnston
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Binhui Zhao
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Kasper Lage
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Surgery, Massachusetts General Hospital, Boston, MA, USA
  - Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Institute of Biological Psychiatry, Mental Health Center St. Hans, Mental Health Services, Copenhagen, Denmark
- Alice F Tarantal
  - Departments of Pediatrics and Cell Biology and Human Anatomy, School of Medicine, and California National Primate Research Center, University of California, Davis, CA, USA
- Yujia A Chan
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Benjamin E Deverman
  - Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA.
40
Simon E, Swanson K, Zou J. Language models for biological research: a primer. Nat Methods 2024; 21:1422-1429. [PMID: 39122951 DOI: 10.1038/s41592-024-02354-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2024] [Accepted: 06/18/2024] [Indexed: 08/12/2024]
Abstract
Language models are playing an increasingly important role in many areas of artificial intelligence (AI) and computational biology. In this primer, we discuss the ways in which language models, both those based on natural language and those based on biological sequences, can be applied to biological research. This primer is primarily intended for biologists interested in using these cutting-edge AI technologies in their applications. We provide guidance on best practices and key resources for adapting language models for biology.
Affiliation(s)
- Elana Simon
  - Department of Biomedical Data Science, Stanford University, Stanford, USA
- Kyle Swanson
  - Department of Computer Science, Stanford University, Stanford, USA
- James Zou
  - Department of Biomedical Data Science, Stanford University, Stanford, USA.
  - Department of Computer Science, Stanford University, Stanford, USA.
  - Chan-Zuckerberg Biohub, San Francisco, USA.
41
Bashour H, Smorodina E, Pariset M, Zhong J, Akbar R, Chernigovskaya M, Lê Quý K, Snapkow I, Rawat P, Krawczyk K, Sandve GK, Gutierrez-Marcos J, Gutierrez DNZ, Andersen JT, Greiff V. Biophysical cartography of the native and human-engineered antibody landscapes quantifies the plasticity of antibody developability. Commun Biol 2024; 7:922. [PMID: 39085379 PMCID: PMC11291509 DOI: 10.1038/s42003-024-06561-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2024] [Accepted: 07/05/2024] [Indexed: 08/02/2024] [Open access]
Abstract
Designing effective monoclonal antibody (mAb) therapeutics faces a multi-parameter optimization challenge known as "developability", which reflects an antibody's ability to progress through development stages based on its physicochemical properties. While natural antibodies may provide valuable guidance for mAb selection, we lack a comprehensive understanding of natural developability parameter (DP) plasticity (redundancy, predictability, sensitivity) and how the DP landscapes of human-engineered and natural antibodies relate to one another. These gaps hinder fundamental developability profile cartography. To chart natural and engineered DP landscapes, we computed 40 sequence- and 46 structure-based DPs of over two million native and human-engineered single-chain antibody sequences. We find lower redundancy among structure-based compared to sequence-based DPs. Sequence DP sensitivity to single amino acid substitutions varied by antibody region and DP, and structure DP values varied across the conformational ensemble of antibody structures. We show that sequence DPs are more predictable than structure-based ones across different machine-learning tasks and embeddings, indicating a constrained sequence-based design space. Human-engineered antibodies localize within the developability and sequence landscapes of natural antibodies, suggesting that human-engineered antibodies explore mere subspaces of the natural one. Our work quantifies the plasticity of antibody developability, providing a fundamental resource for multi-parameter therapeutic mAb design.
Affiliation(s)
- Habib Bashour
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway.
  - School of Life Sciences, University of Warwick, Coventry, UK.
- Eva Smorodina
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Jahn Zhong
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
  - Division of Genetics, Department Biology, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany
- Rahmad Akbar
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Maria Chernigovskaya
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Khang Lê Quý
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Igor Snapkow
  - Department of Chemical Toxicology, Norwegian Institute of Public Health, Oslo, Norway
- Puneet Rawat
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
- Jan Terje Andersen
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway
  - Department of Pharmacology, University of Oslo and Oslo University Hospital, Oslo, Norway
  - Precision Immunotherapy Alliance (PRIMA), University of Oslo, Oslo, Norway
- Victor Greiff
  - Department of Immunology, University of Oslo and Oslo University Hospital, Oslo, Norway.
42
Kantroo P, Wagner GP, Machta BB. Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation. bioRxiv 2024:2024.07.09.602754. [PMID: 39026871 PMCID: PMC11257618 DOI: 10.1101/2024.07.09.602754] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2024]
Abstract
Protein language models trained on the masked language modeling objective learn to predict the identity of hidden amino acid residues within a sequence using the remaining observable sequence as context. They do so by embedding the residues into a high dimensional space that encapsulates the relevant contextual cues. These embedding vectors serve as an informative context-sensitive representation that not only aids with the defined training objective, but can also be used for other tasks by downstream models. We propose a scheme to use the embeddings of an unmasked sequence to estimate the corresponding masked probability vectors for all the positions in a single forward pass through the language model. This One Fell Swoop (OFS) approach allows us to efficiently estimate the pseudo-perplexity of the sequence, a measure of the model's uncertainty in its predictions, that can also serve as a fitness estimate. We find that ESM2 OFS pseudo-perplexity performs nearly as well as the true pseudo-perplexity at fitness estimation, and more notably it defines a new state of the art on the ProteinGym Indels benchmark. The strong performance of the fitness measure prompted us to investigate if it could be used to detect the elevated stability reported in reconstructed ancestral sequences. We find that this measure ranks ancestral reconstructions as more fit than extant sequences. Finally, we show that the computational efficiency of the technique allows for the use of Monte Carlo methods that can rapidly explore functional sequence space.
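For readers unfamiliar with the quantity named in the abstract above: the pseudo-perplexity of a sequence under a masked language model is commonly defined as the exponential of the negative mean log-probability that the model assigns to each true residue when that position is masked. The sketch below is a minimal illustration of that definition only. It assumes the per-position probabilities have already been obtained from some model; the function name and the example values are hypothetical and are not taken from the paper.

```python
import math

def pseudo_perplexity(true_residue_probs):
    """Pseudo-perplexity from per-position probabilities.

    true_residue_probs[i] is the probability a masked language model assigned
    to the actual residue at position i when that position was masked.
    """
    if not true_residue_probs:
        raise ValueError("need at least one position")
    mean_log_prob = sum(math.log(p) for p in true_residue_probs) / len(true_residue_probs)
    return math.exp(-mean_log_prob)

# Hypothetical probabilities, for illustration only.
uniform = pseudo_perplexity([0.05] * 10)    # a model guessing uniformly over 20 residues
confident = pseudo_perplexity([0.9] * 10)   # a model confident in every residue
```

Lower values mean the model finds the sequence more probable. The contribution described in the abstract, estimating all masked probability vectors in a single forward pass, is not attempted in this sketch.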
Affiliation(s)
- Pranav Kantroo
  - Computational Biology and Bioinformatics Program, Yale University, New Haven, CT-06520, USA
  - Quantitative Biology Institute, Yale University, New Haven, CT-06520, USA
- Günter P. Wagner
  - Emeritus, Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT-06520, USA
  - Department of Evolutionary Biology, University of Vienna, Djerassi Platz 1, A-1030 Vienna, Austria
  - Hagler Institute for Advanced Studies, Texas A&M, College Station, TX-77843, USA
- Benjamin B. Machta
  - Department of Physics, Yale University, New Haven, CT-06520, USA
  - Quantitative Biology Institute, Yale University, New Haven, CT-06520, USA
43
Kantroo P, Wagner GP, Machta BB. Pseudo-perplexity in One Fell Swoop for Protein Fitness Estimation. arXiv 2024:arXiv:2407.07265v1. [PMID: 39040648 PMCID: PMC11261985] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/24/2024]
Abstract
Protein language models trained on the masked language modeling objective learn to predict the identity of hidden amino acid residues within a sequence using the remaining observable sequence as context. They do so by embedding the residues into a high dimensional space that encapsulates the relevant contextual cues. These embedding vectors serve as an informative context-sensitive representation that not only aids with the defined training objective, but can also be used for other tasks by downstream models. We propose a scheme to use the embeddings of an unmasked sequence to estimate the corresponding masked probability vectors for all the positions in a single forward pass through the language model. This One Fell Swoop (OFS) approach allows us to efficiently estimate the pseudo-perplexity of the sequence, a measure of the model's uncertainty in its predictions, that can also serve as a fitness estimate. We find that ESM2 OFS pseudo-perplexity performs nearly as well as the true pseudo-perplexity at fitness estimation, and more notably it defines a new state of the art on the ProteinGym Indels benchmark. The strong performance of the fitness measure prompted us to investigate if it could be used to detect the elevated stability reported in reconstructed ancestral sequences. We find that this measure ranks ancestral reconstructions as more fit than extant sequences. Finally, we show that the computational efficiency of the technique allows for the use of Monte Carlo methods that can rapidly explore functional sequence space.
Affiliation(s)
- Pranav Kantroo
  - Computational Biology and Bioinformatics Program, Yale University, New Haven, CT-06520, USA
  - Quantitative Biology Institute, Yale University, New Haven, CT-06520, USA
- Günter P. Wagner
  - Emeritus, Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT-06520, USA
  - Department of Evolutionary Biology, University of Vienna, Djerassi Platz 1, A-1030 Vienna, Austria
  - Hagler Institute for Advanced Studies, Texas A&M, College Station, TX-77843, USA
- Benjamin B. Machta
  - Department of Physics, Yale University, New Haven, CT-06520, USA
  - Quantitative Biology Institute, Yale University, New Haven, CT-06520, USA
44
Shanker VR, Bruun TU, Hie BL, Kim PS. Unsupervised evolution of protein and antibody complexes with a structure-informed language model. Science 2024; 385:46-53. [PMID: 38963838 PMCID: PMC11616794 DOI: 10.1126/science.adk8946] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Accepted: 05/29/2024] [Indexed: 07/06/2024]
Abstract
Large language models trained on sequence information alone can learn high-level principles of protein design. However, beyond sequence, the three-dimensional structures of proteins determine their specific function, activity, and evolvability. Here, we show that a general protein language model augmented with protein structure backbone coordinates can guide evolution for diverse proteins without the need to model individual functional tasks. We also demonstrate that ESM-IF1, which was only trained on single-chain structures, can be extended to engineer protein complexes. Using this approach, we screened about 30 variants of two therapeutic clinical antibodies used to treat severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection. We achieved up to 25-fold improvement in neutralization and 37-fold improvement in affinity against antibody-escaped viral variants of concern BQ.1.1 and XBB.1.5, respectively. These findings highlight the advantage of integrating structural information to identify efficient protein evolution trajectories without requiring any task-specific training data.
Affiliation(s)
- Varun R. Shanker
  - Stanford Biophysics Program, Stanford University School of Medicine; Stanford, CA 94305, USA
  - Stanford Medical Scientist Training Program, Stanford University School of Medicine; Stanford CA 94305, USA
  - Sarafan ChEM-H, Stanford University; Stanford, CA 94305, USA
- Theodora U.J. Bruun
  - Stanford Medical Scientist Training Program, Stanford University School of Medicine; Stanford CA 94305, USA
  - Sarafan ChEM-H, Stanford University; Stanford, CA 94305, USA
  - Department of Biochemistry, Stanford University School of Medicine; Stanford, CA 94305, USA
- Brian L. Hie
  - Sarafan ChEM-H, Stanford University; Stanford, CA 94305, USA
  - Department of Biochemistry, Stanford University School of Medicine; Stanford, CA 94305, USA
- Peter S. Kim
  - Sarafan ChEM-H, Stanford University; Stanford, CA 94305, USA
  - Department of Biochemistry, Stanford University School of Medicine; Stanford, CA 94305, USA
  - Chan Zuckerberg Biohub, San Francisco, CA 94158, USA
45
Bai Q, Xu T, Huang J, Pérez-Sánchez H. Geometric deep learning methods and applications in 3D structure-based drug design. Drug Discov Today 2024; 29:104024. [PMID: 38759948 DOI: 10.1016/j.drudis.2024.104024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Revised: 05/02/2024] [Accepted: 05/10/2024] [Indexed: 05/19/2024]
Abstract
3D structure-based drug design (SBDD) is considered a challenging and rational way for innovative drug discovery. Geometric deep learning is a promising approach that solves the accurate model training of 3D SBDD through building neural network models to learn non-Euclidean data, such as 3D molecular graphs and manifold data. Here, we summarize geometric deep learning methods and applications that contain 3D molecular representations, equivariant graph neural networks (EGNNs), and six generative model methods [diffusion model, flow-based model, generative adversarial networks (GANs), variational autoencoder (VAE), autoregressive models, and energy-based models]. Our review provides insights into geometric deep learning methods and advanced applications of 3D SBDD that will be of relevance for the drug discovery community.
Affiliation(s)
- Qifeng Bai
  - School of Basic Medical Sciences, Lanzhou University, Lanzhou 730000, Gansu, PR China.
- Junzhou Huang
  - Department of Computer Science and Engineering, the University of Texas at Arlington, Arlington, TX 76019, USA
- Horacio Pérez-Sánchez
  - Structural Bioinformatics and High Performance Computing Research Group (BIO-HPC), Computer Engineering Department, UCAM Universidad Católica de Murcia, Murcia 30107, Spain.
46
Chen H, Fan X, Zhu S, Pei Y, Zhang X, Zhang X, Liu L, Qian F, Tian B. Accurate prediction of CDR-H3 loop structures of antibodies with deep learning. eLife 2024; 12:RP91512. [PMID: 38921957 PMCID: PMC11208048 DOI: 10.7554/elife.91512] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/27/2024] [Open access]
Abstract
Accurate prediction of the structurally diverse complementarity determining region heavy chain 3 (CDR-H3) loop structure remains a primary and long-standing challenge for antibody modeling. Here, we present the H3-OPT toolkit for predicting the 3D structures of monoclonal antibodies and nanobodies. H3-OPT combines the strengths of AlphaFold2 with a pre-trained protein language model and provides a 2.24 Å average RMSDCα between predicted and experimentally determined CDR-H3 loops, thus outperforming other current computational methods in our non-redundant high-quality dataset. The model was validated by experimentally solving three structures of anti-VEGF nanobodies predicted by H3-OPT. We examined the potential applications of H3-OPT through analyzing antibody surface properties and antibody-antigen interactions. This structural prediction tool can be used to optimize antibody-antigen binding and engineer therapeutic antibodies with biophysical properties for specialized drug administration route.
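The 2.24 Å figure quoted above is a root-mean-square deviation over Cα atoms. As a rough illustration of how such a number is computed, the sketch below takes two lists of matched Cα coordinates and returns their RMSD. It assumes the two structures are already superposed in a common frame; how the authors superposed structures before measuring the loop is not stated in the abstract, and the function name and coordinates here are hypothetical.

```python
import math

def ca_rmsd(predicted, reference):
    """RMSD, in the units of the inputs, between matched C-alpha coordinates.

    Assumes both structures are already superposed; no alignment is performed here.
    """
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (px - rx) ** 2 + (py - ry) ** 2 + (pz - rz) ** 2
        for (px, py, pz), (rx, ry, rz) in zip(predicted, reference)
    )
    return math.sqrt(squared / len(predicted))

# Hypothetical three-residue loop, coordinates in angstroms.
reference = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0)]
predicted = [(0.0, 2.0, 0.0), (3.8, 2.0, 0.0), (7.6, 2.0, 0.0)]
loop_rmsd = ca_rmsd(predicted, reference)  # every atom is displaced by 2 angstroms
```

An average such as the one reported would then be the mean of this value over all loops in a benchmark set.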
Affiliation(s)
- Hedi Chen
  - MOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Xiaoyu Fan
  - MOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Shuqian Zhu
  - MOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Yuchan Pei
  - Tsinghua Institute of Multidisciplinary Biomedical Research, Tsinghua University, Beijing, China
- Xiaochun Zhang
  - MOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Xiaonan Zhang
  - Department of Natural Language Processing, Baidu International Technology (Shenzhen) Co Ltd, Shenzhen, China
- Lihang Liu
  - Department of Natural Language Processing, Baidu International Technology (Shenzhen) Co Ltd, Shenzhen, China
- Feng Qian
  - MOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Boxue Tian
  - MOE Key Laboratory of Bioinformatics, State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
47
Cocco S, Posani L, Monasson R. Functional effects of mutations in proteins can be predicted and interpreted by guided selection of sequence covariation information. Proc Natl Acad Sci U S A 2024; 121:e2312335121. [PMID: 38889151 PMCID: PMC11214004 DOI: 10.1073/pnas.2312335121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/21/2023] [Accepted: 04/21/2024] [Indexed: 06/20/2024] [Open access]
Abstract
Predicting the effects of one or more mutations to the in vivo or in vitro properties of a wild-type protein is a major computational challenge, due to the presence of epistasis, that is, of interactions between amino acids in the sequence. We introduce a computationally efficient procedure to build minimal epistatic models to predict mutational effects by combining evolutionary (homologous sequence) and few mutational-scan data. Mutagenesis measurements guide the selection of links in a sparse graphical model, while the parameters on the nodes and the edges are inferred from sequence data. We show, on 10 mutational scans, that our pipeline exhibits performances comparable to state-of-the-art deep networks trained on many more data, while requiring much less parameters and being hence more interpretable. In particular, the identified interactions adapt to the wild-type protein and to the fitness or biochemical property experimentally measured, mostly focus on key functional sites, and are not necessarily related to structural contacts. Therefore, our method is able to extract information relevant for one mutational experiment from homologous sequence data reflecting the multitude of structural and functional constraints acting on proteins throughout evolution.
Affiliation(s)
- Simona Cocco
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Lorenzo Posani
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
- Rémi Monasson
  - Laboratory of Physics of the Ecole Normale Supérieure, CNRS UMR8023 and Paris Sciences & Lettres (PSL) Research, Sorbonne Université, 75005 Paris, France
48
Fram B, Su Y, Truebridge I, Riesselman AJ, Ingraham JB, Passera A, Napier E, Thadani NN, Lim S, Roberts K, Kaur G, Stiffler MA, Marks DS, Bahl CD, Khan AR, Sander C, Gauthier NP. Simultaneous enhancement of multiple functional properties using evolution-informed protein design. Nat Commun 2024; 15:5141. [PMID: 38902262 PMCID: PMC11190266 DOI: 10.1038/s41467-024-49119-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2023] [Accepted: 05/24/2024] [Indexed: 06/22/2024] [Open access]
Abstract
A major challenge in protein design is to augment existing functional proteins with multiple property enhancements. Altering several properties likely necessitates numerous primary sequence changes, and novel methods are needed to accurately predict combinations of mutations that maintain or enhance function. Models of sequence co-variation (e.g., EVcouplings), which leverage extensive information about various protein properties and activities from homologous protein sequences, have proven effective for many applications including structure determination and mutation effect prediction. We apply EVcouplings to computationally design variants of the model protein TEM-1 β-lactamase. Nearly all the 14 experimentally characterized designs were functional, including one with 84 mutations from the nearest natural homolog. The designs also had large increases in thermostability, increased activity on multiple substrates, and nearly identical structure to the wild type enzyme. This study highlights the efficacy of evolutionary models in guiding large sequence alterations to generate functional diversity for protein design applications.
Affiliation(s)
- Benjamin Fram
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA.
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
- Yang Su
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Ian Truebridge
  - Institute for Protein Innovation, Boston, MA, USA
  - Division of Hematology/Oncology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
  - AI Proteins, Boston, MA, USA
- Adam J Riesselman
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Program in Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- John B Ingraham
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Alessandro Passera
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Research Institute of Molecular Pathology (IMP), Vienna BioCenter (VBC), Campus-Vienna-Biocenter 1, 1030, Vienna, Austria
- Eve Napier
  - School of Biochemistry and Immunology, Trinity College Dublin, Dublin 2, Ireland
- Nicole N Thadani
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Apriori Bio, Cambridge, MA, USA
- Samuel Lim
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Kristen Roberts
  - Selux Diagnostics Inc., 56 Roland Street, Charlestown, MA, USA
- Gurleen Kaur
  - Selux Diagnostics Inc., 56 Roland Street, Charlestown, MA, USA
- Michael A Stiffler
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Dyno Therapeutics, 343 Arsenal Street, Watertown, MA, USA
- Debora S Marks
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Christopher D Bahl
  - Institute for Protein Innovation, Boston, MA, USA
  - Division of Hematology/Oncology, Boston Children's Hospital, Harvard Medical School, Boston, MA, USA
  - AI Proteins, Boston, MA, USA
- Amir R Khan
  - School of Biochemistry and Immunology, Trinity College Dublin, Dublin 2, Ireland
  - Division of Newborn Medicine, Boston Children's Hospital, Boston, MA, USA
- Chris Sander
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Nicholas P Gauthier
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA.
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA.
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA.
49
Hamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, Strauss CEM, Leman JK, Cho K, Bonneau R. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol 2024; 42:975-985. [PMID: 37679542 PMCID: PMC11180608 DOI: 10.1038/s41587-023-01917-2] [Citation(s) in RCA: 29] [Impact Index Per Article: 29.0] [Received: 09/07/2022] [Accepted: 07/26/2023] [Indexed: 09/09/2023]
Abstract
Exploiting sequence-structure-function relationships in biotechnology requires improved methods for aligning proteins that have low sequence similarity to previously annotated proteins. We develop two deep learning methods to address this gap, TM-Vec and DeepBLAST. TM-Vec allows searching for structure-structure similarities in large sequence databases. It is trained to accurately predict TM-scores as a metric of structural similarity directly from sequence pairs without the need for intermediate computation or solution of structures. Once structurally similar proteins have been identified, DeepBLAST can structurally align proteins using only sequence information by identifying structurally homologous regions between proteins. It outperforms traditional sequence alignment methods and performs similarly to structure-based alignment methods. We show the merits of TM-Vec and DeepBLAST on a variety of datasets, including better identification of remotely homologous proteins compared with state-of-the-art sequence alignment and structure prediction methods.
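The TM-score that TM-Vec is trained to predict has a standard closed form: for a target of length L, each aligned residue pair at distance d contributes 1 / (1 + (d / d0)^2), with d0 = 1.24 (L - 15)^(1/3) - 1.8, and the sum is divided by L. The sketch below evaluates that expression for one fixed superposition. It is an illustration of the metric only, not of TM-Vec or DeepBLAST; the true TM-score is the maximum over superpositions, which this sketch does not search for, and the function names and inputs are hypothetical.

```python
def tm_d0(target_length):
    """Length-dependent distance scale d0 from the standard TM-score definition."""
    if target_length <= 15:
        raise ValueError("this d0 formula is only meaningful for lengths above 15")
    return 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8

def tm_score_fixed_superposition(distances, target_length):
    """TM-score-style sum for one given superposition.

    distances holds the distance (in angstroms) between each aligned residue pair.
    Unaligned target residues contribute nothing but still count in target_length.
    """
    d0 = tm_d0(target_length)
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length

# Hypothetical inputs for a 100-residue target.
perfect = tm_score_fixed_superposition([0.0] * 100, 100)           # identical structures
at_scale = tm_score_fixed_superposition([tm_d0(100)] * 100, 100)   # every pair exactly d0 apart
```

Because d0 grows with length, the score is designed to be comparable across proteins of different sizes, which is what makes it usable as a single regression target.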
Grants
- R35GM122515 National Science Foundation (NSF)
- IOS-1546218 National Science Foundation (NSF)
- R35 GM122515 NIGMS NIH HHS
- R01 DK103358 NIDDK NIH HHS
- CBET- 1728858 National Science Foundation (NSF)
- R01 AI130945 NIAID NIH HHS
- This research was supported by NIH R01DK103358, the Simons Foundation, NSF- IOS-1546218, R35GM122515, NSF CBET- 1728858, NIH R01AI130945, to T.H. This research was supported by the intramural research program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) to J.T.M. This research was supported by the Flatiron Institute as part of the Simons Foundation to Robert Blackwell, J.K.L., and N.C. This research was supported by Los Alamos National Lab to C.S. This research was supported by the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI), Samsung Research (Improving Deep Learning using Latent Structure), and NSF Award 1922658 to K.C.
- Simons Foundation
- U.S. Department of Health & Human Services | NIH | Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD)
Affiliation(s)
- Tymor Hamamsy
- Center for Data Science, New York University, New York, NY, USA
| | - James T Morton
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
| | - Robert Blackwell
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Daniel Berenberg
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
- Prescient Design, New York, NY, USA
| | - Nicholas Carriero
- Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
| | | | | | - Julia Koehler Leman
- Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
| | - Kyunghyun Cho
- Center for Data Science, New York University, New York, NY, USA
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
- Prescient Design, New York, NY, USA
- CIFAR, Toronto, Ontario, Canada
| | - Richard Bonneau
- Center for Data Science, New York University, New York, NY, USA
- Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
- Prescient Design, New York, NY, USA
- Department of Biology, New York University, New York, NY, USA
| |
|
50
|
Jing H, Gao Z, Xu S, Shen T, Peng Z, He S, You T, Ye S, Lin W, Sun S. Accurate prediction of antibody function and structure using bio-inspired antibody language model. Brief Bioinform 2024; 25:bbae245. [PMID: 38797969 PMCID: PMC11128484 DOI: 10.1093/bib/bbae245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2023] [Revised: 04/08/2024] [Accepted: 05/07/2024] [Indexed: 05/29/2024] Open
Abstract
In recent decades, antibodies have emerged as indispensable therapeutics for combating diseases, particularly viral infections. However, their development has been hindered by limited structural information and labor-intensive engineering processes. Fortunately, significant advancements in deep learning methods have facilitated the precise prediction of protein structure and function by leveraging co-evolution information from homologous proteins. Despite these advances, predicting the conformation of antibodies remains challenging due to their unique evolution and the high flexibility of their antigen-binding regions. Here, to address this challenge, we present the Bio-inspired Antibody Language Model (BALM). This model is trained on a vast dataset comprising 336 million 40% nonredundant unlabeled antibody sequences, capturing both unique and conserved properties specific to antibodies. Notably, BALM showcases exceptional performance across four antigen-binding prediction tasks. Moreover, we introduce BALMFold, an end-to-end method derived from BALM, capable of swiftly predicting full atomic antibody structures from individual sequences. Remarkably, BALMFold outperforms well-established methods such as AlphaFold2, IgFold, ESMFold and OmegaFold on the antibody benchmark, demonstrating significant potential to advance innovative engineering and streamline therapeutic antibody development by reducing the need for unnecessary trials. The BALMFold structure prediction server is freely available at https://beamlab-sh.com/models/BALMFold.
Affiliation(s)
- Hongtai Jing
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200032, China
| | - Zhengtao Gao
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
| | - Sheng Xu
- Shanghai AI Laboratory, Shanghai 200232, China
| | - Tao Shen
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Zelixir Biotech, Shanghai 201206, China
| | - Zhangzhi Peng
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
| | - Shwai He
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
| | - Tao You
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
| | - Shuang Ye
- Department of Gynecologic Oncology, Fudan University Shanghai Cancer Center, Shanghai 200032, China
- Department of Oncology, Shanghai Medical College, Fudan University, Shanghai 200032, China
| | - Wei Lin
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai 200433, China
- MOE Frontiers Center for Brain Science, Fudan University, Shanghai 200032, China
- Shanghai AI Laboratory, Shanghai 200232, China
- School of Mathematical Sciences and Shanghai Center for Mathematical Sciences, Fudan University, Shanghai 200433, China
| | - Siqi Sun
- Research Institute of Intelligent Complex Systems, Fudan University, Shanghai 200433, China
- Shanghai AI Laboratory, Shanghai 200232, China
| |
|