1. Tremmel R, Honore A, Park Y, Zhou Y, Xiao M, Lauschke VM. Machine learning models for pharmacogenomic variant effect predictions - recent developments and future frontiers. Pharmacogenomics 2025:1-12. [PMID: 40401639 DOI: 10.1080/14622416.2025.2504863] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2025] [Accepted: 05/08/2025] [Indexed: 05/23/2025] Open
Abstract
Pharmacogenomic variation in genes involved in drug disposition and in drug targets is a major determinant of inter-individual differences in drug response and toxicity. While the effects of common variants are well established, millions of rare variants remain functionally uncharacterized, posing a challenge for the implementation of precision medicine. Recent advances in machine learning (ML) have significantly enhanced the prediction of variant effects by considering DNA and protein sequences together with their evolutionary conservation and haplotype structures. Emerging deep learning models utilize techniques to capture evolutionary conservation and biophysical properties, and ensemble approaches that integrate multiple predictive models exhibit increased accuracy, robustness, and interpretability. This review explores the current landscape of ML-based variant effect predictors. We discuss key methodological differences and highlight their strengths and limitations for pharmacogenomic applications. We furthermore discuss emerging methodologies for the prediction of substrate specificity and for the consideration of variant epistasis. Combined, these tools improve the functional effect prediction of drug-related variants and offer a viable strategy that could in the foreseeable future translate comprehensive genomic information into pharmacogenetic recommendations.
Affiliation(s)
- Roman Tremmel
  - Dr Margarete Fischer-Bosch Institute of Clinical Pharmacology, Stuttgart, Germany
  - University of Tübingen, Tübingen, Germany
- Antoine Honore
  - Division of Information Science and Engineering, KTH Royal Institute of Technology, Stockholm, Sweden
- Yoomi Park
  - Department of Physiology and Pharmacology and Center for Molecular Medicine, Karolinska Institutet and University Hospital, Stockholm, Sweden
  - Medical Research Center, Seoul National University College of Medicine, Seoul, South Korea
- Yitian Zhou
  - Department of Physiology and Pharmacology and Center for Molecular Medicine, Karolinska Institutet and University Hospital, Stockholm, Sweden
- Ming Xiao
  - Division of Information Science and Engineering, KTH Royal Institute of Technology, Stockholm, Sweden
- Volker M Lauschke
  - Dr Margarete Fischer-Bosch Institute of Clinical Pharmacology, Stuttgart, Germany
  - University of Tübingen, Tübingen, Germany
  - Department of Physiology and Pharmacology and Center for Molecular Medicine, Karolinska Institutet and University Hospital, Stockholm, Sweden
  - Department of Pharmacy, The Second Xiangya Hospital, Central South University, Changsha, China
2. Tekpinar M, David L, Henry T, Carbone A. PRESCOTT: a population aware, epistatic, and structural model accurately predicts missense effects. Genome Biol 2025; 26:113. [PMID: 40329382 PMCID: PMC12054230 DOI: 10.1186/s13059-025-03581-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2024] [Accepted: 04/17/2025] [Indexed: 05/08/2025] Open
Abstract
Predicting the functional impact of point mutations is a critical challenge in genomics. PRESCOTT reconstructs complete mutational landscapes, identifies mutation-sensitive regions, and categorizes missense variants as benign, pathogenic, or variants of uncertain significance. Leveraging protein sequences, structural models, and population-specific allele frequencies, PRESCOTT surpasses existing methods in classifying ClinVar variants, the ACMG dataset, and over 1800 proteins from the Human Protein Dataset. Its online server facilitates mutation effect predictions for any protein and variant, and includes a database of over 19,000 human proteins, ready for population-specific analyses. Open access to residue-specific scores offers transparency and valuable insights for genomic medicine.
Affiliation(s)
- Mustafa Tekpinar
  - Department of Computational, Quantitative and Synthetic Biology (CQSB), Sorbonne Université, CNRS, IBPS, UMR 7238, Paris, 75005, France
- Laurent David
  - Department of Computational, Quantitative and Synthetic Biology (CQSB), Sorbonne Université, CNRS, IBPS, UMR 7238, Paris, 75005, France
- Thomas Henry
  - Centre International de Recherche en Infectiologie (CIRI), Inserm U1111, Université Claude Bernard Lyon 1, CNRS, UMR5308, ENS de Lyon, Univ Lyon, Lyon, 69007, France
- Alessandra Carbone
  - Department of Computational, Quantitative and Synthetic Biology (CQSB), Sorbonne Université, CNRS, IBPS, UMR 7238, Paris, 75005, France
  - Institut Universitaire de France (IUF), Paris, France
3. Tan Y, Zhou B, Zheng L, Fan G, Hong L. Semantical and geometrical protein encoding toward enhanced bioactivity and thermostability. eLife 2025; 13:RP98033. [PMID: 40314227 PMCID: PMC12048155 DOI: 10.7554/elife.98033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2025] Open
Abstract
Protein engineering is a pivotal aspect of synthetic biology, involving the modification of amino acids within existing protein sequences to achieve novel or enhanced functionalities and physical properties. Accurate prediction of protein variant effects requires a thorough understanding of protein sequence, structure, and function. Deep learning methods have demonstrated remarkable performance in guiding protein modification for improved functionality. However, existing approaches predominantly rely on protein sequences, which face challenges in efficiently encoding the geometric aspects of amino acids' local environment and often fall short in capturing crucial details related to protein folding stability, internal molecular interactions, and bio-functions. Furthermore, a fundamental evaluation of how well developed methods predict protein thermostability is lacking, although thermostability is a key physical property that is frequently investigated in practice. To address these challenges, this article introduces a novel pre-training framework that integrates sequential and geometric encoders for protein primary and tertiary structures. This framework guides mutation directions toward desired traits by simulating natural selection on wild-type proteins and evaluates variant effects based on their fitness to perform specific functions. We assess the proposed approach using three benchmarks comprising over 300 deep mutational scanning assays. The prediction results showcase exceptional performance across extensive experiments compared to other zero-shot learning methods, all while maintaining a minimal cost in terms of trainable parameters. This study not only proposes an effective framework for more accurate and comprehensive predictions to facilitate efficient protein engineering, but also enhances the in silico assessment system for future deep learning models to better align with empirical requirements.
The PyTorch implementation is available at https://github.com/ai4protein/ProtSSN.
Affiliation(s)
- Yang Tan
  - Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, China
- Bingxin Zhou
  - Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Jiao Tong University, Institute of Natural Sciences, Shanghai, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai Jiao Tong University, Shanghai, China
- Lirong Zheng
  - Shanghai Jiao Tong University, Institute of Natural Sciences, Shanghai, China
- Guisheng Fan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
- Liang Hong
  - Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, China
  - Shanghai Jiao Tong University, Institute of Natural Sciences, Shanghai, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai Jiao Tong University, Shanghai, China
4. Biswas A, Choudhuri I, Huang K, Sun Q, Sali A, Echeverria I, Haldane A, Levy RM, Lyumkis D. Evolutionary Sequence and Structural Basis for the Epistatic Origins of Drug Resistance in HIV. bioRxiv 2025:2025.04.30.651576. [PMID: 40364913 PMCID: PMC12073831 DOI: 10.1101/2025.04.30.651576] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/15/2025]
Abstract
The emergence of drug resistance in the human immunodeficiency virus (HIV) remains a formidable challenge to the long-term efficacy of antiretroviral therapy (ART). A growing body of evidence highlights the critical role of epistasis, the dependence of mutational effects on the sequence context, in shaping the fitness landscape of HIV under ART-induced selection pressure. However, the biophysical origins of the epistatic interactions involved in engendering drug-resistance mutations (DRMs) remain unclear. Are the mutational correlations "intrinsic" to the properties of the protein, or do they arise because of drug binding? We use a Potts sequence-covariation statistical energy model built on patient-derived HIV-1 protein sequences to construct computational double mutant cycles that probe pairwise epistasis for all observed mutations across the three major HIV drug-target enzymes. We find that the strongest epistatic effects occur between mutations at residue positions that frequently mutate during the course of ART, termed resistance-associated positions. To investigate the structural origins of the strongest epistatic interactions, we perform ∼100 free energy perturbation molecular dynamics simulations, revealing that the primary contribution to the pairwise epistasis between DRMs arises from cooperative effects on protein stability and folding as an intrinsic consequence of the protein mutational landscape. The results collectively reinforce a mechanism of resistance evolution whereby viruses escape drug pressure by selectively engendering mutations at "intrinsically" coupled sites, allowing them to cooperatively ameliorate fitness detriments incurred by individual DRMs.
Significance: Epistasis refers to the phenomenon where the effect of a mutation on protein structure and function is dependent on the genetic sequence background of the mutation, resulting in the combined effect of mutations being non-additive. Epistasis plays a significant role in the evolution of drug resistance in viruses such as HIV under therapeutic selection pressure. We combine a protein sequence coevolutionary model and molecular dynamics free energy simulations to identify and probe the mechanistic origins of the strongest epistatic interactions connecting HIV drug-resistance mutations. The work establishes a foundation to probe the molecular bases of epistasis and predict the evolution of resistance predicated on the knowledge of epistatic interaction networks.
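The computational double mutant cycle mentioned in this abstract can be illustrated with a small sketch. This is not the authors' code: the energy convention E(S) = Σ h_i(S_i) + Σ_{i<j} J_ij(S_i, S_j), the function names `potts_energy` and `double_mutant_cycle`, and the random toy parameters are all assumptions made for illustration. Under that convention, pairwise epistasis is ΔΔE = E(double) − E(single at i) − E(single at j) + E(wild type).

```python
import numpy as np

def potts_energy(seq, h, J):
    """Statistical energy of a sequence under an assumed Potts convention:
    E(S) = sum_i h[i, S_i] + sum_{i<j} J[i, j, S_i, S_j]."""
    L = len(seq)
    fields = sum(h[i, seq[i]] for i in range(L))
    couplings = sum(J[i, j, seq[i], seq[j]]
                    for i in range(L) for j in range(i + 1, L))
    return float(fields + couplings)

def double_mutant_cycle(wt, i, a, j, b, h, J):
    """Pairwise epistasis between mutation a at site i and b at site j (i < j):
    ddE = E(double) - E(single_i) - E(single_j) + E(wt)."""
    single_i = list(wt); single_i[i] = a
    single_j = list(wt); single_j[j] = b
    double = list(wt); double[i] = a; double[j] = b
    return (potts_energy(double, h, J) - potts_energy(single_i, h, J)
            - potts_energy(single_j, h, J) + potts_energy(wt, h, J))

# Toy parameters (hypothetical): 5 positions, 4-letter alphabet.
rng = np.random.default_rng(0)
L, q = 5, 4
h = rng.normal(size=(L, q))
J = rng.normal(size=(L, L, q, q))   # only entries with i < j are used
wt = [0, 1, 2, 3, 0]

ddE = double_mutant_cycle(wt, i=1, a=3, j=3, b=0, h=h, J=J)
print(ddE)
```

In a purely pairwise model all field terms and all couplings to third sites cancel in the cycle, so ΔΔE depends only on the direct coupling block between the two mutated positions.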
5. Wu J, Qiu Y, Lyashenko E, Torregrosa T, Pfister EL, Ryan MJ, Mueller C, Choudhury SR. Prediction of Adeno-Associated Virus Fitness with a Protein Language-Based Machine Learning Model. Hum Gene Ther 2025; 36:823-829. [PMID: 40241334 DOI: 10.1089/hum.2024.227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/18/2025] Open
Abstract
Adeno-associated virus (AAV)-based therapeutics have the potential to transform the lives of patients by delivering one-time treatments for a variety of diseases. However, a critical challenge to their widespread adoption and distribution is the high cost of goods. Reducing manufacturing costs by developing AAV capsids with improved yield, or fitness, is key to making gene therapies more affordable. AAV fitness is largely determined by the amino acid sequence of the capsid; however, engineered AAVs are rarely optimized for manufacturability. Here, we report a state-of-the-art machine learning (ML) model that predicts the fitness of AAV2 capsid mutants based on the amino acid sequence of the capsid monomer. By combining a protein language model (PLM) and classical ML techniques, our model achieved a significantly high prediction accuracy (Pearson correlation = 0.818) for capsid fitness. Importantly, tests on completely independent datasets showed robustness and generalizability of our model, even for multimutant AAV capsids. Our accurate ML-based model can be used as a surrogate for laborious in vitro experiments, thus saving time and resources, and can be deployed to increase the fitness of clinical AAV capsids to make gene therapies economically viable for patients.
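As a rough illustration of the general pattern described here (fixed sequence embeddings fed to a classical regressor and scored by Pearson correlation), the sketch below uses closed-form ridge regression on synthetic data. Everything in it is an assumption: the abstract does not say which PLM or which classical method was used, the random matrix `X` merely stands in for per-sequence PLM embeddings, and the "fitness" values are simulated.

```python
import numpy as np

def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge weights for centered inputs (no intercept term)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)

def pearson_r(a, b):
    """Pearson correlation coefficient between two 1-D arrays."""
    a = np.asarray(a, dtype=float) - np.mean(a)
    b = np.asarray(b, dtype=float) - np.mean(b)
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

# Synthetic stand-ins (hypothetical): 200 variants, 16-dimensional embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))               # placeholder for PLM embeddings
w_true = rng.normal(size=16)
y = X @ w_true + 0.5 * rng.normal(size=200)  # simulated fitness values

# Hold out the last 50 variants for evaluation.
X_train, X_test = X[:150], X[150:]
y_train, y_test = y[:150], y[150:]

# Center with training statistics only, fit, then predict on held-out data.
mu_x, mu_y = X_train.mean(axis=0), y_train.mean()
w = ridge_fit(X_train - mu_x, y_train - mu_y)
y_pred = (X_test - mu_x) @ w + mu_y

r = pearson_r(y_test, y_pred)
print(f"held-out Pearson r = {r:.3f}")
```

Scoring on held-out variants rather than on the training set is what makes the correlation a measure of generalization, which is the property the abstract emphasizes.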
Affiliation(s)
- Jason Wu
  - Genomic Medicine Unit, Sanofi, Waltham, Massachusetts, USA
- Yu Qiu
  - Large Molecule Research, Sanofi, Cambridge, Massachusetts, USA
- Michael J Ryan
  - Genomic Medicine Unit, Sanofi, Waltham, Massachusetts, USA
6. Komp E, Phillips C, Lee LM, Fallin SM, Alanzi HN, Zorman M, McCully ME, Beck DAC. Neural network conditioned to produce thermophilic protein sequences can increase thermal stability. Sci Rep 2025; 15:14124. [PMID: 40268970 PMCID: PMC12019596 DOI: 10.1038/s41598-025-90828-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2024] [Accepted: 02/17/2025] [Indexed: 04/25/2025] Open
Abstract
This work presents Neural Optimization for Melting-temperature Enabled by Leveraging Translation (NOMELT), a novel approach for designing and ranking high-temperature stable proteins using neural machine translation. The model, trained on over 4 million protein homologous pairs from organisms adapted to different temperatures, demonstrates promising capability in targeting thermal stability. A designed variant of the Drosophila melanogaster Engrailed Homeodomain shows a melting temperature increase of 15.5 K. Furthermore, NOMELT achieves zero-shot predictive capabilities in ranking experimental melting and half-activation temperatures across a number of protein families. Unlike existing zero-shot predictors, it achieves this without requiring extensive homology data or massive training datasets, because it specifically learns thermophilicity rather than all natural variation. These findings underscore the potential of leveraging organismal growth temperatures in context-dependent design of proteins for enhanced thermal stability.
Affiliation(s)
- Evan Komp
  - Chemical Engineering, University of Washington, Seattle, WA, USA
- Lauren M Lee
  - Department of Biology, Santa Clara University, Santa Clara, CA, USA
- Shayna M Fallin
  - Department of Biology, Santa Clara University, Santa Clara, CA, USA
- Humood N Alanzi
  - Chemical Engineering, University of Washington, Seattle, WA, USA
- Marlo Zorman
  - Chemical Engineering, University of Washington, Seattle, WA, USA
- David A C Beck
  - Chemical Engineering, University of Washington, Seattle, WA, USA
  - eScience Institute, University of Washington, Seattle, WA, USA
  - Computer Science, University of Washington, Seattle, WA, USA
7. Sandhu M, Chen JZ, Matthews DS, Spence MA, Pulsford SB, Gall B, Kaczmarski JA, Nichols J, Tokuriki N, Jackson CJ. Computational and Experimental Exploration of Protein Fitness Landscapes: Navigating Smooth and Rugged Terrains. Biochemistry 2025; 64:1673-1684. [PMID: 40132127 DOI: 10.1021/acs.biochem.4c00673] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/27/2025]
Abstract
Proteins evolve through complex sequence spaces, with fitness landscapes serving as a conceptual framework that links sequence to function. Fitness landscapes can be smooth, where multiple similarly accessible evolutionary paths are available, or rugged, where the presence of multiple local fitness optima complicates evolution and prediction. Indeed, many proteins, especially those with complex functions or under multiple selection pressures, exist on rugged fitness landscapes. Here we discuss the theoretical framework that underpins our understanding of fitness landscapes, alongside recent work that has advanced our understanding, particularly of the biophysical basis for smoothness versus ruggedness. Finally, we address the rapid advances that have been made in computational and experimental exploration and exploitation of fitness landscapes, and how these can identify efficient routes to protein optimization.
Affiliation(s)
- Mahakaran Sandhu
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- John Z Chen
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence in Synthetic Biology, Research School of Biology, Australian National University, Canberra ACT 2601, Australia
- Dana S Matthews
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- Matthew A Spence
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- Sacha B Pulsford
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- Barnabas Gall
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- Joe A Kaczmarski
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence in Synthetic Biology, Research School of Biology, Australian National University, Canberra ACT 2601, Australia
- James Nichols
  - Biological Data Science Institute, Australian National University, Canberra ACT 2601, Australia
- Nobuhiko Tokuriki
  - Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
- Colin J Jackson
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - Biological Data Science Institute, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence in Synthetic Biology, Research School of Biology, Australian National University, Canberra ACT 2601, Australia
8. Livesey BJ, Badonyi M, Dias M, Frazer J, Kumar S, Lindorff-Larsen K, McCandlish DM, Orenbuch R, Shearer CA, Muffley L, Foreman J, Glazer AM, Lehner B, Marks DS, Roth FP, Rubin AF, Starita LM, Marsh JA. Guidelines for releasing a variant effect predictor. Genome Biol 2025; 26:97. [PMID: 40234898 PMCID: PMC11998465 DOI: 10.1186/s13059-025-03572-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2024] [Accepted: 04/08/2025] [Indexed: 04/17/2025] Open
Abstract
Computational methods for assessing the likely impacts of mutations, known as variant effect predictors (VEPs), are widely used in the assessment and interpretation of human genetic variation, as well as in other applications like protein engineering. Many different VEPs have been released, and there is tremendous variability in their underlying algorithms, outputs, and the ways in which the methodologies and predictions are shared. This leads to considerable difficulties for users trying to navigate the selection and application of VEPs. Here, to address these issues, we provide guidelines and recommendations for the release of novel VEPs.
Affiliation(s)
- Benjamin J Livesey
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Mihaly Badonyi
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Mafalda Dias
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Jonathan Frazer
  - Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Sushant Kumar
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, ON, Canada
- Kresten Lindorff-Larsen
  - Department of Biology, Linderstrøm-Lang Centre for Protein Science, University of Copenhagen, Copenhagen, Denmark
- David M McCandlish
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, NY, USA
- Rose Orenbuch
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Lara Muffley
  - Department of Genome Sciences, University of Washington and the Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
- Julia Foreman
  - European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Ben Lehner
  - Wellcome Sanger Institute, Cambridge, UK
  - Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - Institució Catalana de Recerca I Estudis Avançats (ICREA), Barcelona, Spain
- Debora S Marks
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Broad Institute of MIT and Harvard, Boston, MA, USA
- Frederick P Roth
  - Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Alan F Rubin
  - Bioinformatics Division, Walter and Eliza Hall Institute of Medical Research, Parkville, Australia
  - Department of Medical Biology, University of Melbourne, Parkville, Australia
- Lea M Starita
  - Department of Genome Sciences, University of Washington and the Brotman Baty Institute for Precision Medicine, Seattle, WA, USA
- Joseph A Marsh
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
9. Yadalam PK, Ardila CM. Deep Neural Networks Based on Sp7 Protein Sequence Prediction in Peri-Implant Bone Formation. Int J Dent 2025; 2025:7583275. [PMID: 40231202 PMCID: PMC11996267 DOI: 10.1155/ijod/7583275] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2024] [Accepted: 03/15/2025] [Indexed: 04/16/2025] Open
Abstract
Objective: Peri-implant bone regeneration is crucial for dental implant success, particularly in managing peri-implantitis, which causes inflammation and bone loss. SP7 (Osterix) is vital for osteoblast differentiation and bone matrix formation. Advances in deep neural networks (DNNs) offer new ways to analyze protein sequences, potentially improving our understanding of SP7's role in bone formation. This study aims to develop and utilize DNNs to predict the SP7 protein sequence and understand its role in peri-implant bone formation. Materials and Methods: Sp7 sequences in FASTA format were retrieved from UniProt under IDs Q8TDD2 and Q9V3Z2. These sequences were located, and their quality was assessed. We built an architecture that can handle a wide range of input sequences using a DNN technique, with computing needs based on the length of the input sequences. Results: Protein sequences were analyzed using a DNN architecture with the Adam optimizer over 50 epochs, achieving a sensitivity of 0.89 and a specificity of 0.82. The receiver operating characteristic (ROC) curve demonstrated high true-positive rates and low false-positive rates, indicating robust model performance. Precision-recall analysis underscored the model's effectiveness in handling imbalanced data, with significant area under the curve (AUC-PR). Epoch plots highlighted consistent model accuracy throughout training, confirming its reliability for protein sequence analysis. Conclusion: The DNN employed with the Adam optimizer demonstrated robust performance in analyzing protein sequences, achieving an accuracy of 0.85 and high sensitivity and specificity. The ROC curve highlighted the model's effectiveness in distinguishing true positives from false positives, which is essential for reliable protein classification.
These findings suggest that the developed model is promising for enhancing predictive capabilities in computational biology and biomedical research, particularly in protein function prediction and therapeutic development applications.
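For readers less familiar with the metrics quoted in this abstract, the following sketch shows how sensitivity, specificity, and accuracy follow from a binary confusion matrix. The counts are made-up toy values and are not taken from the study; the function name `confusion_metrics` is likewise hypothetical.

```python
def confusion_metrics(tp, fn, tn, fp):
    """Return (sensitivity, specificity, accuracy) from confusion-matrix counts."""
    sensitivity = tp / (tp + fn)            # true-positive rate
    specificity = tn / (tn + fp)            # true-negative rate
    accuracy = (tp + tn) / (tp + fn + tn + fp)
    return sensitivity, specificity, accuracy

# Toy counts (hypothetical): 50 positives and 50 negatives.
sens, spec, acc = confusion_metrics(tp=40, fn=10, tn=45, fp=5)

# Accuracy is the prevalence-weighted average of sensitivity and specificity.
prevalence = (40 + 10) / 100
weighted = prevalence * sens + (1 - prevalence) * spec

print(sens, spec, acc, weighted)
```

Because accuracy is a weighted average of the other two quantities, it always lies between them, with the weight set by the fraction of positive examples in the evaluation set.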
Affiliation(s)
- Pradeep Kumar Yadalam
  - Department of Periodontics, Saveetha Dental College, SIMATS, Saveetha University, Chennai, Tamil Nadu, India
- Carlos M. Ardila
  - Department of Periodontics, Saveetha Dental College, SIMATS, Saveetha University, Chennai, Tamil Nadu, India
  - Department of Basic Sciences, Biomedical Stomatology Research Group, Faculty of Dentistry, University of Antioquia, Medellín, Colombia
10. Ortega FM, Hossain F, Volobouev VV, Meloni G, Torabifard H, Morcos F. Generative Landscapes and Dynamics to Design Multidomain Artificial Transmembrane Transporters. bioRxiv 2025:2025.03.28.645293. [PMID: 40236216 PMCID: PMC11996383 DOI: 10.1101/2025.03.28.645293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2025]
Abstract
Protein design is challenging as it requires simultaneous consideration of interconnected factors, such as fold, dynamics, and function. These evolutionary constraints are encoded in protein sequences and can be learned through the latent generative landscape (LGL) framework to predict functional sequences by leveraging evolutionary patterns, enabling exploration of uncharted sequence space. By simulating designed proteins through molecular dynamics (MD), we gain deeper insights into the interdependencies governing structure and dynamics. We present a synergized workflow combining LGL with MD and biochemical characterization, allowing us to explore the sequence space effectively. This approach has been applied to design and characterize two artificial multidomain ATP-driven transmembrane copper transporters, with native-like functionality. This integrative approach proved effective in unraveling the intricate relationships between sequence, structure, and function.
11. Benegas G, Ye C, Albors C, Li JC, Song YS. Genomic language models: opportunities and challenges. Trends Genet 2025; 41:286-302. [PMID: 39753409 DOI: 10.1016/j.tig.2024.11.013] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/16/2024] [Revised: 11/21/2024] [Accepted: 11/21/2024] [Indexed: 04/10/2025]
Abstract
Large language models (LLMs) are having transformative impacts across a wide range of scientific fields, particularly in the biomedical sciences. Just as the goal of natural language processing is to understand sequences of words, a major objective in biology is to understand biological sequences. Genomic language models (gLMs), which are LLMs trained on DNA sequences, have the potential to significantly advance our understanding of genomes and how DNA elements at various scales interact to give rise to complex functions. To showcase this potential, we highlight key applications of gLMs, including functional constraint prediction, sequence design, and transfer learning. Despite notable recent progress, however, developing effective and efficient gLMs presents numerous challenges, especially for species with large, complex genomes. Here, we discuss major considerations for developing and evaluating gLMs.
Affiliation(s)
- Gonzalo Benegas
  - Computer Science Division, University of California, Berkeley, CA, USA
- Chengzhong Ye
  - Department of Statistics, University of California, Berkeley, CA, USA
- Carlos Albors
  - Computer Science Division, University of California, Berkeley, CA, USA
- Jianan Canal Li
  - Computer Science Division, University of California, Berkeley, CA, USA
- Yun S Song
  - Computer Science Division, University of California, Berkeley, CA, USA
  - Department of Statistics, University of California, Berkeley, CA, USA
  - Center for Computational Biology, University of California, Berkeley, CA, USA
12. Guo F, Guan R, Li Y, Liu Q, Wang X, Yang C, Wang J. Foundation models in bioinformatics. Natl Sci Rev 2025; 12:nwaf028. [PMID: 40078374 PMCID: PMC11900445 DOI: 10.1093/nsr/nwaf028] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/30/2024] [Revised: 12/17/2024] [Accepted: 01/08/2025] [Indexed: 03/14/2025] Open
Abstract
With the adoption of foundation models (FMs), artificial intelligence (AI) has become increasingly significant in bioinformatics and has successfully addressed many historical challenges, such as pre-training frameworks, model evaluation and interpretability. FMs demonstrate notable proficiency in managing large-scale, unlabeled datasets, which matters because experimental procedures are costly and labor intensive. In various downstream tasks, FMs have consistently achieved noteworthy results, demonstrating high levels of accuracy in representing biological entities. The application of FMs has ushered in a new era in computational biology, focusing on both general and specific biological issues. In this review, we introduce recent advancements in bioinformatics FMs employed in a variety of downstream tasks, including genomics, transcriptomics, proteomics, drug discovery and single-cell analysis. Our aim is to assist scientists in selecting appropriate FMs in bioinformatics, according to four model types: language FMs, vision FMs, graph FMs and multimodal FMs. In addition to understanding molecular landscapes, AI technology can establish the theoretical and practical foundation for continued innovation in molecular biology.
Affiliation(s)
- Fei Guo
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Xiangjiang Laboratory, Changsha 410083, China
| | - Renchu Guan
- Key Laboratory for Symbol Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
| | - Yaohang Li
- Department of Computer Science, Old Dominion University, Norfolk 23529, USA
| | - Qi Liu
- School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
| | - Xiaowo Wang
- Department of Automation, Tsinghua University, Beijing 100084, China
| | - Can Yang
- Department of Mathematics, State Key Laboratory of Molecular Neuroscience, and Big Data Bio-Intelligence Lab, The Hong Kong University of Science and Technology, Hong Kong, China
| | - Jianxin Wang
- Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Xiangjiang Laboratory, Changsha 410083, China
| |
|
13
|
Bjerregaard A, Groth PM, Hauberg S, Krogh A, Boomsma W. Foundation models of protein sequences: A brief overview. Curr Opin Struct Biol 2025; 91:103004. [PMID: 39983412 DOI: 10.1016/j.sbi.2025.103004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2024] [Revised: 01/24/2025] [Accepted: 01/26/2025] [Indexed: 02/23/2025]
Abstract
Protein sequence models have evolved from simple statistics of aligned families to versatile foundation models of evolutionary scale. Enabled by self-supervised learning and an abundance of protein sequence data, such foundation models now play a central role in protein science. They facilitate rich representations, powerful generative design, and fine-tuning across diverse domains. In this review, we trace modeling developments and categorize them into methodological trends over the modalities they describe and the contexts they condition upon. Following a brief historical overview, we focus our attention on the most recent trends and outline future perspectives.
Affiliation(s)
- Andreas Bjerregaard
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
| | - Peter Mørch Groth
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Novonesis, Kgs. Lyngby, Denmark
| | - Søren Hauberg
- Section for Cognitive Systems, Technical University of Denmark, Kgs. Lyngby, Denmark
| | - Anders Krogh
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
| | - Wouter Boomsma
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark.
| |
|
14
|
Shukla D, Martin J, Morcos F, Potoyan DA. Thermal Adaptation of Cytosolic Malate Dehydrogenase Revealed by Deep Learning and Coevolutionary Analysis. J Chem Theory Comput 2025; 21:3277-3287. [PMID: 40079215 PMCID: PMC11948321 DOI: 10.1021/acs.jctc.4c01774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/31/2024] [Revised: 03/06/2025] [Accepted: 03/07/2025] [Indexed: 03/14/2025]
Abstract
Protein evolution has shaped enzymes that maintain stability and function across diverse thermal environments. While sequence variation, thermal stability and conformational dynamics are known to influence an enzyme's thermal adaptation, how these factors collectively govern stability and function across diverse temperatures remains unresolved. Cytosolic malate dehydrogenase (cMDH), a citric acid cycle enzyme, is an ideal model for studying these mechanisms due to its temperature-sensitive flexibility and broad presence in species from diverse thermal environments. In this study, we employ techniques inspired by deep learning and statistical mechanics to uncover how sequence variation and conformational dynamics shape patterns of cMDH's thermal adaptation. By integrating coevolutionary models with variational autoencoders (VAE), we generate a latent generative landscape (LGL) of the cMDH sequence space, enabling us to explore mutational pathways and predict fitness using direct coupling analysis (DCA). Structure predictions via AlphaFold and molecular dynamics simulations further illuminate how variations in hydrophobic interactions and conformational flexibility contribute to the thermal stability of warm- and cold-adapted cMDH orthologs. Notably, we identify the ratio of hydrophobic contacts between two regions as a predictive order parameter for thermal stability features, providing a quantitative metric for understanding cMDH dynamics across temperatures. The integrative computational framework employed in this study provides mechanistic insights into protein adaptation at both sequence and structural levels, offering unique perspectives on the evolution of thermal stability and creating avenues for the rational design of proteins with optimized thermal properties.
Affiliation(s)
- Divyanshu Shukla
- Bioinformatics and Computational Biology Program, Iowa State University, Ames, Iowa 50011, United States
| | - Jonathan Martin
- Department of Biological Sciences, UT Dallas, Richardson, TX 75080, United States
| | - Faruck Morcos
- Department of Biological Sciences, UT Dallas, Richardson, TX 75080, United States
- Departments of Bioengineering and Physics, UT Dallas, Richardson, TX 75080, United States
- Center for Systems Biology, UT Dallas, Richardson, TX 75080, United States
| | - Davit A. Potoyan
- Department of Chemistry, Iowa State University, Ames, Iowa 50011, United States
- Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa 50011, United States
- Bioinformatics and Computational Biology Program, Iowa State University, Ames, Iowa 50011, United States
| |
|
15
|
Thomas N, Belanger D, Xu C, Lee H, Hirano K, Iwai K, Polic V, Nyberg KD, Hoff KG, Frenz L, Emrich CA, Kim JW, Chavarha M, Ramanan A, Agresti JJ, Colwell LJ. Engineering highly active nuclease enzymes with machine learning and high-throughput screening. Cell Syst 2025; 16:101236. [PMID: 40081373 DOI: 10.1016/j.cels.2025.101236] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2024] [Revised: 09/17/2024] [Accepted: 02/19/2025] [Indexed: 03/16/2025]
Abstract
Optimizing enzymes to function in novel chemical environments is a central goal of synthetic biology, but optimization is often hindered by a rugged fitness landscape and costly experiments. In this work, we present TeleProt, a machine learning (ML) framework that blends evolutionary and experimental data to design diverse protein libraries, and employ it to improve the catalytic activity of a nuclease enzyme that degrades biofilms that accumulate on chronic wounds. After multiple rounds of high-throughput experiments, TeleProt found a significantly better top-performing enzyme than directed evolution (DE), had a better hit rate at finding diverse, high-activity variants, and was even able to design a high-performance initial library using no prior experimental data. We have released a dataset of 55,000 nuclease variants, one of the most extensive genotype-phenotype enzyme activity landscapes to date, to drive further progress in ML-guided design. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Neil Thomas
- X, the Moonshot Factory, Mountain View, CA 94043, USA.
| | - Jun W Kim
- X, the Moonshot Factory, Mountain View, CA 94043, USA
| | - Abi Ramanan
- X, the Moonshot Factory, Mountain View, CA 94043, USA
| | - Lucy J Colwell
- Google DeepMind, Cambridge, MA 02142, USA; Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, UK.
| |
|
16
|
Orenbuch R, Shearer CA, Kollasch AW, Spinner HD, Hopf TA, van Niekerk L, Franceschi D, Dias M, Frazer J, Marks DS. Proteome-wide model for human disease genetics. medRxiv 2025:2023.11.27.23299062. [PMID: 38076790 PMCID: PMC10705666 DOI: 10.1101/2023.11.27.23299062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/19/2023]
Abstract
Identifying variants driving disease accelerates both genetic diagnosis and therapeutic development, but missense variants still present a bottleneck as their effects are less straightforward than truncations or nonsense mutations. While computational prediction methods are sufficiently accurate to be of clinical value for variants in known disease genes, they do not generalize well to other genes as the scores are not calibrated across the proteome [1-6]. To address this, we developed a deep generative model, popEVE, that combines evolutionary information with population sequence data [7] and achieves state-of-the-art performance on a suite of proteome-wide prediction tasks, without overestimating the prevalence of deleterious variants in the population. popEVE identifies 442 genes in a developmental disorder cohort [8], including evidence of 123 novel candidates, many without the need for cohort-wide enrichment. Candidate genes are functionally similar to known developmental disorder genes and case variants tend to fall in functionally important regions of these genes. Finally, we show that these findings can be reproduced from analysis of the patient exomes alone, demonstrating that popEVE provides a new avenue for genetic analysis in situations where traditional methods fail, including genetic diagnosis of rare-as-one diseases, even in the absence of parent sequencing.
|
17
|
Ligero M, El Nahhas OSM, Aldea M, Kather JN. Artificial intelligence-based biomarkers for treatment decisions in oncology. Trends Cancer 2025; 11:232-244. [PMID: 39814650 DOI: 10.1016/j.trecan.2024.12.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2024] [Revised: 11/29/2024] [Accepted: 12/02/2024] [Indexed: 01/18/2025]
Abstract
The development of new therapeutic strategies such as immune checkpoint inhibitors (ICIs) and targeted therapies has increased the complexity of the treatment landscape for solid tumors. At the current rate of annual FDA approvals, the potential treatment options could increase by tenfold over the next 5 years. The cost of personalized medicine technologies limits its accessibility, thus increasing socioeconomic disparities in the treated population. In this review we describe artificial intelligence (AI)-based solutions - including deep learning (DL) methods for routine medical imaging and large language models (LLMs) for electronic health records (EHRs) - to support cancer treatment decisions with cost-effective biomarkers. We address the current limitations of these technologies and propose the next steps towards their adoption in routine clinical practice.
Affiliation(s)
- Marta Ligero
- Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Dresden University of Technology (TUD), Dresden, Germany
| | - Omar S M El Nahhas
- Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Dresden University of Technology (TUD), Dresden, Germany
| | - Mihaela Aldea
- Department of Cancer Medicine, Institut Gustave Roussy, Université Paris-Saclay, F-94805, Villejuif, France; Thoracic Oncology, Dana Farber Cancer Institute, Boston, MA, USA
| | - Jakob Nikolas Kather
- Else Kroener Fresenius Center for Digital Health, Medical Faculty Carl Gustav Carus, Dresden University of Technology (TUD), Dresden, Germany; Department of Medicine I, University Hospital Dresden, Dresden, Germany; Medical Oncology, National Center for Tumor Diseases (NCT), University Hospital Heidelberg, Heidelberg, Germany.
| |
|
18
|
Sun Y, Shen Y. Structure-informed protein language models are robust predictors for variant effects. Hum Genet 2025; 144:209-225. [PMID: 39117802 PMCID: PMC12068927 DOI: 10.1007/s00439-024-02695-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/31/2023] [Accepted: 07/20/2024] [Indexed: 08/10/2024]
Abstract
Protein language models (pLMs), an emerging class of variant effect predictors, learn the evolutionary distribution of functional sequences to capture the fitness landscape. Considering that variant effects are manifested through biological contexts beyond sequence (such as structure), we first assess how much structural context is learned by sequence-only pLMs and how it affects variant effect prediction. We then establish the need to inject protein structural context into pLMs purposely and controllably. We thus introduce a framework of structure-informed pLMs (SI-pLMs), which extends masked sequence denoising to cross-modality denoising for both sequence and structure. Numerical results over deep mutagenesis scanning benchmarks show that our SI-pLMs, even when using smaller models and less data, are robustly top performers against competing methods including other pLMs, which shows that introducing biological context can be more effective at capturing the fitness landscape than simply using larger models or bigger data. Case studies reveal that, compared to sequence-only pLMs, SI-pLMs can be better at capturing the fitness landscape because (a) learned embeddings of low- and high-fitness sequences can be more separable and (b) learned amino-acid distributions of functionally and evolutionarily conserved residues can be of much lower entropy, and thus much more conserved, than those of other residues. Our SI-pLMs can be applied to revise any sequence-only pLM through model architecture and training objectives. They do not require structure data as model inputs for variant effect prediction and only use structures as a context provider and model regularizer during training.
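The entropy comparison in point (b) of the abstract above can be illustrated with a minimal sketch. It assumes a model outputs, for each position, a probability distribution over the 20 amino acids; the two distributions below are made-up numbers for illustration and are not taken from the paper, and `shannon_entropy` is a hypothetical helper name.

```python
import math


def shannon_entropy(probs):
    """Return the Shannon entropy (in bits) of a discrete distribution.

    The input is renormalized so that small rounding errors in the
    probabilities do not matter; zero-probability entries contribute 0.
    """
    total = sum(probs)
    if total <= 0:
        raise ValueError("distribution must have positive total mass")
    entropy = 0.0
    for p in probs:
        if p > 0:
            q = p / total
            entropy -= q * math.log2(q)
    return entropy


# Hypothetical per-position distributions over the 20 amino acids.
# A conserved position puts most of its mass on one residue ...
conserved_position = [0.905] + [0.005] * 19
# ... while an unconstrained position is close to uniform.
variable_position = [0.05] * 20

print(f"conserved: {shannon_entropy(conserved_position):.3f} bits")
print(f"variable:  {shannon_entropy(variable_position):.3f} bits")
```

A lower entropy means the predicted distribution is more peaked, which is the sense in which the abstract describes conserved residues as "much more conserved"; the upper bound for 20 amino acids is log2(20) bits.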
Affiliation(s)
- Yuanfei Sun
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, 77843, Texas, USA
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, 77843, Texas, USA.
- Department of Computer Science and Engineering, Texas A&M University, College Station, 77843, Texas, USA.
- Institute of Biosciences and Technology and Department of Translational Medical Sciences, Texas A&M University, Houston, 77030, Texas, USA.
| |
|
19
|
Gu J, Mu W, Xu Y, Nie Y. From discovery to application: Enabling technology-based optimizing carbonyl reductases biocatalysis for active pharmaceutical ingredient synthesis. Biotechnol Adv 2025; 79:108496. [PMID: 39647674 DOI: 10.1016/j.biotechadv.2024.108496] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2024] [Revised: 10/04/2024] [Accepted: 11/30/2024] [Indexed: 12/10/2024]
Abstract
The catalytic conversion of chiral alcohols and corresponding carbonyl compounds by carbonyl reductases (alcohol dehydrogenases), which are NAD(P) or NAD(P)H-dependent oxidoreductases, has attracted considerable attention. However, existing carbonyl reductases are insufficient to meet the demands of diverse industrial applications; hence, new enzymes with functions that can expand the toolbox of biocatalysts are urgently required. Developing precisely controlled chiral biocatalysts is of great significance for the efficient development of a broad spectrum of active pharmaceutical ingredients via biosynthesis. In this review, we summarized methods for discovering novel natural carbonyl reductases from various perspectives. Furthermore, advances in protein engineering, utilizing known sequence and structural information as well as catalytic dynamics mechanisms to improve potential functions, are also addressed. The exponential growth in data-driven tools over the past decade has made it possible to de novo design carbonyl reductases. Additionally, various applications of these high-performance carbonyl reductases and different strategies for coenzyme regeneration involving photocatalysis during the reaction process were reviewed. These advancements will bring new opportunities and challenges to the fields of green chemistry and biosynthesis in the future.
Affiliation(s)
- Jie Gu
- Lab of Brewing Microbiology and Applied Enzymology, School of Biotechnology and Key laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi 214122, China; School of Food Science and Technology, Jiangnan University, Wuxi 214122, China
| | - Wanmeng Mu
- School of Food Science and Technology, Jiangnan University, Wuxi 214122, China; State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi 214122, China
| | - Yan Xu
- Lab of Brewing Microbiology and Applied Enzymology, School of Biotechnology and Key laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi 214122, China; State Key Laboratory of Food Science and Technology, Jiangnan University, Wuxi 214122, China
| | - Yao Nie
- Lab of Brewing Microbiology and Applied Enzymology, School of Biotechnology and Key laboratory of Industrial Biotechnology of Ministry of Education, Jiangnan University, Wuxi 214122, China.
| |
|
20
|
Ozkan S, Padilla N, de la Cruz X. QAFI: a novel method for quantitative estimation of missense variant impact using protein-specific predictors and ensemble learning. Hum Genet 2025; 144:191-208. [PMID: 39048855 PMCID: PMC11976337 DOI: 10.1007/s00439-024-02692-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/30/2024] [Accepted: 07/14/2024] [Indexed: 07/27/2024]
Abstract
Next-generation sequencing (NGS) has revolutionized genetic diagnostics, yet its application in precision medicine remains incomplete, despite significant advances in computational tools for variant annotation. Many variants remain unannotated, and existing tools often fail to accurately predict the range of impacts that variants have on protein function. This limitation restricts their utility in relevant applications such as predicting disease severity and onset age. In response to these challenges, a new generation of computational models is emerging, aimed at producing quantitative predictions of genetic variant impacts. However, the field is still in its early stages, and several issues need to be addressed, including improved performance and better interpretability. This study introduces QAFI, a novel methodology that integrates protein-specific regression models within an ensemble learning framework, utilizing conservation-based and structure-related features derived from AlphaFold models. Our findings indicate that QAFI significantly enhances the accuracy of quantitative predictions across various proteins. The approach has been rigorously validated through its application in the CAGI6 contest, focusing on ARSA protein variants, and further tested on a comprehensive set of clinically labeled variants, demonstrating its generalizability and robust predictive power. The straightforward nature of our models may also contribute to better interpretability of the results.
Affiliation(s)
- Selen Ozkan
- Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Natàlia Padilla
- Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain
| | - Xavier de la Cruz
- Research Unit in Clinical and Translational Bioinformatics, Vall d'Hebron Institute of Research (VHIR), Universitat Autònoma de Barcelona, Barcelona, Spain.
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain.
| |
|
21
|
Zhang J, Kinch L, Katsonis P, Lichtarge O, Jagota M, Song YS, Sun Y, Shen Y, Kuru N, Dereli O, Adebali O, Alladin MA, Pal D, Capriotti E, Turina MP, Savojardo C, Martelli PL, Babbi G, Casadio R, Pucci F, Rooman M, Cia G, Tsishyn M, Strokach A, Hu Z, van Loggerenberg W, Roth FP, Radivojac P, Brenner SE, Cong Q, Grishin NV. Assessing predictions on fitness effects of missense variants in HMBS in CAGI6. Hum Genet 2025; 144:173-189. [PMID: 39110250 PMCID: PMC12085147 DOI: 10.1007/s00439-024-02680-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2023] [Accepted: 05/17/2024] [Indexed: 02/21/2025]
Abstract
This paper presents an evaluation of predictions submitted for the "HMBS" challenge, a component of the sixth round of the Critical Assessment of Genome Interpretation held in 2021. The challenge required participants to predict the effects of missense variants of the human HMBS gene on yeast growth. The HMBS enzyme, critical for the biosynthesis of heme in eukaryotic cells, is highly conserved among eukaryotes. Despite the application of a variety of algorithms and methods, the performance of predictors was relatively similar, with Kendall's tau correlation coefficients between predictions and experimental scores around 0.3 for a majority of submissions. Notably, the median correlation (≥ 0.34) observed among these predictors, especially the top predictions from different groups, was greater than the correlation observed between their predictions and the actual experimental results. Most predictors were moderately successful in distinguishing between deleterious and benign variants, as evidenced by an area under the receiver operating characteristic (ROC) curve (AUC) of approximately 0.7. Compared with the two most recent rounds of CAGI competitions, we noticed that more predictors outperformed the baseline predictor, which is based solely on amino acid frequencies. Nevertheless, the overall accuracy of predictions still falls far short of the positive control, which is derived from experimental scores, indicating the necessity for considerable improvements in the field. The most inaccurately predicted variants in this round were associated with the insertion loop, which is absent in many orthologs, suggesting that the predictors still rely heavily on information from multiple sequence alignments.
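The two evaluation metrics named in the abstract above, Kendall's tau and ROC AUC, can be sketched in a few lines of plain Python. This is an illustrative sketch, not the assessors' evaluation code: the scores are invented toy values, the 0.4 fitness cutoff used to label variants as deleterious is an arbitrary assumption, and the tau-b variant (which corrects for ties) is one reasonable choice among several.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's tau-b rank correlation, computed over all pairs (O(n^2))."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sx = (x1 > x2) - (x1 < x2)  # sign of the difference in x
        sy = (y1 > y2) - (y1 < y2)  # sign of the difference in y
        if sx == 0 and sy == 0:
            continue  # tied in both variables: ignored by tau-b
        if sx == 0:
            tied_x_only += 1
        elif sy == 0:
            tied_y_only += 1
        elif sx == sy:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denom = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    return (concordant - discordant) / denom if denom else float("nan")


def roc_auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (invented): predicted vs. measured fitness for five variants.
predicted = [0.1, 0.4, 0.35, 0.8, 0.6]
measured = [0.2, 0.3, 0.5, 0.9, 0.7]

# Assumed cutoff: variants with measured fitness below 0.4 are "deleterious".
labels = [m < 0.4 for m in measured]
# Low predicted fitness should indicate deleteriousness, so negate the scores.
scores = [-p for p in predicted]

print("Kendall tau-b:", kendall_tau_b(predicted, measured))
print("ROC AUC:", roc_auc(scores, labels))
```

On this toy data nine of the ten variant pairs are ordered consistently, giving a tau-b of 0.8, and five of the six deleterious/benign pairs are ranked correctly, giving an AUC of 5/6; the values reported in the abstract (around 0.3 and 0.7) indicate far weaker agreement.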
Affiliation(s)
- Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Lisa Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Milind Jagota
- Computer Science Division, University of California, Berkeley, CA, 94720, USA
| | - Yun S Song
- Computer Science Division, University of California, Berkeley, CA, 94720, USA
- Department of Statistics, University of California, Berkeley, Berkeley, CA, 94720, USA
| | - Yuanfei Sun
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
| | - Nurdan Kuru
- Faculty of Engineering and Natural Sciences, Sabanci University, Tuzla, Turkey
| | - Onur Dereli
- Faculty of Engineering and Natural Sciences, Sabanci University, Tuzla, Turkey
| | - Ogun Adebali
- Faculty of Engineering and Natural Sciences, Sabanci University, Tuzla, Turkey
| | - Muttaqi Ahmad Alladin
- Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru, 560012, India
| | - Debnath Pal
- Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru, 560012, India
| | - Emidio Capriotti
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Maria Paola Turina
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Castrense Savojardo
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Pier Luigi Martelli
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Giulia Babbi
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Rita Casadio
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Gabriel Cia
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Matsvei Tsishyn
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Alexey Strokach
- Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
| | - Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, 94720, USA
| | - Warren van Loggerenberg
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15213, USA
- Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, M5G 1X5, Canada
| | - Frederick P Roth
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15213, USA
- Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, M5G 1X5, Canada
| | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, 02115, USA
| | - Steven E Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, 94720, USA
- Biophysics Graduate Group, University of California, Berkeley, Berkeley, CA, 94720, USA
| | - Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| |
|
22
|
Gerasimavicius L, Teichmann SA, Marsh JA. Leveraging protein structural information to improve variant effect prediction. Curr Opin Struct Biol 2025; 92:103023. [PMID: 39987793 DOI: 10.1016/j.sbi.2025.103023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2024] [Revised: 12/17/2024] [Accepted: 02/05/2025] [Indexed: 02/25/2025]
Abstract
Despite massive sequencing efforts, understanding the difference between human pathogenic and benign variants remains a challenge. Computational variant effect predictors (VEPs) have emerged as essential tools for assessing the impact of genetic variants, although their performance varies. Initially, sequence-based methods dominated the field, but recent advances, particularly in protein structure prediction technologies like AlphaFold, have led to an increased utilization of structural information by VEPs aimed at scoring human missense variants. This review highlights the progress in integrating structural information into VEPs, showcasing novel models such as AlphaMissense, PrimateAI-3D, and CPT-1 that demonstrate improved variant evaluation. Structural data offers more interpretability, especially for non-loss-of-function variants, and provides insights into complex variant interactions in vivo. As the field advances, utilizing biomolecular complex structures will be pivotal for future VEP development, with recent breakthroughs in protein-ligand and protein-nucleic acid complex prediction offering new avenues.
Affiliation(s)
- Lukas Gerasimavicius
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, United Kingdom
| | - Sarah A Teichmann
- Cambridge Stem Cell Institute & Dept Medicine, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, United Kingdom; Canadian Institute for Advanced Research, Toronto, Canada
| | - Joseph A Marsh
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, United Kingdom.
| |
|
23
|
Li Z, Luo Y. Rewiring protein sequence and structure generative models to enhance protein stability prediction. bioRxiv 2025:2025.02.13.638154. [PMID: 40027759 PMCID: PMC11870403 DOI: 10.1101/2025.02.13.638154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 03/05/2025]
Abstract
Predicting changes in protein thermostability due to amino acid substitutions is essential for understanding human diseases and engineering useful proteins for clinical and industrial applications. While recent advances in protein generative models, which learn probability distributions over amino acids conditioned on structural or evolutionary sequence contexts, have shown impressive performance in predicting various protein properties without task-specific training, their strong unsupervised prediction ability does not extend to all protein functions. In particular, their potential to improve protein stability prediction remains underexplored. In this work, we present SPURS, a novel deep learning framework that adapts and integrates two general-purpose protein generative models, a protein language model (ESM) and an inverse folding model (ProteinMPNN), into an effective stability predictor. SPURS employs a lightweight neural network module to rewire per-residue structure representations learned by ProteinMPNN into the attention layers of ESM, thereby informing and enhancing ESM's sequence representation learning. This rewiring strategy enables SPURS to harness evolutionary patterns from both sequence and structure data, where the sequence likelihood distribution learned by ESM is conditioned on structure priors encoded by ProteinMPNN to predict mutation effects. We steer this integrated framework to a stability prediction model through supervised training on a recently released mega-scale thermostability dataset. Evaluations across 12 benchmark datasets showed that SPURS delivers accurate, rapid, scalable, and generalizable stability predictions, consistently outperforming current state-of-the-art methods. Notably, SPURS demonstrates remarkable versatility in protein stability and function analyses: when combined with a protein language model, it accurately identifies protein functional sites in an unsupervised manner. Additionally, it enhances current low-N protein fitness prediction models by serving as a stability prior model to improve accuracy. These results highlight SPURS as a powerful tool to advance current protein stability prediction and machine learning-guided protein engineering workflows. The source code of SPURS is available at https://github.com/luo-group/SPURS.
24
Forrest B, Derbel H, Zhao Z, Liu Q. MMRT: MultiMut Recursive Tree for predicting functional effects of high-order protein variants from low-order variants. Comput Struct Biotechnol J 2025; 27:672-681. [PMID: 40070521 PMCID: PMC11894328 DOI: 10.1016/j.csbj.2025.02.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2024] [Revised: 02/10/2025] [Accepted: 02/17/2025] [Indexed: 03/14/2025] Open
Abstract
Protein sequences primarily determine their stability and functions. Mutations may occur at one, two, or three positions at the same time (low-order variants) or at multiple positions simultaneously (high-order variants), which affect protein functions. So far, low-order variants, such as single variants, double variants, and triple variants, have been well-studied through high-throughput experimental scanning techniques and computational prediction methods. However, research on high-order variants remains limited because of the difficulty of scanning an exponentially large number of potential variant combinations. Nonetheless, studying higher-order variants is crucial for understanding the pathogenesis of complex diseases, advancing protein engineering, and driving precision medicine. In this work, we introduce a novel deep learning model, namely MultiMut Recursive Tree (MMRT), to address this challenge of predicting the functional effects of high-order variants. MMRT integrates deep learning with a recursive tree framework to leverage the information from low-order variants to predict functional effects of high-order variants. We evaluated MMRT on datasets comprising 685,593 high-order variants. Our results (mean Spearman's correlation coefficient 0.55) demonstrated that MMRT outperformed three existing state-of-the-art methods: ESM (evolutionary scale modeling), DeepSequence, and ECNet (evolutionary context-integrated neural network). MMRT thus provides more accurate prediction of the functional effects of high-order protein variants, offering great potential for aiding the interpretation of variants in human disease studies.
Affiliation(s)
- Bryce Forrest
  - Nevada Institute of Personalized Medicine, University of Nevada, Las Vegas, 4505 S Maryland Pkwy, Las Vegas, NV 89154, USA
- Houssemeddine Derbel
  - Nevada Institute of Personalized Medicine, University of Nevada, Las Vegas, 4505 S Maryland Pkwy, Las Vegas, NV 89154, USA
- Zhongming Zhao
  - Center for Precision Health, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Qian Liu
  - Nevada Institute of Personalized Medicine, University of Nevada, Las Vegas, 4505 S Maryland Pkwy, Las Vegas, NV 89154, USA
  - School of Life Sciences, College of Sciences, University of Nevada, Las Vegas, 4505 S Maryland Pkwy, Las Vegas, NV 89154, USA
25
Vieira LC, Handojo ML, Wilke CO. Scaling down for efficiency: Medium-sized protein language models perform well at transfer learning on realistic datasets. bioRxiv 2025:2024.11.22.624936. [PMID: 39605589 PMCID: PMC11601519 DOI: 10.1101/2024.11.22.624936] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Protein language models (pLMs) can offer deep insights into evolutionary and structural properties of proteins. While larger models, such as the 15 billion parameter model ESM-2, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and high computational cost. We systematically evaluated the performance of various pLMs across multiple biological datasets to assess the impact of model size on transfer learning. Surprisingly, we found that larger models do not necessarily outperform smaller ones, in particular when data is limited. Medium-sized models, such as ESM-2 650M and ESM C 600M, demonstrated consistently good performance, falling only slightly behind their larger counterparts (ESM-2 15B and ESM C 6B) despite being many times smaller. Additionally, we compared various methods of compressing embeddings prior to transfer learning, and we found that mean embeddings consistently outperformed other compression methods. In summary, ESM C 600M with mean embeddings offers an optimal balance between performance and efficiency, making it a practical and scalable choice for transfer learning in realistic biological applications.
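As an illustration of the "mean embeddings" compression mentioned in this abstract, the following is a minimal sketch, not the authors' code. It assumes a pLM has already produced a per-residue embedding matrix of shape (L, D); the function name `mean_embedding`, the optional mask, and the toy array are hypothetical stand-ins introduced here.

```python
from typing import Optional

import numpy as np


def mean_embedding(per_residue: np.ndarray,
                   mask: Optional[np.ndarray] = None) -> np.ndarray:
    """Compress an (L, D) per-residue embedding matrix into one D-vector.

    `mask` (hypothetical) is a boolean vector of length L selecting the
    positions to average, e.g. to skip special tokens.
    """
    per_residue = np.asarray(per_residue, dtype=float)
    if mask is None:
        return per_residue.mean(axis=0)
    mask = np.asarray(mask, dtype=bool)
    return per_residue[mask].mean(axis=0)


# Toy stand-in for pLM output: 5 residues, 4 embedding dimensions.
toy = np.arange(20, dtype=float).reshape(5, 4)
pooled = mean_embedding(toy)
print(pooled.shape)  # fixed-length vector regardless of sequence length
```

The pooled vector has the same length for every protein, which is what lets a small downstream regressor or classifier be trained on sequences of different lengths.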
Affiliation(s)
- Luiz C. Vieira
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX, United States of America
- Morgan L. Handojo
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX, United States of America
- Claus O. Wilke
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX, United States of America
26
Shen A, Ektefaie Y, Jain L, Farhat M, Zitnik M. Phyla: Towards a foundation model for phylogenetic inference. bioRxiv 2025:2025.01.17.633626. [PMID: 39896621 PMCID: PMC11785049 DOI: 10.1101/2025.01.17.633626] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2025]
Abstract
Deep learning has made strides in modeling protein sequences but often struggles to generalize beyond its training distribution. Current models focus on learning individual sequences through masked language modeling, but effective protein sequence analysis demands the ability to reason across sequences, a critical step in phylogenetic analysis. Training biological foundation models explicitly for intersequence reasoning could enhance their generalizability and performance for phylogenetic inference and other tasks in computational biology. Here, we report an ongoing development of Phyla, an architecture that operates on an explicit, higher-level semantic representation of phylogenetic trees. Phyla employs a hybrid state-space transformer architecture and a novel tree loss function to achieve state-of-the-art performance on sequence reasoning benchmarks and phylogenetic tree reconstruction. To validate Phyla's capabilities, we applied it to reconstruct the tree of life, where Phyla accurately reclassified archaeal organisms, such as Lokiarchaeota, as more closely related to bacteria, aligning with recent phylogenetic insights. Phyla represents a step toward molecular sequence reasoning, emphasizing structured reasoning over memorization and advancing protein sequence analysis and phylogenetic inference.
Affiliation(s)
- Andrew Shen
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Department of Computer Science, Northwestern University, Evanston, IL, USA
- Yasha Ektefaie
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Maha Farhat
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Division of Pulmonary and Critical Care, Department of Medicine, Massachusetts General Hospital, Boston, MA, USA
- Marinka Zitnik
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Allston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Harvard Data Science Initiative, Cambridge, MA, USA
27
Elkin ME, Zhu X. Paying attention to the SARS-CoV-2 dialect: a deep neural network approach to predicting novel protein mutations. Commun Biol 2025; 8:98. [PMID: 39838059 PMCID: PMC11751191 DOI: 10.1038/s42003-024-07262-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2024] [Accepted: 11/13/2024] [Indexed: 01/23/2025] Open
Abstract
Predicting novel mutations has long-lasting impacts on life science research. Traditionally, this problem is addressed through wet-lab experiments, which are often expensive and time consuming. The recent advancement in neural language models has provided stunning results in modeling and deciphering sequences. In this paper, we propose a Deep Novel Mutation Search (DNMS) method, using deep neural networks, to model protein sequence for mutation prediction. We use SARS-CoV-2 spike protein as the target and use a protein language model to predict novel mutations. Different from existing research which is often limited to mutating the reference sequence for prediction, we propose a parent-child mutation prediction paradigm where a parent sequence is modeled for mutation prediction. Because mutations introduce changing context to the underlying sequence, DNMS models three aspects of the protein sequences: semantic changes, grammatical changes, and attention changes, each modeling protein sequence aspects from shifting of semantics, grammar coherence, and amino-acid interactions in latent space. A ranking approach is proposed to combine all three aspects to capture mutations demonstrating evolving traits, in accordance with real-world SARS-CoV-2 spike protein sequence evolution. DNMS can be adopted for an early warning variant detection system, creating public health awareness of future SARS-CoV-2 mutations.
Affiliation(s)
- Magdalyn E Elkin
  - Dept. Electrical Engineering and Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL, 33431, USA.
- Xingquan Zhu
  - Dept. Electrical Engineering and Computer Science, Florida Atlantic University, 777 Glades Road, Boca Raton, FL, 33431, USA.
28
Zheng N, Cai Y, Zhang Z, Zhou H, Deng Y, Du S, Tu M, Fang W, Xia X. Tailoring industrial enzymes for thermostability and activity evolution by the machine learning-based iCASE strategy. Nat Commun 2025; 16:604. [PMID: 39799136 PMCID: PMC11724889 DOI: 10.1038/s41467-025-55944-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2024] [Accepted: 01/03/2025] [Indexed: 01/15/2025] Open
Abstract
The pursuit of obtaining enzymes with high activity and stability remains a grail in enzyme evolution due to the stability-activity trade-off. Here, we develop an isothermal compressibility-assisted dynamic squeezing index perturbation engineering (iCASE) strategy to construct hierarchical modular networks for enzymes of varying complexity. Molecular mechanism analysis elucidates that the peak of adaptive evolution is reached through a structural response mechanism among variants. Furthermore, this dynamic response predictive model using structure-based supervised machine learning is established to predict enzyme function and fitness, demonstrating robust performance across different datasets and reliable prediction for epistasis. The universality of the iCASE strategy is validated by four sorts of enzymes with different structures and catalytic types. This machine learning-based iCASE strategy provides guidance for future research on the fitness evolution of enzymes.
Affiliation(s)
- Nan Zheng
  - Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China
- Yongchao Cai
  - Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China
- Zehua Zhang
  - Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China
- Huimin Zhou
  - Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China
- Yu Deng
  - Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China
- Shuang Du
  - Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China
- Mai Tu
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, PR China
- Wei Fang
  - School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, PR China
- Xiaole Xia
  - Key Laboratory of Industrial Biotechnology, Ministry of Education, School of Biotechnology, Jiangnan University, Wuxi, PR China.
  - College of Food Science and Engineering, Tianjin University of Science and Technology, Tianjin, PR China.
29
Sun J, Zhu T, Cui Y, Wu B. Structure-based self-supervised learning enables ultrafast protein stability prediction upon mutation. Innovation (N Y) 2025; 6:100750. [PMID: 39872490 PMCID: PMC11763918 DOI: 10.1016/j.xinn.2024.100750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Accepted: 12/02/2024] [Indexed: 01/30/2025] Open
Abstract
Predicting free energy changes (ΔΔG) is essential for enhancing our understanding of protein evolution and plays a pivotal role in protein engineering and pharmaceutical development. While traditional methods offer valuable insights, they are often constrained by computational speed and reliance on biased training datasets. These constraints become particularly evident when aiming for accurate ΔΔG predictions across a diverse array of protein sequences. Herein, we introduce Pythia, a self-supervised graph neural network specifically designed for zero-shot ΔΔG predictions. Our comparative benchmarks demonstrate that Pythia outperforms other self-supervised pretraining models and force field-based approaches while also exhibiting competitive performance with fully supervised models. Notably, Pythia shows strong correlations and achieves a remarkable increase in computational speed of up to 10^5-fold. We further validated Pythia's performance in predicting the thermostabilizing mutations of limonene epoxide hydrolase, leading to higher experimental success rates. This exceptional efficiency has enabled us to explore 26 million high-quality protein structures, marking a significant advancement in our ability to navigate the protein sequence space and enhance our understanding of the relationships between protein genotype and phenotype. In addition, we established a web server at https://pythia.wulab.xyz to allow users to easily perform such predictions.
Affiliation(s)
- Jinyuan Sun
  - AIM Center, College of Life Sciences and Technology, Beijing University of Chemical Technology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Tong Zhu
  - AIM Center, College of Life Sciences and Technology, Beijing University of Chemical Technology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Yinglu Cui
  - AIM Center, College of Life Sciences and Technology, Beijing University of Chemical Technology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- Bian Wu
  - AIM Center, College of Life Sciences and Technology, Beijing University of Chemical Technology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
30
Ma Z, Li W, Shen Y, Xu Y, Liu G, Chang J, Li Z, Qin H, Tian B, Gong H, Liu DR, Thuronyi BW, Voigt CA, Zhang S. EvoAI enables extreme compression and reconstruction of the protein sequence space. Nat Methods 2025; 22:102-112. [PMID: 39528677 DOI: 10.1038/s41592-024-02504-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Accepted: 10/10/2024] [Indexed: 11/16/2024]
Abstract
Designing proteins with improved functions requires a deep understanding of how sequence and function are related, a vast space that is hard to explore. The ability to efficiently compress this space by identifying functionally important features is extremely valuable. Here we establish a method called EvoScan to comprehensively segment and scan the high-fitness sequence space to obtain anchor points that capture its essential features, especially in high dimensions. Our approach is compatible with any biomolecular function that can be coupled to a transcriptional output. We then develop deep learning and large language models to accurately reconstruct the space from these anchors, allowing computational prediction of novel, highly fit sequences without prior homology-derived or structural information. We apply this hybrid experimental-computational method, which we call EvoAI, to a repressor protein and find that only 82 anchors are sufficient to compress the high-fitness sequence space with a compression ratio of 10^48. The extreme compressibility of the space informs both applied biomolecular design and understanding of natural evolution.
Affiliation(s)
- Ziyuan Ma
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Wenjie Li
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Yunhao Shen
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Yunxin Xu
  - School of Life Sciences, Tsinghua University, Beijing, China
- Gengjiang Liu
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Jiamin Chang
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Zeju Li
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Hong Qin
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Boxue Tian
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
  - State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Haipeng Gong
  - School of Life Sciences, Tsinghua University, Beijing, China
- David R Liu
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA
- B W Thuronyi
  - Department of Chemistry, Williams College, Williamstown, MA, USA
- Christopher A Voigt
  - Synthetic Biology Center, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Shuyi Zhang
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China.
  - State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China.
  - Center for Synthetic and Systems Biology, Tsinghua University, Beijing, China.
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China.
31
Yin M, Zhou H, Zhu Y, Lin M, Wu Y, Wu J, Xu H, Hsieh CY, Hou T, Chen J, Wu J. Multi-Modal CLIP-Informed Protein Editing. Health Data Science 2024; 4:0211. [PMID: 39703565 PMCID: PMC11658819 DOI: 10.34133/hds.0211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2024] [Revised: 10/17/2024] [Accepted: 11/12/2024] [Indexed: 12/21/2024]
Abstract
Background: Proteins govern most biological functions essential for life, and achieving controllable protein editing has made great advances in probing natural systems, creating therapeutic conjugates, and generating novel protein constructs. Recently, machine learning-assisted protein editing (MLPE) has shown promise in accelerating optimization cycles and reducing experimental workloads. However, current methods struggle with the vast combinatorial space of potential protein edits and cannot explicitly conduct protein editing using biotext instructions, limiting their interactivity with human feedback.
Methods: To fill these gaps, we propose a novel method called ProtET for efficient CLIP-informed protein editing through multi-modality learning. Our approach comprises 2 stages: In the pretraining stage, contrastive learning aligns protein-biotext representations encoded by 2 large language models (LLMs). Subsequently, during the protein editing stage, the fused features from editing instruction texts and original protein sequences serve as the final editing condition for generating target protein sequences.
Results: Comprehensive experiments demonstrated the superiority of ProtET in editing proteins to enhance human-expected functionality across multiple attribute domains, including enzyme catalytic activity, protein stability, and antibody-specific binding ability. ProtET improves the state-of-the-art results by a large margin, leading to substantial stability improvements of 16.67% and 16.90%.
Conclusions: This capability positions ProtET to advance real-world artificial protein editing, potentially addressing unmet academic, industrial, and clinical needs.
Affiliation(s)
- Mingze Yin
  - School of Medicine, Zhejiang University, Hangzhou, China
- Hanjing Zhou
  - College of Computer Science and Technology, Zhejiang University, Hangzhou, China
- Yiheng Zhu
  - College of Computer Science and Technology, Zhejiang University, Hangzhou, China
- Miao Lin
  - Medical Big Data Center, Guangdong Provincial People’s Hospital (Guangdong Academy of Medical Sciences), Southern Medical University, Guangzhou, China
- Yixuan Wu
  - School of Medicine, Zhejiang University, Hangzhou, China
- Jialu Wu
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Hongxia Xu
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Chang-Yu Hsieh
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Tingjun Hou
  - Innovation Institute for Artificial Intelligence in Medicine of Zhejiang University, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, China
- Jintai Chen
  - AI Thrust, Information Hub, HKUST (Guangzhou), Guangzhou, China
- Jian Wu
  - Second Affiliated Hospital School of Medicine, Hangzhou, China
  - School of Public Health, Zhejiang University, Hangzhou, China
  - Institute of Wenzhou, Wenzhou, China
32
Totaro MG, Vide U, Zausinger R, Winkler A, Oberdorfer G. ESM-scan: A tool to guide amino acid substitutions. Protein Sci 2024; 33:e5221. [PMID: 39565080 PMCID: PMC11577456 DOI: 10.1002/pro.5221] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2024] [Revised: 09/27/2024] [Accepted: 10/28/2024] [Indexed: 11/21/2024]
Abstract
Protein structure prediction and (re)design have gone through a revolution in the last 3 years. The tremendous progress in these fields has been almost exclusively driven by readily available machine learning algorithms applied to protein folding and sequence design problems. Despite these advancements, predicting site-specific mutational effects on protein stability and function remains an unsolved problem. This is a persistent challenge, mainly because the free energy of large systems is very difficult to compute with absolute accuracy and subtle changes to protein structures are hard to capture with computational models. Here, we describe the implementation and use of ESM-Scan, which uses the ESM zero-shot predictor to scan entire protein sequences for preferential amino acid changes, thus enabling in silico deep mutational scanning experiments. We benchmark ESM-Scan on its predictive capabilities for stability and functionality of sequence changes using three publicly available datasets and proceed by experimentally testing the tool's performance on a challenging test case of a blue-light-activated diguanylate cyclase from Methylotenera species (MsLadC), where it accurately predicted the importance of a highly conserved residue in a region involved in allosteric product inhibition. Our experimental results show that the ESM-zero shot model is capable of inferring the effects of a set of amino acid substitutions in their correlation between predicted fitness and experimental results. ESM-Scan is publicly available at https://huggingface.co/spaces/thaidaev/zsp.
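To make the idea of an in silico deep mutational scan concrete, here is a minimal sketch of one common way zero-shot language-model scores are computed: the log-odds of a substituted residue against the wild-type residue at each position. This is an assumption for illustration, not ESM-Scan's confirmed implementation. The probability matrix is random toy data standing in for model output, and `scan_substitutions` is a hypothetical helper.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def scan_substitutions(wild_type: str, probs: np.ndarray) -> np.ndarray:
    """Return an (L, 20) matrix of log-odds scores for every substitution.

    probs[i, j] is the (assumed) model probability of amino acid j at
    position i. Positive scores mean the substitution is more likely than
    the wild-type residue; wild-type entries score exactly 0.
    """
    probs = np.asarray(probs, dtype=float)
    if probs.shape != (len(wild_type), len(AMINO_ACIDS)):
        raise ValueError("probs must have shape (len(wild_type), 20)")
    log_p = np.log(probs)
    wt_idx = np.array([AMINO_ACIDS.index(aa) for aa in wild_type])
    wt_log_p = log_p[np.arange(len(wild_type)), wt_idx]
    return log_p - wt_log_p[:, None]


# Toy data: random per-position distributions instead of real pLM output.
rng = np.random.default_rng(0)
wild_type = "MKT"
toy_probs = rng.dirichlet(np.ones(len(AMINO_ACIDS)), size=len(wild_type))
scores = scan_substitutions(wild_type, toy_probs)

# Report the highest-scoring residue at each position.
for pos, row in enumerate(scores):
    best = AMINO_ACIDS[int(np.argmax(row))]
    print(f"{wild_type[pos]}{pos + 1}{best}: {row.max():.2f}")
```

With a real model, `toy_probs` would be replaced by the per-position softmax output, and the resulting matrix can be read like an experimental deep mutational scanning heatmap.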
Affiliation(s)
- Uršula Vide
  - Institute of Biochemistry, Graz University of Technology, Graz, Austria
- Regina Zausinger
  - Institute of Biochemistry, Graz University of Technology, Graz, Austria
- Andreas Winkler
  - Institute of Biochemistry, Graz University of Technology, Graz, Austria
  - BioTechMed, Graz, Austria
- Gustav Oberdorfer
  - Institute of Biochemistry, Graz University of Technology, Graz, Austria
  - BioTechMed, Graz, Austria
33
Soleymani F, Paquet E, Viktor HL, Michalowski W. Structure-based protein and small molecule generation using EGNN and diffusion models: A comprehensive review. Comput Struct Biotechnol J 2024; 23:2779-2797. [PMID: 39050782 PMCID: PMC11268121 DOI: 10.1016/j.csbj.2024.06.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 06/13/2024] [Accepted: 06/18/2024] [Indexed: 07/27/2024] Open
Abstract
Recent breakthroughs in deep learning have revolutionized protein sequence and structure prediction. These advancements are built on decades of protein design efforts, and are overcoming traditional time and cost limitations. Diffusion models, at the forefront of these innovations, significantly enhance design efficiency by automating knowledge acquisition. In the field of de novo protein design, the goal is to create entirely novel proteins with predetermined structures. Given the arbitrary positions of proteins in 3-D space, graph representations and their properties are widely used in protein generation studies. A critical requirement in protein modelling is maintaining spatial relationships under transformations (rotations, translations, and reflections). This property, known as equivariance, ensures that predicted protein characteristics adapt seamlessly to changes in orientation or position. Equivariant graph neural networks offer a solution to this challenge. By incorporating equivariant graph neural networks to learn the score of the probability density function in diffusion models, one can generate proteins with robust 3-D structural representations. This review examines the latest deep learning advancements, specifically focusing on frameworks that combine diffusion models with equivariant graph neural networks for protein generation.
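The equivariance property described in this abstract can be checked numerically. The sketch below is an illustration under stated assumptions rather than anything taken from the review: it uses random toy coordinates, a random orthogonal matrix (a rotation or a reflection) plus a translation, and two hypothetical helper functions. Pairwise distances should be invariant under the transformation, while the centroid should be equivariant, meaning it moves with the structure.

```python
import numpy as np


def pairwise_distances(x: np.ndarray) -> np.ndarray:
    """Invariant feature: the (N, N) matrix of inter-point distances."""
    diff = x[:, None, :] - x[None, :, :]
    return np.linalg.norm(diff, axis=-1)


def centroid(x: np.ndarray) -> np.ndarray:
    """Equivariant feature: the mean position of all points."""
    return x.mean(axis=0)


rng = np.random.default_rng(42)
coords = rng.normal(size=(6, 3))                  # toy 3-D "atom" positions
Q, _ = np.linalg.qr(rng.normal(size=(3, 3)))      # random orthogonal matrix
t = rng.normal(size=3)                            # random translation

moved = coords @ Q.T + t                          # apply x -> Qx + t

# Invariance: distances are unchanged by the transformation.
invariant_ok = np.allclose(pairwise_distances(coords),
                           pairwise_distances(moved))
# Equivariance: f(Qx + t) equals Q f(x) + t for the centroid.
equivariant_ok = np.allclose(centroid(moved), centroid(coords) @ Q.T + t)
print(invariant_ok, equivariant_ok)
```

Equivariant graph neural networks are built so that their coordinate outputs satisfy the second check by construction, while scalar features behave like the first.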
Affiliation(s)
- Farzan Soleymani
  - Telfer School of Management, University of Ottawa, ON, K1N 6N5, Canada
- Eric Paquet
  - National Research Council, 1200 Montreal Road, Ottawa, ON, K1A 0R6, Canada
  - School of Electrical Engineering and Computer Science, University of Ottawa, ON, K1N 6N5, Canada
- Herna Lydia Viktor
  - School of Electrical Engineering and Computer Science, University of Ottawa, ON, K1N 6N5, Canada
34
Thompson MD, Reiner-Link D, Berghella A, Rana BK, Rovati GE, Capra V, Gorvin CM, Hauser AS. G protein-coupled receptor (GPCR) pharmacogenomics. Crit Rev Clin Lab Sci 2024; 61:641-684. [PMID: 39119983 DOI: 10.1080/10408363.2024.2358304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/15/2023] [Revised: 09/03/2023] [Accepted: 05/18/2024] [Indexed: 08/10/2024]
Abstract
The field of pharmacogenetics, the investigation of the influence of one or more sequence variants on drug response phenotypes, is a special case of pharmacogenomics, a discipline that takes a genome-wide approach. Massively parallel, next generation sequencing (NGS), has allowed pharmacogenetics to be subsumed by pharmacogenomics with respect to the identification of variants associated with responders and non-responders, optimal drug response, and adverse drug reactions. A plethora of rare and common naturally-occurring GPCR variants must be considered in the context of signals from across the genome. Many fundamentals of pharmacogenetics were established for G protein-coupled receptor (GPCR) genes because they are primary targets for a large number of therapeutic drugs. Functional studies, demonstrating likely-pathogenic and pathogenic GPCR variants, have been integral to establishing models used for in silico analysis. Variants in GPCR genes include both coding and non-coding single nucleotide variants and insertion or deletions (indels) that affect cell surface expression (trafficking, dimerization, and desensitization/downregulation), ligand binding and G protein coupling, and variants that result in alternate splicing encoding isoforms/variable expression. As the breadth of data on the GPCR genome increases, we may expect an increase in the use of drug labels that note variants that significantly impact the clinical use of GPCR-targeting agents. We discuss the implications of GPCR pharmacogenomic data derived from the genomes available from individuals who have been well-phenotyped for receptor structure and function and receptor-ligand interactions, and the potential benefits to patients of optimized drug selection. Examples discussed include the renin-angiotensin system in SARS-CoV-2 (COVID-19) infection, the probable role of chemokine receptors in the cytokine storm, and potential protease activating receptor (PAR) interventions. Resources dedicated to GPCRs, including publicly available computational tools, are also discussed.
Affiliation(s)
- Miles D Thompson
  - Krembil Brain Institute, Toronto Western Hospital, Toronto, Ontario, Canada
- David Reiner-Link
  - Department of Drug Design and Pharmacology, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Alessandro Berghella
  - Department of Drug Design and Pharmacology, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Brinda K Rana
  - Department of Psychiatry, University of California San Diego, San Diego, CA, USA
- G Enrico Rovati
  - Department of Pharmacological and Biomolecular Sciences, Università degli Studi di Milano, Milan, Italy
- Valerie Capra
  - Department of Pharmacological and Biomolecular Sciences, Università degli Studi di Milano, Milan, Italy
- Caroline M Gorvin
  - Institute of Metabolism and Systems Research (IMSR), University of Birmingham, Birmingham, United Kingdom
- Alexander S Hauser
  - Department of Drug Design and Pharmacology, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
35
James J, Towers S, Foerster J, Steel H. Optimisation strategies for directed evolution without sequencing. PLoS Comput Biol 2024; 20:e1012695. [PMID: 39700257 PMCID: PMC11698521 DOI: 10.1371/journal.pcbi.1012695] [Received: 03/15/2024] [Revised: 01/03/2025] [Accepted: 12/04/2024] [Indexed: 12/21/2024]
Abstract
Directed evolution can enable engineering of biological systems with minimal knowledge of their underlying sequence-to-function relationships. A typical directed evolution process consists of iterative rounds of mutagenesis and selection that are designed to steer changes in a biological system (e.g. a protein) towards some functional goal. Much work has been done, particularly leveraging advancements in machine learning, to optimise the process of directed evolution. Many of these methods, however, require DNA sequencing and synthesis, making them resource-intensive and incompatible with developments in targeted in vivo mutagenesis. Operating within the experimental constraints of established sorting-based directed evolution techniques (e.g. Fluorescence-Activated Cell Sorting, FACS), we explore approaches for optimisation of directed evolution that could in future be implemented without sequencing information. We then expand our methods to the context of emerging experimental techniques in directed evolution, which allow for single-cell selection based on fitness objectives defined from any combination of measurable traits. Finally, we explore these alternative strategies on the GB1 and TrpB empirical landscapes, demonstrating that they could lead to up to 19-fold and 7-fold increases respectively in the probability of attaining the global fitness peak.
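The iterative mutagenesis-and-selection loop described in this abstract can be sketched as a toy simulation. This is a minimal illustration and not the authors' method: the match-to-target fitness function, the function names, and every parameter value are assumptions made here. The fitness call only stands in for a measured phenotype (for example a sorting signal), so the selection step ranks variants by that value alone.

```python
import random

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def fitness(seq, target):
    """Hypothetical phenotype: fraction of positions matching a target."""
    return sum(a == b for a, b in zip(seq, target)) / len(target)


def mutate(seq, rate, rng):
    """Replace each residue with a random one with probability `rate`."""
    return "".join(
        rng.choice(ALPHABET) if rng.random() < rate else aa for aa in seq
    )


def directed_evolution(start, target, rounds=10, library_size=200,
                       keep_fraction=0.1, rate=0.05, seed=0):
    """Run rounds of mutagenesis and truncation selection (toy model)."""
    rng = random.Random(seed)
    population = [start]
    history = []
    for _ in range(rounds):
        # Mutagenesis: each library member is a mutated copy of a survivor.
        library = [mutate(rng.choice(population), rate, rng)
                   for _ in range(library_size)]
        # Parents are kept, so the best fitness can never decrease.
        library.extend(population)
        # Selection: rank by phenotype and keep the top fraction.
        library.sort(key=lambda s: fitness(s, target), reverse=True)
        n_keep = max(1, int(keep_fraction * len(library)))
        population = library[:n_keep]
        history.append(fitness(population[0], target))
    return population[0], history
```

Calling `directed_evolution("AAAAAAAA", "ACDEFGHI")` returns the best variant found and the best fitness after each round. Changing `keep_fraction` or `rate` is one simple way to explore how selection stringency and mutation load trade off in such a loop.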
Affiliation(s)
- Jessica James
  - Department of Engineering Science, University of Oxford, Oxford, United Kingdom
- Sebastian Towers
  - Department of Engineering Science, University of Oxford, Oxford, United Kingdom
- Jakob Foerster
  - Department of Engineering Science, University of Oxford, Oxford, United Kingdom
- Harrison Steel
  - Department of Engineering Science, University of Oxford, Oxford, United Kingdom

36
Harding-Larsen D, Funk J, Madsen NG, Gharabli H, Acevedo-Rocha CG, Mazurenko S, Welner DH. Protein representations: Encoding biological information for machine learning in biocatalysis. Biotechnol Adv 2024; 77:108459. [PMID: 39366493 DOI: 10.1016/j.biotechadv.2024.108459] [Received: 04/18/2024] [Revised: 09/19/2024] [Accepted: 09/29/2024] [Indexed: 10/06/2024]
Abstract
Enzymes offer a more environmentally friendly and low-impact solution to conventional chemistry, but they often require additional engineering for their application in industrial settings, an endeavour that is challenging and laborious. To address this issue, the power of machine learning can be harnessed to produce predictive models that enable the in silico study and engineering of improved enzymatic properties. Such machine learning models, however, require the conversion of the complex biological information to a numerical input, also called protein representations. These inputs demand special attention to ensure the training of accurate and precise models, and, in this review, we therefore examine the critical step of encoding protein information to numeric representations for use in machine learning. We selected the most important approaches for encoding the three distinct biological protein representations - primary sequence, 3D structure, and dynamics - to explore their requirements for employment and inductive biases. Combined representations of proteins and substrates are also introduced as emergent tools in biocatalysis. We propose the division of fixed representations, a collection of rule-based encoding strategies, and learned representations extracted from the latent spaces of large neural networks. To select the most suitable protein representation, we propose two main factors to consider. The first one is the model setup, which is influenced by the size of the training dataset and the choice of architecture. The second factor is the model objectives such as consideration about the assayed property, the difference between wild-type models and mutant predictors, and requirements for explainability. This review is aimed at serving as a source of information and guidance for properly representing enzymes in future machine learning models for biocatalysis.
Affiliation(s)
- David Harding-Larsen
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Jonathan Funk
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Niklas Gesmar Madsen
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Hani Gharabli
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Carlos G Acevedo-Rocha
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Stanislav Mazurenko
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic; International Clinical Research Center, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Ditte Hededam Welner
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark

37
Wang H, Ren Z, Sun J, Chen Y, Bo X, Xue J, Gao J, Ni M. DeepPFP: a multi-task-aware architecture for protein function prediction. Brief Bioinform 2024; 26:bbae579. [PMID: 39905954 PMCID: PMC11794456 DOI: 10.1093/bib/bbae579] [Received: 04/24/2024] [Revised: 09/14/2024] [Accepted: 01/31/2025] [Indexed: 02/06/2025]
Abstract
Deriving protein function from protein sequences poses a significant challenge due to the intricate relationship between sequence and function. Deep learning has made remarkable strides in predicting sequence-function relationships. However, models tailored for specific tasks or protein types encounter difficulties when using transfer learning across domains. This is attributed to the fact that protein function relies heavily on structural characteristics rather than mere sequence information. Consequently, there is a pressing need for a model capable of capturing shared features among diverse sequence-function mapping tasks to address the generalization issue. In this study, we explore the potential of Model-Agnostic Meta-Learning combined with a protein language model called Evolutionary Scale Modeling to tackle this challenge. Our approach involves training the architecture on five out-of-domain deep mutational scanning (DMS) datasets and evaluating its performance across four key dimensions. Our findings demonstrate that the proposed architecture exhibits satisfactory performance in terms of generalization and employs an effective few-shot learning strategy. Specifically, compared to the best results, the Pearson's correlation coefficient (PCC) in the final stage increased by ~0.31%. Furthermore, we leverage the trained architecture to predict binding affinity scores of the DMS dataset of SARS-CoV-2 using transfer learning. Notably, training on a subset of the Ube4b dataset with 500 samples resulted in an improvement of 0.11 in the PCC. These results underscore the potential of our conceptual architecture as a promising methodology for multi-task protein function prediction.
Affiliation(s)
- Han Wang
  - College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- Zilin Ren
  - Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, State Key Laboratory of Pathogen and Biosecurity, Key Laboratory of Jilin Province for Zoonosis Prevention and Control, Changchun 130122, China
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Jinghong Sun
  - College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- Yongbing Chen
  - Changchun Veterinary Research Institute, Chinese Academy of Agricultural Sciences, State Key Laboratory of Pathogen and Biosecurity, Key Laboratory of Jilin Province for Zoonosis Prevention and Control, Changchun 130122, China
  - School of Information Science and Technology, Northeast Normal University, Changchun 130117, China
- Xiaochen Bo
  - Advanced & Interdisciplinary Biotechnology, Academy of Military Medical Sciences, No. 27 Taiping Road, Haidian District, Beijing 100850, China
- JiGuo Xue
  - Advanced & Interdisciplinary Biotechnology, Academy of Military Medical Sciences, No. 27 Taiping Road, Haidian District, Beijing 100850, China
- Jingyang Gao
  - College of Information Science and Technology, Beijing University of Chemical Technology, No. 15 North Third Ring East Road, Chaoyang District, Beijing 100029, China
- Ming Ni
  - Advanced & Interdisciplinary Biotechnology, Academy of Military Medical Sciences, No. 27 Taiping Road, Haidian District, Beijing 100850, China

38
Nishikawa KK, Chen J, Acheson JF, Harbaugh SV, Huss P, Frenkel M, Novy N, Sieren HR, Lodewyk EC, Lee DH, Chávez JL, Fox BG, Raman S. Highly multiplexed design of an allosteric transcription factor to sense new ligands. Nat Commun 2024; 15:10001. [PMID: 39562775 PMCID: PMC11577015 DOI: 10.1038/s41467-024-54260-8] [Received: 05/18/2024] [Accepted: 11/05/2024] [Indexed: 11/21/2024]
Abstract
Allosteric transcription factors (aTF) regulate gene expression through conformational changes induced by small molecule binding. Although widely used as biosensors, aTFs have proven challenging to design for detecting new molecules because mutation of ligand-binding residues often disrupts allostery. Here, we develop Sensor-seq, a high-throughput platform to design and identify aTF biosensors that bind to non-native ligands. We screen a library of 17,737 variants of the aTF TtgR, a regulator of a multidrug exporter, against six non-native ligands of diverse chemical structures - four derivatives of the cancer therapeutic tamoxifen, the antimalarial drug quinine, and the opiate analog naltrexone - as well as two native flavonoid ligands, naringenin and phloretin. Sensor-seq identifies biosensors for each of these ligands with high dynamic range and diverse specificity profiles. The structure of a naltrexone-bound design shows shape-complementary methionine-aromatic interactions driving ligand specificity. To demonstrate practical utility, we develop cell-free detection systems for naltrexone and quinine. Sensor-seq enables rapid and scalable design of new biosensors, overcoming constraints of natural biosensors.
Affiliation(s)
- Kyle K Nishikawa
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Jackie Chen
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Justin F Acheson
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Svetlana V Harbaugh
  - 711th Human Performance Wing, Air Force Research Laboratory, Wright Patterson Air Force Base, OH, USA
- Phil Huss
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Max Frenkel
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Nathan Novy
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Hailey R Sieren
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
  - Dane County Youth Apprenticeship Program, State of Wisconsin Department of Workforce Development, Madison, WI, USA
- Ella C Lodewyk
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
  - Dane County Youth Apprenticeship Program, State of Wisconsin Department of Workforce Development, Madison, WI, USA
- Daniel H Lee
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
  - Dane County Youth Apprenticeship Program, State of Wisconsin Department of Workforce Development, Madison, WI, USA
- Jorge L Chávez
  - 711th Human Performance Wing, Air Force Research Laboratory, Wright Patterson Air Force Base, OH, USA
- Brian G Fox
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
  - Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI, USA
- Srivatsan Raman
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
  - Great Lakes Bioenergy Research Center, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA

39
Rix G, Williams RL, Hu VJ, Spinner H, Pisera AO, Marks DS, Liu CC. Continuous evolution of user-defined genes at 1 million times the genomic mutation rate. Science 2024; 386:eadm9073. [PMID: 39509492 PMCID: PMC11750425 DOI: 10.1126/science.adm9073] [Received: 11/12/2023] [Accepted: 09/10/2024] [Indexed: 11/15/2024]
Abstract
When nature evolves a gene over eons at scale, it produces a diversity of homologous sequences with patterns of conservation and change that contain rich structural, functional, and historical information about the gene. However, natural gene diversity accumulates slowly and likely excludes large regions of functional sequence space, limiting the information that is encoded and extractable. We introduce upgraded orthogonal DNA replication (OrthoRep) systems that radically accelerate the evolution of chosen genes under selection in yeast. When applied to a maladapted biosynthetic enzyme, we obtained collections of extensively diverged sequences with patterns that revealed structural and environmental constraints shaping the enzyme's activity. Our upgraded OrthoRep systems should support the discovery of factors influencing gene evolution, uncover previously unknown regions of fitness landscapes, and find broad applications in biomolecular engineering.
Affiliation(s)
- Gordon Rix
  - Department of Molecular Biology and Biochemistry, University of California; Irvine, CA, 92617, USA
- Rory L. Williams
  - Department of Biomedical Engineering, University of California; Irvine, CA, 92617, USA
- Vincent J. Hu
  - Department of Biomedical Engineering, University of California; Irvine, CA, 92617, USA
- Han Spinner
  - Department of Systems Biology, Harvard Medical School; Boston, MA, 02115, USA
- Debora S. Marks
  - Department of Systems Biology, Harvard Medical School; Boston, MA, 02115, USA
  - Broad Institute of Harvard and MIT; Cambridge, MA, 02142, USA
- Chang C. Liu
  - Department of Molecular Biology and Biochemistry, University of California; Irvine, CA, 92617, USA
  - Department of Biomedical Engineering, University of California; Irvine, CA, 92617, USA
  - Department of Chemistry, University of California; Irvine, CA, 92617, USA
  - Center for Synthetic Biology, University of California; Irvine, CA, 92617, USA

40
Blaabjerg LM, Jonsson N, Boomsma W, Stein A, Lindorff-Larsen K. SSEmb: A joint embedding of protein sequence and structure enables robust variant effect predictions. Nat Commun 2024; 15:9646. [PMID: 39511177 PMCID: PMC11544099 DOI: 10.1038/s41467-024-53982-z] [Received: 01/21/2024] [Accepted: 10/28/2024] [Indexed: 11/15/2024]
Abstract
The ability to predict how amino acid changes affect proteins has a wide range of applications including in disease variant classification and protein engineering. Many existing methods focus on learning from patterns found in either protein sequences or protein structures. Here, we present a method for integrating information from sequence and structure in a single model that we term SSEmb (Sequence Structure Embedding). SSEmb combines a graph representation for the protein structure with a transformer model for processing multiple sequence alignments. We show that by integrating both types of information we obtain a variant effect prediction model that is robust when sequence information is scarce. We also show that SSEmb learns embeddings of the sequence and structure that are useful for other downstream tasks such as to predict protein-protein binding sites. We envisage that SSEmb may be useful both for variant effect predictions and as a representation for learning to predict protein properties that depend on sequence and structure.
Affiliation(s)
- Lasse M Blaabjerg
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen N, Denmark
- Nicolas Jonsson
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen N, Denmark
- Wouter Boomsma
  - Center for Basic Machine Learning Research in Life Science, Department of Computer Science, University of Copenhagen, Copenhagen N, Denmark
- Amelie Stein
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen N, Denmark
- Kresten Lindorff-Larsen
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen N, Denmark

41
Li M. Enhancing protein stability prediction with geometric learning and pre-training strategies. Nat Comput Sci 2024; 4:807-808. [PMID: 39516376 DOI: 10.1038/s43588-024-00724-2] [Indexed: 11/16/2024]
Affiliation(s)
- Minghui Li
  - Department of Bioinformatics, School of Basic Medical Science, Suzhou Medical College of Soochow University, Suzhou, China
  - MOE Key Laboratory of Geriatric Diseases and Immunology, School of Basic Medical Science, Suzhou Medical College of Soochow University, Suzhou, China

42
Xu Y, Liu D, Gong H. Improving the prediction of protein stability changes upon mutations by geometric learning and a pre-training strategy. Nat Comput Sci 2024; 4:840-850. [PMID: 39455825 DOI: 10.1038/s43588-024-00716-2] [Received: 08/29/2023] [Accepted: 10/03/2024] [Indexed: 10/28/2024]
Abstract
Accurate prediction of protein mutation effects is of great importance in protein engineering and design. Here we propose GeoStab-suite, a suite of three geometric learning-based models (GeoFitness, GeoDDG and GeoDTm) for the prediction of fitness score, ΔΔG and ΔTm of a protein upon mutations, respectively. GeoFitness engages a specialized loss function to allow supervised training of a unified model using the large amount of multi-labeled fitness data in the deep mutational scanning database. To further improve the downstream tasks of ΔΔG and ΔTm prediction, the encoder of GeoFitness is reutilized as a pre-trained module in GeoDDG and GeoDTm to overcome the challenge of lacking sufficient labeled data. This pre-training strategy, in combination with data expansion, markedly improves model performance and generalizability. In the benchmark test, GeoDDG and GeoDTm outperform the other state-of-the-art methods by at least 30% and 70%, respectively, in terms of the Spearman correlation coefficient.
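The benchmark metric named in this abstract, the Spearman correlation coefficient, is the Pearson correlation of the ranks of predicted and measured values. The following is a minimal pure-Python sketch of that definition; the function names are chosen here for illustration and are not taken from GeoStab-suite, and tied values receive their average rank.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(predicted, measured):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(predicted), average_ranks(measured))
```

Because only ranks matter, a predictor that orders mutations correctly scores well even if its predicted ΔΔG values are on a different scale from the measurements.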
Affiliation(s)
- Yunxin Xu
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
- Di Liu
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
- Haipeng Gong
  - MOE Key Laboratory of Bioinformatics, School of Life Sciences, Tsinghua University, Beijing, China
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China

43
Fawzy M, Marsh JA. Understanding the heterogeneous performance of variant effect predictors across human protein-coding genes. Sci Rep 2024; 14:26114. [PMID: 39478110 PMCID: PMC11526010 DOI: 10.1038/s41598-024-76202-6] [Received: 06/12/2024] [Accepted: 10/11/2024] [Indexed: 11/02/2024]
Abstract
Variant effect predictors (VEPs) are computational tools developed to assess the impacts of genetic mutations, often in terms of likely pathogenicity, employing diverse algorithms and training data. Here, we investigate the performance of 35 VEPs in the discrimination between pathogenic and putatively benign missense variants across 963 human protein-coding genes. We observe considerable gene-level heterogeneity as measured by the widely used area under the receiver operating characteristic curve (AUROC) metric. To investigate the origins of this heterogeneity and the extent to which gene-level VEP performance is predictable, for each VEP, we train random forest models to predict the gene-level AUROC. We find that performance as measured by AUROC is related to factors such as gene function, protein structure, and evolutionary conservation. Notably, intrinsic disorder in proteins emerged as a significant factor influencing apparent VEP performance, often leading to inflated AUROC values due to their enrichment in weakly conserved putatively benign variants. Our results suggest that gene-level features may be useful for identifying genes where VEP predictions are likely to be more or less reliable. However, our work also shows that AUROC, despite being independent of class balance, still has crucial limitations when used for comparing VEP performance across different genes.
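The gene-level AUROC discussed in this abstract can be illustrated through its rank interpretation: the probability that a randomly chosen pathogenic variant receives a higher score than a randomly chosen benign one. The sketch below is a hedged illustration only. The `(gene, score, label)` record layout and the function names are assumptions made here, and it further assumes that higher scores mean "more likely pathogenic" and that labels are 1 (pathogenic) or 0 (putatively benign).

```python
from collections import defaultdict


def auroc(scores, labels):
    """AUROC via pairwise comparison; ties between classes count as 0.5.

    Returns None when either class is absent, since AUROC is undefined.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        return None
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gene_level_auroc(records):
    """Group (gene, score, label) records by gene and score each gene."""
    by_gene = defaultdict(lambda: ([], []))
    for gene, score, label in records:
        by_gene[gene][0].append(score)
        by_gene[gene][1].append(label)
    return {gene: auroc(s, y) for gene, (s, y) in by_gene.items()}
```

Computing the metric separately per gene, as here, makes the heterogeneity visible: two genes can have very different AUROC values for the same predictor, and genes lacking one of the two classes cannot be scored at all.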
Affiliation(s)
- Mohamed Fawzy
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK
- Joseph A Marsh
  - MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Edinburgh, UK

44
Shao B, Yan J. A long-context language model for deciphering and generating bacteriophage genomes. Nat Commun 2024; 15:9392. [PMID: 39477977 PMCID: PMC11525655 DOI: 10.1038/s41467-024-53759-4] [Received: 02/13/2024] [Accepted: 10/22/2024] [Indexed: 11/02/2024]
Abstract
Inspired by the success of large language models (LLMs), we develop a long-context generative model for genomes. Our multiscale transformer model, megaDNA, is pre-trained on unannotated bacteriophage genomes with nucleotide-level tokenization. We demonstrate the foundational capabilities of our model including the prediction of essential genes, genetic variant effects, regulatory element activity and taxonomy of unannotated sequences. Furthermore, it generates de novo sequences up to 96 K base pairs, which contain potential regulatory elements and annotated proteins with phage-related functions.
Affiliation(s)
- Bin Shao
  - Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, 100081, China
  - Department of Molecular and Cellular Biology, Harvard University, Cambridge, MA, 02138, USA
- Jiawei Yan
  - Independent researcher, 100 N Gushan Rd, Shanghai, 200135, China

45
Lobzaev E, Stracquadanio G. Dirichlet latent modelling enables effective learning and sampling of the functional protein design space. Nat Commun 2024; 15:9309. [PMID: 39468034 PMCID: PMC11519351 DOI: 10.1038/s41467-024-53622-6] [Received: 06/13/2023] [Accepted: 10/14/2024] [Indexed: 10/30/2024]
Abstract
Engineering proteins with desired functions and biochemical properties is pivotal for biotechnology and drug discovery. While computational methods based on evolutionary information are reducing the experimental burden by designing targeted libraries of functional variants, they still have a low success rate when the desired protein has few or very remote homologous sequences. Here we propose an autoregressive model, called Temporal Dirichlet Variational Autoencoder (TDVAE), which exploits the mathematical properties of the Dirichlet distribution and temporal convolution to efficiently learn high-order information from a functionally related, possibly remotely similar, set of sequences. TDVAE is highly accurate in predicting the effects of amino acid mutations, while being 90% smaller than the other state-of-the-art models. We then use TDVAE to design variants of the human alpha galactosidase enzymes as potential treatment for Fabry disease. Our model builds a library of diverse variants which retain sequence, biochemical and structural properties of the wildtype protein, suggesting they could be suitable for enzyme replacement therapy. Taken together, our results show the importance of accurate sequence modelling and the potential of autoregressive models as protein engineering and analysis tools.
Affiliation(s)
- Evgenii Lobzaev
  - School of Biological Sciences, The University of Edinburgh, Edinburgh, United Kingdom
  - School of Informatics, The University of Edinburgh, Edinburgh, United Kingdom

46
Li C, Wang Y, Bai H, Liu M, Cai Y, Zhang Y, Jia Y, Qu J, Zhang S, Du C. Deep neural network provides personalized treatment recommendations for de novo metastatic breast cancer patients. J Cancer 2024; 15:6668-6685. [PMID: 39668839 PMCID: PMC11632994 DOI: 10.7150/jca.101293] [Received: 07/23/2024] [Accepted: 10/20/2024] [Indexed: 12/14/2024]
Abstract
Background: It has long been controversial whether surgery should be performed for de novo metastatic breast cancer (dnMBC). The choice and timing of the primary tumor resection for dnMBC patients need to be individualized, but there was no tool to assist clinicians in decision-making. Methods: A 1:1:2 propensity score matching (PSM) was applied to examine the prognosis of dnMBC patients who underwent neoadjuvant systemic therapy followed by surgery (NS), surgery followed by chemotherapy (SC), and chemotherapy without surgery (CW). Then, two deep feed-forward neural network models were constructed to conduct personalized treatment recommendations. Results: The PSM-adjusted data showed that not all the dnMBC patients could benefit from surgery, and the advantages of NS and SC were different among various subgroups. Patients with stage T1-2 and pathological grade II tumors can be operated on directly, whereas those with stage T3-4, pathological grade III/IV diseases require NS. However, patients with grade I diseases, over 80 years of age, or with brain metastases could not benefit from surgery, regardless of whether they received neoadjuvant systemic therapy. Our deep neural network models exhibited high accuracy on both the training and test sets. One model can assist in deciding whether surgery is required for a dnMBC patient; if surgery is necessary, the other model can determine whether neoadjuvant systemic therapy is needed. Conclusion: This study investigated the prognosis of dnMBC patients, and two artificial intelligence (AI) assisted surgery decision-making models were developed to assist clinicians in delivering precision medicine while improving the survival of dnMBC patients.
Affiliation(s)
- Chaofan Li
  - The Comprehensive Breast Care Center, The Second Affiliated Hospital of Xi'an Jiaotong University, 157 West Fifth Street, Xi'an, Shaanxi, P. R. China
- Yusheng Wang
  - Department of Otolaryngology, The Second Affiliated Hospital of Xi'an Jiaotong University, 157 West Fifth Street, Xi'an, Shaanxi, P. R. China
- Haocheng Bai
  - The Comprehensive Breast Care Center, The Second Affiliated Hospital of Xi'an Jiaotong University, 157 West Fifth Street, Xi'an, Shaanxi, P. R. China
- Mengjie Liu
  - The Comprehensive Breast Care Center, The Second Affiliated Hospital of Xi'an Jiaotong University, 157 West Fifth Street, Xi'an, Shaanxi, P. R. China
- Yifan Cai
  - The Comprehensive Breast Care Center, The Second Affiliated Hospital of Xi'an Jiaotong University, 157 West Fifth Street, Xi'an, Shaanxi, P. R. China
- Yu Zhang
  - The Comprehensive Breast Care Center, The Second Affiliated Hospital of Xi'an Jiaotong University, 157 West Fifth Street, Xi'an, Shaanxi, P. R. China
- Yiwei Jia
  - The Comprehensive Breast Care Center, The Second Affiliated Hospital of Xi'an Jiaotong University, 157 West Fifth Street, Xi'an, Shaanxi, P. R. China
- Jingkun Qu
  - The Comprehensive Breast Care Center, The Second Affiliated Hospital of Xi'an Jiaotong University, 157 West Fifth Street, Xi'an, Shaanxi, P. R. China
- Shuqun Zhang
  - The Comprehensive Breast Care Center, The Second Affiliated Hospital of Xi'an Jiaotong University, 157 West Fifth Street, Xi'an, Shaanxi, P. R. China
- Chong Du
  - The Comprehensive Breast Care Center, The Second Affiliated Hospital of Xi'an Jiaotong University, 157 West Fifth Street, Xi'an, Shaanxi, P. R. China

47
Muir DF, Asper GPR, Notin P, Posner JA, Marks DS, Keiser MJ, Pinney MM. Evolutionary-Scale Enzymology Enables Biochemical Constant Prediction Across a Multi-Peaked Catalytic Landscape. bioRxiv 2024:2024.10.23.619915. [PMID: 39484523 PMCID: PMC11526920 DOI: 10.1101/2024.10.23.619915] [Indexed: 11/03/2024]
Abstract
Quantitatively mapping enzyme sequence-catalysis landscapes remains a critical challenge in understanding enzyme function, evolution, and design. Here, we expand an emerging microfluidic platform to measure catalytic constants (kcat and KM) for hundreds of diverse naturally occurring sequences and mutants of the model enzyme Adenylate Kinase (ADK). This enables us to dissect the sequence-catalysis landscape's topology, navigability, and mechanistic underpinnings, revealing distinct catalytic peaks organized by structural motifs. These results challenge long-standing hypotheses in enzyme adaptation, demonstrating that thermophilic enzymes are not slower than their mesophilic counterparts. Combining the rich representations of protein sequences provided by deep-learning models with our custom high-throughput kinetic data yields semi-supervised models that significantly outperform existing models at predicting catalytic parameters of naturally occurring ADK sequences. Our work demonstrates a promising strategy for dissecting sequence-catalysis landscapes across enzymatic evolution and building family-specific models capable of accurately predicting catalytic constants, opening new avenues for enzyme engineering and functional prediction.
Affiliation(s)
- Duncan F Muir
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
- Program in Biophysics, University of California, San Francisco, San Francisco, CA, USA
| | - Garrison P R Asper
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
| | - Pascal Notin
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Department of Computer Science, University of Oxford, Oxford, UK
| | - Jacob A Posner
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
- Department of Biology, San Francisco State University, San Francisco, CA, USA
| | - Debora S Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Broad Institute of Harvard and MIT, Cambridge, MA, USA
| | - Michael J Keiser
- Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA, USA
- Institute for Neurodegenerative Diseases, University of California, San Francisco, San Francisco, CA, USA
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
| | - Margaux M Pinney
- Department of Biochemistry and Biophysics, University of California San Francisco, San Francisco, CA, USA
- Valhalla Fellow, University of California San Francisco, San Francisco, CA, USA
| |
|
48
|
Marsiglia J, Vaalavirta K, Knight E, Nakamura M, Cong L, Hughes NW. Computationally guided high-throughput engineering of an anti-CRISPR protein for precise genome editing in human cells. CELL REPORTS METHODS 2024; 4:100882. [PMID: 39437714 PMCID: PMC11574282 DOI: 10.1016/j.crmeth.2024.100882] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2023] [Revised: 05/02/2024] [Accepted: 09/23/2024] [Indexed: 10/25/2024]
Abstract
The application of CRISPR-Cas systems to genome editing has revolutionized experimental biology and is an emerging gene and cell therapy modality. CRISPR-Cas systems target off-target regions within the human genome, which is a challenge that must be addressed. Phages have evolved anti-CRISPR proteins (Acrs) to evade CRISPR-Cas-based immunity. Here, we engineer an Acr (AcrIIA4) to increase the precision of CRISPR-Cas-based genome targeting. We developed an approach that leveraged (1) computational guidance, (2) deep mutational scanning, and (3) highly parallel DNA repair measurements within human cells. In a single experiment, ∼10,000 Acr variants were tested. Variants that improved editing precision were tested in additional validation experiments that revealed robust enhancement of gene editing precision and synergy with a high-fidelity version of Cas9. This scalable high-throughput screening framework is a promising methodology to engineer Acrs to increase gene editing precision, which could be used to improve the safety of gene editing-based therapeutics.
Affiliation(s)
- Le Cong
- Department of Pathology, Stanford University School of Medicine, Stanford, CA 94035, USA; Department of Genetics, Stanford University School of Medicine, Stanford, CA 94035, USA
| | | |
|
49
|
Chen WC, Zhou J, McCandlish DM. Density estimation for ordinal biological sequences and its applications. Phys Rev E 2024; 110:044408. [PMID: 39562961 PMCID: PMC11605730 DOI: 10.1103/physreve.110.044408] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Accepted: 10/03/2024] [Indexed: 11/21/2024]
Abstract
Biological sequences do not come at random. Instead, they appear with particular frequencies that reflect properties of the associated system or phenomenon. Knowing how biological sequences are distributed in sequence space is thus a natural first step toward understanding the underlying mechanisms. Here we propose a method for inferring the probability distribution from which a sample of biological sequences were drawn for the case where the sequences are composed of elements that admit a natural ordering. Our method is based on Bayesian field theory, a physics-based machine learning approach, and can be regarded as a nonparametric extension of the traditional maximum entropy estimate. As an example, we use it to analyze the aneuploidy data pertaining to gliomas from The Cancer Genome Atlas project. In addition, we demonstrate two follow-up analyses that can be performed with the resulting probability distribution. One of them is to investigate the associations among the sequence sites. This provides a way to infer the governing biological grammar. The other is to study the global geometry of the probability landscape, which allows us to look at the problem from an evolutionary point of view. It can be seen that this methodology enables us to learn from a sample of sequences about how a biological system or phenomenon in the real world works.
Affiliation(s)
- Wei-Chia Chen
- Department of Physics, National Chung Cheng University, Chiayi 62102, Taiwan, R.O.C
| | - Juannan Zhou
- Department of Biology, University of Florida, Gainesville, Florida 32611, U.S.A
| | - David M. McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York 11724, U.S.A
| |
|
50
|
Valbuena R, Nigam A, Tycko J, Suzuki P, Spees K, Aradhana, Arana S, Du P, Patel RA, Bintu L, Kundaje A, Bassik MC. Prediction and design of transcriptional repressor domains with large-scale mutational scans and deep learning. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.09.21.614253. [PMID: 39386603 PMCID: PMC11463546 DOI: 10.1101/2024.09.21.614253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/12/2024]
Abstract
Regulatory proteins have evolved diverse repressor domains (RDs) to enable precise context-specific repression of transcription. However, our understanding of how sequence variation impacts the functional activity of RDs is limited. To address this gap, we generated a high-throughput mutational scanning dataset measuring the repressor activity of 115,000 variant sequences spanning more than 50 RDs in human cells. We identified thousands of clinical variants with loss or gain of repressor function, including TWIST1 HLH variants associated with Saethre-Chotzen syndrome and MECP2 domain variants associated with Rett syndrome. We also leveraged these data to annotate short linear interacting motifs (SLiMs) that are critical for repression in disordered RDs. Then, we designed a deep learning model called TENet (Transcriptional Effector Network) that integrates sequence, structure and biochemical representations of sequence variants to accurately predict repressor activity. We systematically tested generalization within and across domains with varying homology using the mutational scanning dataset. Finally, we employed TENet within a directed evolution sequence editing framework to tune the activity of both structured and disordered RDs and experimentally test thousands of designs. Our work highlights critical considerations for future dataset design and model training strategies to improve functional variant prioritization and precision design of synthetic regulatory proteins.
|