1
Thornton EL, Boyle JT, Laohakunakorn N, Regan L. Cell-Free Protein Synthesis as a Method to Rapidly Screen Machine Learning-Generated Protease Variants. ACS Synth Biol 2025; 14:1710-1718. [PMID: 40304425 DOI: 10.1021/acssynbio.5c00062] [Indexed: 05/02/2025]
Abstract
Machine learning (ML) tools have revolutionized protein structure prediction, engineering, and design, but the best ML tool is only as good as the training data it learns from. To obtain high-quality structural or functional data, protein purification is typically required, which is both time- and resource-consuming, especially at the scale required to train ML tools. Here, we showcase cell-free protein synthesis as a straightforward and fast tool for screening and scoring the activity of protein variants in ML workflows. We demonstrate the utility of the system by improving the kinetic qualities of a protease. By rapidly screening just 48 random variants to initially sample the fitness landscape, followed by 32 more targeted variants, we identified several protease variants with improved kinetic properties.
Affiliation(s)
- Ella Lucille Thornton
  - Centre for Engineering Biology, Institute of Quantitative Biology, Biochemistry and Biotechnology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3BF, Scotland
- Jeremy T Boyle
  - Centre for Engineering Biology, Institute of Quantitative Biology, Biochemistry and Biotechnology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3BF, Scotland
- Nadanai Laohakunakorn
  - Centre for Engineering Biology, Institute of Quantitative Biology, Biochemistry and Biotechnology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3BF, Scotland
- Lynne Regan
  - Centre for Engineering Biology, Institute of Quantitative Biology, Biochemistry and Biotechnology, School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3BF, Scotland
2
Bazzi S, Sayyad S. Revealing arginine-cysteine and glycine-cysteine NOS linkages by a systematic re-evaluation of protein structures. Commun Chem 2025; 8:146. [PMID: 40360719 PMCID: PMC12075730 DOI: 10.1038/s42004-025-01535-w] [Received: 09/15/2024] [Accepted: 04/23/2025] [Indexed: 05/15/2025]
Abstract
Nitrogen-oxygen-sulfur (NOS) linkages act as allosteric redox switches, modulating enzymatic activity in response to redox fluctuations. While NOS linkages in proteins were once assumed to occur only between lysine and cysteine, our investigation shows that these bonds extend beyond the well-studied lysine-NOS-cysteine examples. By systematically analyzing over 86,000 high-resolution X-ray protein structures, we uncovered 69 additional NOS bonds, including arginine-NOS-cysteine and glycine-NOS-cysteine. Our pipeline integrates machine learning, quantum-mechanical calculations, and high-resolution X-ray crystallographic data to systematically detect these subtle covalent interactions and identify key predictive descriptors for their formation. The discovery of these previously unrecognized linkages broadens the scope of protein chemistry and may enable targeted modulation in drug design and protein engineering. Although our study focuses on NOS linkages, the flexibility of this methodology allows for the investigation of a wide range of chemical bonds and covalent modifications, including structurally resolvable posttranslational modifications (PTMs). By revisiting and re-examining well-established protein models, this work underscores how systematic data-driven approaches can uncover hidden aspects of protein chemistry and inspire deeper insights into protein function and stability.
Affiliation(s)
- Sophia Bazzi
  - Institute of Physical Chemistry, Georg-August University Göttingen, Tammannstraße 6, Göttingen, D-37077, Germany
- Sharareh Sayyad
  - Department of Mathematics and Statistics, Washington State University, Pullman, WA, 99164-3113, USA
  - Mathematical Institute, Georg-August University Göttingen, Bunsenstraße 3-5, Göttingen, 37073, Germany
3
Wu T, Wei W, Gao C, Wu J, Gao C, Chen X, Liu L, Song W. Synthesis of C-N bonds by nicotinamide-dependent oxidoreductase: an overview. Crit Rev Biotechnol 2025; 45:702-726. [PMID: 39229892 DOI: 10.1080/07388551.2024.2390082] [Received: 06/08/2023] [Revised: 11/05/2023] [Accepted: 11/25/2023] [Indexed: 09/05/2024]
Abstract
Compounds containing chiral C-N bonds play a vital role in the composition of biologically active natural products and small pharmaceutical molecules. Therefore, the development of efficient and convenient methods for synthesizing compounds containing chiral C-N bonds is a crucial area of research. Nicotinamide-dependent oxidoreductases (NDOs) emerge as promising biocatalysts for asymmetric synthesis of chiral C-N bonds due to their mild reaction conditions, exceptional stereoselectivity, high atom economy, and environmentally friendly nature. This review aims to present the structural characteristics and catalytic mechanisms of various NDOs, including imine reductases/ketimine reductases, reductive aminases, EneIRED, and amino acid dehydrogenases. Additionally, the review highlights protein engineering strategies employed to modify the stereoselectivity, substrate specificity, and cofactor preference of NDOs. Furthermore, the applications of NDOs in synthesizing essential medicinal chemicals, such as noncanonical amino acids and chiral amine compounds, are extensively examined. Finally, the review outlines future perspectives by addressing challenges and discussing the potential of utilizing NDOs to establish efficient biosynthesis platforms for C-N bond synthesis. In conclusion, NDOs provide an economical, efficient, and environmentally friendly toolbox for asymmetric synthesis of C-N bonds, thus contributing significantly to the field of pharmaceutical chemical development.
Affiliation(s)
- Tianfu Wu
  - State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, China
  - School of Life Sciences and Health Engineering, Jiangnan University, Wuxi, China
- Wanqing Wei
  - State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, China
- Changzheng Gao
  - Department of Cardiology, Affiliated Hospital of Jiangnan University, Wuxi, China
- Jing Wu
  - School of Life Sciences and Health Engineering, Jiangnan University, Wuxi, China
- Cong Gao
  - State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, China
- Xiulai Chen
  - State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, China
- Liming Liu
  - State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, China
- Wei Song
  - School of Life Sciences and Health Engineering, Jiangnan University, Wuxi, China
4
Mroz AM, Basford AR, Hastedt F, Jayasekera IS, Mosquera-Lois I, Sedgwick R, Ballester PJ, Bocarsly JD, Antonio Del Río Chanona E, Evans ML, Frost JM, Ganose AM, Greenaway RL, Kuok Mimi Hii K, Li Y, Misener R, Walsh A, Zhang D, Jelfs KE. Cross-disciplinary perspectives on the potential for artificial intelligence across chemistry. Chem Soc Rev 2025. [PMID: 40278836 PMCID: PMC12024683 DOI: 10.1039/d5cs00146c] [Received: 02/07/2025] [Indexed: 04/26/2025]
Abstract
From accelerating simulations and exploring chemical space, to experimental planning and integrating automation within experimental labs, artificial intelligence (AI) is changing the landscape of chemistry. We are seeing a significant increase in the number of publications leveraging these powerful data-driven insights and models to accelerate all aspects of chemical research. Examples range from how we represent molecules and materials to computer algorithms for predictive and generative models, to the physical mechanisms by which we perform experiments in the lab for automation. Here, we present ten diverse perspectives on the impact of AI coming from those with a range of backgrounds from experimental chemistry, computational chemistry, computer science, engineering and across different areas of chemistry, including drug discovery, catalysis, chemical automation, chemical physics, and materials chemistry. The ten perspectives presented here cover a range of themes, including AI for computation, facilitating discovery, supporting experiments, and enabling technologies for transformation. We highlight and discuss imminent challenges and ways in which we are redefining problems to accelerate the impact of chemical research via AI.
Affiliation(s)
- Austin M Mroz
  - Department of Chemistry, Imperial College London, London W12 0BZ, UK
  - I-X Centre for AI in Science, Imperial College London, London W12 0BZ, UK
- Annabel R Basford
  - Department of Chemistry, Imperial College London, London W12 0BZ, UK
- Friedrich Hastedt
  - Department of Chemical Engineering, Imperial College London, London SW7 2AZ, UK
- Ruby Sedgwick
  - Department of Computing, Imperial College London, London SW7 2AZ, UK
- Pedro J Ballester
  - Department of Bioengineering, Imperial College London, London SW7 2AZ, UK
- Joshua D Bocarsly
  - Department of Chemistry and Texas Center for Superconductivity, University of Houston, Houston, USA
- Matthew L Evans
  - UCLouvain, Institute of Condensed Matter and Nanosciences (IMCN), Chemin des Étoiles 8, Louvain-la-Neuve 1348, Belgium
  - Matgenix SRL, A6K Advanced Engineering Center, Charleroi, Belgium
  - Datalab Industries Ltd, King's Lynn, Norfolk, UK
- Jarvist M Frost
  - Department of Chemistry, Imperial College London, London W12 0BZ, UK
- Alex M Ganose
  - Department of Chemistry, Imperial College London, London W12 0BZ, UK
- Yingzhen Li
  - Department of Computing, Imperial College London, London SW7 2AZ, UK
- Ruth Misener
  - Department of Computing, Imperial College London, London SW7 2AZ, UK
- Aron Walsh
  - Department of Materials, Imperial College London, London SW7 2AZ, UK
- Dandan Zhang
  - I-X Centre for AI in Science, Imperial College London, London W12 0BZ, UK
  - Department of Bioengineering, Imperial College London, London SW7 2AZ, UK
- Kim E Jelfs
  - Department of Chemistry, Imperial College London, London W12 0BZ, UK
5
Sandhu M, Chen JZ, Matthews DS, Spence MA, Pulsford SB, Gall B, Kaczmarski JA, Nichols J, Tokuriki N, Jackson CJ. Computational and Experimental Exploration of Protein Fitness Landscapes: Navigating Smooth and Rugged Terrains. Biochemistry 2025; 64:1673-1684. [PMID: 40132127 DOI: 10.1021/acs.biochem.4c00673] [Indexed: 03/27/2025]
Abstract
Proteins evolve through complex sequence spaces, with fitness landscapes serving as a conceptual framework that links sequence to function. Fitness landscapes can be smooth, where multiple similarly accessible evolutionary paths are available, or rugged, where the presence of multiple local fitness optima complicates evolution and prediction. Indeed, many proteins, especially those with complex functions or under multiple selection pressures, exist on rugged fitness landscapes. Here we discuss the theoretical framework that underpins our understanding of fitness landscapes, alongside recent work that has advanced our understanding, particularly of the biophysical basis for smoothness versus ruggedness. Finally, we address the rapid advances that have been made in computational and experimental exploration and exploitation of fitness landscapes, and how these can identify efficient routes to protein optimization.
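The smooth-versus-rugged distinction in this abstract can be made concrete by counting local fitness optima. The following is a minimal sketch, not taken from the review: it assumes binary genotypes written as strings of "0" and "1", defines a local optimum as a genotype that is fitter than every single-mutation neighbour, and uses two invented toy landscapes (`smooth` and `rugged` are hypothetical names and values).

```python
from itertools import product


def local_optima(fitness):
    """Return genotypes that are strictly fitter than all one-mutation neighbours."""
    optima = []
    for genotype, value in fitness.items():
        # Flip each position in turn to obtain the single-mutation neighbours.
        neighbours = [
            genotype[:i] + ("1" if genotype[i] == "0" else "0") + genotype[i + 1:]
            for i in range(len(genotype))
        ]
        if all(value > fitness[n] for n in neighbours):
            optima.append(genotype)
    return sorted(optima)


# Invented additive landscape: fitness is the number of "1" alleles (smooth).
smooth = {"".join(g): sum(int(b) for b in g) for g in product("01", repeat=3)}

# Invented two-site landscape with reciprocal sign epistasis (rugged):
# each single mutant is worse than both the wild type and the double mutant.
rugged = {"00": 1.0, "01": 0.5, "10": 0.5, "11": 2.0}

smooth_peaks = local_optima(smooth)
rugged_peaks = local_optima(rugged)
```

On the additive landscape every uphill walk ends at the same genotype, whereas on the epistatic one a walk starting at "00" cannot reach the higher peak at "11" through single beneficial mutations.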
Affiliation(s)
- Mahakaran Sandhu
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- John Z Chen
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence in Synthetic Biology, Research School of Biology, Australian National University, Canberra ACT 2601, Australia
- Dana S Matthews
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- Matthew A Spence
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- Sacha B Pulsford
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- Barnabas Gall
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
- Joe A Kaczmarski
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence in Synthetic Biology, Research School of Biology, Australian National University, Canberra ACT 2601, Australia
- James Nichols
  - Biological Data Science Institute, Australian National University, Canberra ACT 2601, Australia
- Nobuhiko Tokuriki
  - Michael Smith Laboratories, University of British Columbia, Vancouver, British Columbia V6T 1Z4, Canada
- Colin J Jackson
  - Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Research School of Chemistry, Australian National University, Canberra ACT 2601, Australia
  - Biological Data Science Institute, Australian National University, Canberra ACT 2601, Australia
  - ARC Centre of Excellence in Synthetic Biology, Research School of Biology, Australian National University, Canberra ACT 2601, Australia
6
Martí-Gómez C, Zhou J, Chen WC, Kinney JB, McCandlish DM. Inference and visualization of complex genotype-phenotype maps with gpmap-tools. bioRxiv 2025:2025.03.09.642267. [PMID: 40161830 PMCID: PMC11952336 DOI: 10.1101/2025.03.09.642267] [Indexed: 04/02/2025]
Abstract
Multiplex assays of variant effect (MAVEs) allow the functional characterization of an unprecedented number of sequence variants in both gene regulatory regions and protein coding sequences. This has enabled the study of nearly complete combinatorial libraries of mutational variants and revealed the widespread influence of higher-order genetic interactions that arise when multiple mutations are combined. However, the lack of appropriate tools for exploratory analysis of these high-dimensional data limits our overall understanding of the main qualitative properties of complex genotype-phenotype maps. To fill this gap, we have developed gpmap-tools (https://github.com/cmarti/gpmap-tools), a Python library that integrates Gaussian process models for inference, phenotypic imputation, and error estimation from incomplete and noisy MAVE data and collections of natural sequences, together with methods for summarizing patterns of higher-order epistasis and non-linear dimensionality reduction techniques that allow visualization of genotype-phenotype maps containing up to millions of genotypes. Here, we used gpmap-tools to study the genotype-phenotype map of the Shine-Dalgarno sequence, a motif that modulates binding of the 16S rRNA to the 5' untranslated region (UTR) of mRNAs through base pair complementarity during translation initiation in prokaryotes. We inferred full combinatorial landscapes containing 262,144 different sequences from the sequences of 5,311 5'UTRs in the E. coli genome and from experimental MAVE data. Visualizations of the inferred landscapes were largely consistent with each other, and unveiled a simple molecular mechanism underlying the highly epistatic genotype-phenotype map of the Shine-Dalgarno sequence.
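The landscape size quoted in this abstract matches a full combinatorial space over four nucleotides at nine positions, since 4^9 = 262,144. The sketch below only illustrates that arithmetic and the neighbourhood structure of such a space. It does not use gpmap-tools, and the motif length of nine is an assumption inferred from the count rather than stated in the abstract; the function names and the example sequence are hypothetical.

```python
from itertools import product

ALPHABET = "ACGU"   # RNA nucleotides
MOTIF_LENGTH = 9    # assumption: inferred from 4**9 == 262,144


def enumerate_genotypes(length, alphabet=ALPHABET):
    """Lazily yield every sequence of the given length over the alphabet."""
    return ("".join(chars) for chars in product(alphabet, repeat=length))


def hamming_neighbours(seq, alphabet=ALPHABET):
    """Return all sequences that differ from seq at exactly one position."""
    return [
        seq[:i] + base + seq[i + 1:]
        for i, current in enumerate(seq)
        for base in alphabet
        if base != current
    ]


n_genotypes = sum(1 for _ in enumerate_genotypes(MOTIF_LENGTH))
n_neighbours = len(hamming_neighbours("AGGAGGUAA"))
```

Each genotype has 9 x 3 = 27 single-mutation neighbours, which is the adjacency that landscape visualizations of this kind are typically built on.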
Affiliation(s)
- Carlos Martí-Gómez
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- Juannan Zhou
  - Department of Biology, University of Florida, Gainesville, FL, 32611
- Wei-Chia Chen
  - Department of Physics, National Chung Cheng University, Chiayi 62102, Taiwan, Republic of China
- Justin B Kinney
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
- David M McCandlish
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724
7
Pržulj N, Malod-Dognin N. Simplicity within biological complexity. Bioinformatics Advances 2025; 5:vbae164. [PMID: 39927291 PMCID: PMC11805345 DOI: 10.1093/bioadv/vbae164] [Received: 05/15/2024] [Revised: 10/01/2024] [Accepted: 10/23/2024] [Indexed: 02/11/2025]
Abstract
Motivation: Heterogeneous, interconnected, systems-level, molecular (multi-omic) data have become increasingly available and key in precision medicine. We need to utilize them to better stratify patients into risk groups, discover new biomarkers and targets, and repurpose known and discover new drugs to personalize medical treatment. Existing methodologies are limited and a paradigm shift is needed to achieve quantitative and qualitative breakthroughs.
Results: In this perspective paper, we survey the literature and argue for the development of a comprehensive, general framework for embedding of multi-scale molecular network data that would enable their explainable exploitation in precision medicine in linear time. Network embedding methods (also called graph representation learning) map nodes to points in low-dimensional space, so that proximity in the learned space reflects the network's topology-function relationships. They have recently achieved unprecedented performance on hard problems of utilizing few omic data in various biomedical applications. However, research thus far has been limited to special variants of the problems and data, with the performance depending on the underlying topology-function network biology hypotheses, the biomedical applications, and evaluation metrics. The availability of multi-omic data, modern graph embedding paradigms and compute power call for the creation and training of efficient, explainable and controllable models that have no potentially dangerous, unexpected behaviour and that make a qualitative breakthrough. We propose to develop a general, comprehensive embedding framework for multi-omic network data, from models to efficient and scalable software implementation, and to apply it to biomedical informatics, focusing on precision medicine and personalized drug discovery. It will lead to a paradigm shift in the computational and biomedical understanding of data and diseases that will open up ways to solve some of the major bottlenecks in precision medicine and other domains.
Affiliation(s)
- Nataša Pržulj
  - Computational Biology Department, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, 00000, United Arab Emirates
  - Barcelona Supercomputing Center, Barcelona 08034, Spain
  - Department of Computer Science, University College London, London WC1E6BT, United Kingdom
  - ICREA, Pg. Lluís Companys 23, Barcelona 08010, Spain
8
Landwehr GM, Bogart JW, Magalhaes C, Hammarlund EG, Karim AS, Jewett MC. Accelerated enzyme engineering by machine-learning guided cell-free expression. Nat Commun 2025; 16:865. [PMID: 39833164 PMCID: PMC11747319 DOI: 10.1038/s41467-024-55399-0] [Received: 07/30/2024] [Accepted: 12/09/2024] [Indexed: 01/22/2025]
Abstract
Enzyme engineering is limited by the challenge of rapidly generating and using large datasets of sequence-function relationships for predictive design. To address this challenge, we develop a machine learning (ML)-guided platform that integrates cell-free DNA assembly, cell-free gene expression, and functional assays to rapidly map fitness landscapes across protein sequence space and optimize enzymes for multiple, distinct chemical reactions. We apply this platform to engineer amide synthetases by evaluating substrate preference for 1217 enzyme variants in 10,953 unique reactions. We use these data to build augmented ridge regression ML models for predicting amide synthetase variants capable of making 9 small molecule pharmaceuticals. Over these nine compounds, ML-predicted enzyme variants demonstrate 1.6- to 42-fold improved activity relative to the parent. Our ML-guided, cell-free framework promises to accelerate enzyme engineering by enabling iterative exploration of protein sequence space to build specialized biocatalysts in parallel.
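To illustrate the kind of sequence-to-function regression this abstract mentions, here is a minimal sketch of plain ridge regression on one-hot encoded variants. It is not the authors' augmented ridge model: the one-hot encoding, the closed-form solver, the three-residue sequences, and the activity values are all assumptions invented for illustration.

```python
import numpy as np

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(seq):
    """Flatten a sequence into a (len(seq) * 20)-dimensional 0/1 vector."""
    x = np.zeros(len(seq) * len(AMINO_ACIDS))
    for pos, aa in enumerate(seq):
        x[pos * len(AMINO_ACIDS) + AMINO_ACIDS.index(aa)] = 1.0
    return x


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge fit on mean-centred targets; returns (weights, intercept)."""
    intercept = y.mean()
    gram = X.T @ X + alpha * np.eye(X.shape[1])
    weights = np.linalg.solve(gram, X.T @ (y - intercept))
    return weights, intercept


# Invented toy data: three-residue variants with made-up activities.
train = {"AAA": 1.0, "CAA": 2.0, "ACA": 1.5, "AAC": 0.5}
X = np.array([one_hot(s) for s in train])
y = np.array(list(train.values()))
weights, intercept = fit_ridge(X, y, alpha=1.0)

# Score unseen combinations of the single mutations and pick the best one.
candidates = ["CCA", "CAC"]
scores = {s: float(one_hot(s) @ weights + intercept) for s in candidates}
best = max(scores, key=scores.get)
```

In this toy setting the model ranks "CCA", which combines the two beneficial substitutions, above "CAC", which pairs a beneficial one with a deleterious one. The ridge penalty `alpha` shrinks per-residue effects toward zero, which matters when variants are few relative to features.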
Affiliation(s)
- Grant M Landwehr
  - Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, USA
  - Center for Synthetic Biology, Northwestern University, Evanston, IL, USA
- Jonathan W Bogart
  - Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, USA
  - Center for Synthetic Biology, Northwestern University, Evanston, IL, USA
- Carol Magalhaes
  - Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, USA
  - Center for Synthetic Biology, Northwestern University, Evanston, IL, USA
- Eric G Hammarlund
  - Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, USA
  - Center for Synthetic Biology, Northwestern University, Evanston, IL, USA
- Ashty S Karim
  - Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, USA
  - Center for Synthetic Biology, Northwestern University, Evanston, IL, USA
- Michael C Jewett
  - Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL, USA
  - Center for Synthetic Biology, Northwestern University, Evanston, IL, USA
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
9
Yang J, Lal RG, Bowden JC, Astudillo R, Hameedi MA, Kaur S, Hill M, Yue Y, Arnold FH. Active learning-assisted directed evolution. Nat Commun 2025; 16:714. [PMID: 39821082 PMCID: PMC11739421 DOI: 10.1038/s41467-025-55987-8] [Received: 08/14/2024] [Accepted: 01/02/2025] [Indexed: 01/19/2025]
Abstract
Directed evolution (DE) is a powerful tool to optimize protein fitness for a specific application. However, DE can be inefficient when mutations exhibit non-additive, or epistatic, behavior. Here, we present Active Learning-assisted Directed Evolution (ALDE), an iterative machine learning-assisted DE workflow that leverages uncertainty quantification to explore the search space of proteins more efficiently than current DE methods. We apply ALDE to an engineering landscape that is challenging for DE: optimization of five epistatic residues in the active site of an enzyme. In three rounds of wet-lab experimentation, we improve the yield of a desired product of a non-native cyclopropanation reaction from 12% to 93%. We also perform computational simulations on existing protein sequence-fitness datasets to support our argument that ALDE can be more effective than DE. Overall, ALDE is a practical and broadly applicable strategy to unlock improved protein engineering outcomes.
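One common way to turn uncertainty estimates into a choice of variants for the next round is an upper-confidence-bound (UCB) rule. The sketch below shows that rule only as an illustration: the abstract does not say which acquisition function ALDE uses, and the function name `ucb_select`, the `beta` trade-off parameter, and the numbers are hypothetical. The search-space size assumes all 20 canonical amino acids are allowed at each of the five sites.

```python
import numpy as np

N_SITES = 5
N_AMINO_ACIDS = 20  # assumption: every canonical amino acid allowed per site
search_space_size = N_AMINO_ACIDS ** N_SITES


def ucb_select(mean, std, batch_size, beta=2.0):
    """Pick the indices with the highest mean + beta * std (ties keep input order)."""
    scores = np.asarray(mean, dtype=float) + beta * np.asarray(std, dtype=float)
    order = np.argsort(-scores, kind="stable")
    return order[:batch_size].tolist()


# Invented predictions for three candidate variants.
mean = [0.9, 0.5, 0.2]   # predicted fitness
std = [0.0, 0.3, 0.1]    # predictive uncertainty

exploit_only = ucb_select(mean, std, batch_size=2, beta=0.0)
explore = ucb_select(mean, std, batch_size=2, beta=2.0)
```

With `beta=0.0` the rule reduces to greedy selection on predicted fitness. A larger `beta` promotes the uncertain second candidate to the top of the batch, which is how uncertainty can steer sampling toward unexplored regions of an epistatic landscape.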
Affiliation(s)
- Jason Yang
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
- Ravi G Lal
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
- James C Bowden
  - Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA, USA
  - Computer Science, University of California-Berkeley, Berkeley, CA, USA
- Raul Astudillo
  - Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA, USA
- Mikhail A Hameedi
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
- Matthew Hill
  - Elegen Corp, 1300 Industrial Road #16, San Carlos, CA, USA
- Yisong Yue
  - Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, CA, USA
- Frances H Arnold
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, CA, USA
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA
10
Wu Y, Xie X, Zhu J, Guan L, Li M. Overview and Prospects of DNA Sequence Visualization. Int J Mol Sci 2025; 26:477. [PMID: 39859192 PMCID: PMC11764684 DOI: 10.3390/ijms26020477] [Received: 11/19/2024] [Revised: 12/30/2024] [Accepted: 01/04/2025] [Indexed: 01/27/2025]
Abstract
Due to advances in big data technology, deep learning, and knowledge engineering, biological sequence visualization has been extensively explored. In the post-genome era, biological sequence visualization enables the visual representation of both structured and unstructured biological sequence data. However, a universal visualization method for all types of sequences has not been reported. Biological sequence data are expanding exponentially, and the acquisition, extraction, fusion, and inference of knowledge from biological sequences are critical supporting technologies for visualization research. These areas are important and require in-depth exploration. This paper presents a comprehensive overview of visualization methods for DNA sequences from four different perspectives (two-dimensional, three-dimensional, four-dimensional, and dynamic visualization approaches) and discusses the strengths and limitations of each method in detail. Furthermore, this paper proposes two potential future research directions for biological sequence visualization in response to the challenges of inefficient graphical feature extraction and knowledge association network generation in existing methods. The first direction is the construction of knowledge graphs for biological sequence big data, and the second direction is the cross-modal visualization of biological sequences using machine learning methods. This review is anticipated to provide valuable insights and contributions to computational biology, bioinformatics, genomic computing, genetic breeding, evolutionary analysis, and other related disciplines in the fields of biology, medicine, chemistry, statistics, and computing. It has important reference value for biological sequence recommendation systems and knowledge question answering systems.
Affiliation(s)
- Mengshan Li
  - School of Mathematics and Computer Science, Gannan Normal University, Ganzhou 341000, China; (Y.W.); (X.X.); (J.Z.); (L.G.)
11
Li Y, Li F, Duan Z, Liu R, Jiao W, Wu H, Zhu F, Xue W. SYNBIP 2.0: epitopes mapping, sequence expansion and scaffolds discovery for synthetic binding protein innovation. Nucleic Acids Res 2025; 53:D595-D603. [PMID: 39413165 PMCID: PMC11701522 DOI: 10.1093/nar/gkae893] [Received: 08/19/2024] [Revised: 09/18/2024] [Accepted: 09/26/2024] [Indexed: 10/18/2024]
Abstract
Synthetic binding proteins (SBPs) represent a pivotal class of artificially engineered proteins, meticulously crafted to exhibit targeted binding properties and specific functions. Here, the SYNBIP database, a comprehensive resource for SBPs, has been significantly updated. These enhancements include (i) featuring 3D structures of 899 SBP-target complexes to illustrate the binding epitopes of SBPs; (ii) expanding the sequence space of SBPs fivefold to 12 025, starting from the structures of SBPs in monomer form or in complex with target proteins and integrating a structure-based protein generation framework with a protein property prediction tool; (iii) offering detailed information on 78 473 newly identified SBP-like scaffolds from the RCSB Protein Data Bank, and an additional 16 401 555 from the AlphaFold Protein Structure Database; and (iv) regularly updating the database, incorporating 153 new SBPs. Furthermore, the structural models of all SBPs have been enhanced through the application of AlphaFold2, with their clinical statuses concurrently refreshed. Additionally, the design methods employed for each SBP are now prominently featured in the database. In sum, SYNBIP 2.0 is designed to provide researchers with essential SBP data, facilitating their innovation in research, diagnosis and therapy. SYNBIP 2.0 is now freely accessible at https://idrblab.org/synbip/.
Affiliation(s)
- Yanlin Li
  - Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, No. 55 South University Town Road, High-tech Zone, Chongqing 401331, China
- Fengcheng Li
  - Children’s Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, 3333 Binsheng Road, Hangzhou, Zhejiang 310052, China
  - College of Pharmaceutical Sciences, Zhejiang University, 866 Yuhangtang Road, Hangzhou, Zhejiang 310058, China
- Zixin Duan
  - Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, No. 55 South University Town Road, High-tech Zone, Chongqing 401331, China
- Ruihan Liu
  - Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, No. 55 South University Town Road, High-tech Zone, Chongqing 401331, China
- Wantong Jiao
  - Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, No. 55 South University Town Road, High-tech Zone, Chongqing 401331, China
- Haibo Wu
  - School of Life Sciences, Chongqing University, No. 55 South University Town Road, High-tech Zone, Chongqing 401331, China
- Feng Zhu
  - College of Pharmaceutical Sciences, Zhejiang University, 866 Yuhangtang Road, Hangzhou, Zhejiang 310058, China
- Weiwei Xue
  - Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, No. 55 South University Town Road, High-tech Zone, Chongqing 401331, China
12
|
Zhang Z, Li Z, Wang Q, Wu H, Yang M, Zhao F, Tan M, Han S. A protein fitness predictive framework based on feature combination and intelligent searching. Protein Sci 2024; 33:e5211. [PMID: 39548358 PMCID: PMC11567853 DOI: 10.1002/pro.5211] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2024] [Revised: 09/14/2024] [Accepted: 10/22/2024] [Indexed: 11/17/2024]
Abstract
Machine learning (ML) constructs predictive models by learning the relationship between protein sequences and their functions, enabling efficient identification of protein sequences with high fitness values without becoming trapped in local optima, as directed evolution can. However, extracting the most pertinent functional feature information from a limited number of protein sequences is vital for optimizing the performance of ML models. Here, we propose scut_ProFP (Protein Fitness Predictor), a predictive framework that integrates feature combination and feature selection techniques. Feature combination offers comprehensive sequence information, while feature selection searches for the most beneficial features to enhance model performance, enabling accurate sequence-to-function mapping. Compared to similar frameworks, scut_ProFP demonstrates superior performance and is also competitive with more complex deep learning models: ECNet, EVmutation, and UniRep. In addition, scut_ProFP enables generalization from low-order mutants to high-order mutants. Finally, we utilized scut_ProFP to simulate the engineering of the fluorescent protein CreiLOV and strongly enriched for highly fluorescent mutants starting from only a small number of low-fluorescence mutants. Essentially, the developed method is advantageous for ML in protein engineering, providing an effective approach to data-driven protein engineering. The code and datasets for scut_ProFP are available at https://github.com/Zhang66-star/scut_ProFP.
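The two ideas named in this abstract, feature combination and feature selection, can be illustrated with a minimal sketch. It is not the scut_ProFP implementation: the toy alphabet, the choice of encodings, the correlation-based filter (swapped in here for the paper's search procedure, which the abstract does not specify), and all function names are assumptions made for illustration.

```python
from statistics import mean, pstdev

ALPHABET = "ADGKL"  # toy alphabet; a real encoder would cover all 20 residues
# Kyte-Doolittle hydropathy values for the toy alphabet.
HYDROPATHY = {"A": 1.8, "D": -3.5, "G": -0.4, "K": -3.9, "L": 3.8}


def one_hot(seq):
    """Flattened one-hot encoding: len(ALPHABET) indicator values per position."""
    return [1.0 if aa == letter else 0.0 for aa in seq for letter in ALPHABET]


def hydropathy(seq):
    """One physicochemical descriptor per position."""
    return [HYDROPATHY[aa] for aa in seq]


def combine_features(seq):
    """Feature combination: concatenate the two encodings into one vector."""
    return one_hot(seq) + hydropathy(seq)


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    sx, sy = pstdev(xs), pstdev(ys)
    if sx == 0.0 or sy == 0.0:
        return 0.0
    mx, my = mean(xs), mean(ys)
    cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / (sx * sy)


def select_top_k(X, y, k):
    """Filter-style feature selection: indices of the k columns most correlated with y."""
    columns = list(zip(*X))
    scores = [abs(pearson(col, y)) for col in columns]
    ranked = sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical toy data: only the middle position varies, and the made-up
# fitness values track hydropathy at that position.
seqs = ["AKL", "ADL", "AGL", "AAL", "ALL"]
fitness = [-3.9, -3.5, -0.4, 1.8, 3.8]
X = [combine_features(s) for s in seqs]
selected = select_top_k(X, fitness, k=1)
print(selected)
```

With this toy data the filter keeps the hydropathy feature of the middle position (column 16 of the 18 combined features), because no single one-hot indicator correlates with the fitness values as well.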
Affiliation(s)
- Zhihui Zhang
  - Guangdong Key Laboratory of Fermentation and Enzyme Engineering, School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
- Zhixuan Li
  - Guangdong Key Laboratory of Fermentation and Enzyme Engineering, School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
- Qianyue Wang
  - School of Software Engineering, South China University of Technology, Guangzhou, China
- Hanlin Wu
  - School of Software Engineering, South China University of Technology, Guangzhou, China
- Manli Yang
  - Guangdong Key Laboratory of Fermentation and Enzyme Engineering, School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
- Fengguang Zhao
  - School of Light Industry and Engineering, South China University of Technology, Guangzhou, China
- Mingkui Tan
  - School of Software Engineering, South China University of Technology, Guangzhou, China
- Shuangyan Han
  - Guangdong Key Laboratory of Fermentation and Enzyme Engineering, School of Biology and Biological Engineering, South China University of Technology, Guangzhou, China
13
Blaabjerg LM, Jonsson N, Boomsma W, Stein A, Lindorff-Larsen K. SSEmb: A joint embedding of protein sequence and structure enables robust variant effect predictions. Nat Commun 2024; 15:9646. [PMID: 39511177 PMCID: PMC11544099 DOI: 10.1038/s41467-024-53982-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2024] [Accepted: 10/28/2024] [Indexed: 11/15/2024] [Open Access]
Abstract
The ability to predict how amino acid changes affect proteins has a wide range of applications, including disease variant classification and protein engineering. Many existing methods focus on learning from patterns found in either protein sequences or protein structures. Here, we present a method for integrating information from sequence and structure in a single model that we term SSEmb (Sequence Structure Embedding). SSEmb combines a graph representation for the protein structure with a transformer model for processing multiple sequence alignments. We show that by integrating both types of information we obtain a variant effect prediction model that is robust when sequence information is scarce. We also show that SSEmb learns embeddings of the sequence and structure that are useful for other downstream tasks, such as predicting protein-protein binding sites. We envisage that SSEmb may be useful both for variant effect predictions and as a representation for learning to predict protein properties that depend on sequence and structure.
Affiliation(s)
- Lasse M Blaabjerg
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen N, Denmark
- Nicolas Jonsson
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen N, Denmark
- Wouter Boomsma
  - Center for Basic Machine Learning Research in Life Science, Department of Computer Science, University of Copenhagen, Copenhagen N, Denmark
- Amelie Stein
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen N, Denmark
- Kresten Lindorff-Larsen
  - Linderstrøm-Lang Centre for Protein Science, Department of Biology, University of Copenhagen, Copenhagen N, Denmark
14
Sangisetti BR, Pabboju S. Deep fit_predic: a novel integrated pyramid dilation EfficientNet-B3 scheme for fitness prediction system. Comput Methods Biomech Biomed Engin 2024; 27:2009-2023. [PMID: 37865927 DOI: 10.1080/10255842.2023.2269287] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/28/2023] [Revised: 09/07/2023] [Accepted: 10/05/2023] [Indexed: 10/24/2023]
Abstract
This study introduces novel deep learning (DL) techniques for effective fitness prediction using a person's health data. First, the data are pre-processed by data cleaning, one-hot encoding and data normalization. The pre-processed data are then fed into the feature selection stage, where the useful features are extracted using the enhanced chameleon swarm (ECham-Sw) optimization technique. Then, a clustering process is performed using Minkowski integrated gravity center clustering (Min-GCC) to cluster the health profiles of each individual. Finally, the Pyramid Dilated EfficientNet-B3 (PyDi-EfficientNet-B3) technique is proposed to predict the fitness of each individual efficiently, with an enhanced accuracy of 99.8%.
Affiliation(s)
- Bhagya Rekha Sangisetti
  - Department of Computer Science & Engineering, University College of Engineering, Osmania University, Hyderabad, Telangana, India
  - Department of Computer Science & Engineering, Anurag University, Hyderabad, Telangana, India
- Suresh Pabboju
  - Department of Information Technology, Chaitanya Bharathi Institute of Technology, Hyderabad, Telangana, India
15
Zhang P, Wei L, Li J, Wang X. Artificial intelligence-guided strategies for next-generation biological sequence design. Natl Sci Rev 2024; 11:nwae343. [PMID: 39606146 PMCID: PMC11601974 DOI: 10.1093/nsr/nwae343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2024] [Revised: 09/20/2024] [Accepted: 09/25/2024] [Indexed: 11/29/2024] [Open Access]
Affiliation(s)
- Pengcheng Zhang
  - Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, China
- Lei Wei
  - Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, China
- Jiaqi Li
  - Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, China
- Xiaowo Wang
  - Ministry of Education Key Laboratory of Bioinformatics, Center for Synthetic and Systems Biology, Bioinformatics Division at the Beijing National Research Center for Information Science and Technology, Department of Automation, Tsinghua University, China
16
Hilvert D. Spiers Memorial Lecture: Engineering biocatalysts. Faraday Discuss 2024; 252:9-28. [PMID: 39046423 PMCID: PMC11389855 DOI: 10.1039/d4fd00139g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2024] [Accepted: 06/26/2024] [Indexed: 07/25/2024]
Abstract
Enzymes are being engineered to catalyze chemical reactions for many practical applications in chemistry and biotechnology. The approaches used are surveyed in this short review, emphasizing methods for accessing reactivities not expressed by native protein scaffolds. The successful generation of completely de novo enzymes that rival the rates and selectivities of their natural counterparts highlights the potential role that designer enzymes may play in the coming years in research, industry, and medicine. Some challenges that need to be addressed to realize this ambitious dream are considered together with possible solutions.
Affiliation(s)
- Donald Hilvert
  - Laboratory of Organic Chemistry, ETH Zürich, 8093 Zürich, Switzerland
17
Hollmann F, Sanchis J, Reetz MT. Learning from Protein Engineering by Deconvolution of Multi-Mutational Variants. Angew Chem Int Ed Engl 2024; 63:e202404880. [PMID: 38884594 DOI: 10.1002/anie.202404880] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/11/2024] [Revised: 06/05/2024] [Accepted: 06/06/2024] [Indexed: 06/18/2024]
Abstract
This review analyzes a development in biochemistry, enzymology and biotechnology that originally came as a surprise. Following the establishment of directed evolution of stereoselective enzymes in organic chemistry, the concept of partial or complete deconvolution of selective multi-mutational variants was introduced. Early deconvolution experiments of stereoselective variants led to the finding that mutations can interact cooperatively or antagonistically with one another, not just additively. During the past decade, this phenomenon was shown to be general. In some studies, molecular dynamics (MD) and quantum mechanics/molecular mechanics (QM/MM) computations were performed in order to shed light on the origin of non-additivity at all stages of an evolutionary upward climb. Data of complete deconvolution can be used to construct unique multi-dimensional rugged fitness pathway landscapes, which provide mechanistic insights different from traditional fitness landscapes. Along a related line, biochemists have long tested the result of introducing two point mutations in an enzyme for mechanistic reasons, followed by a comparison of the respective double mutant in so-called double mutant cycles, which originally showed only additive effects, but more recently also uncovered cooperative and antagonistic non-additive effects. We conclude with suggestions for future work, and call for a unified overall picture of non-additivity and epistasis.
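The distinction between additive, cooperative and antagonistic mutational effects described in this abstract can be made concrete with a double mutant cycle. The sketch below is illustrative only: it assumes all measurements are on an additive scale (for example free energies or log-transformed activities), that larger values are better, and that `tol` is a hypothetical noise threshold; none of these choices comes from the review.

```python
def epistasis(f_wt, f_a, f_b, f_ab):
    """Deviation of the double mutant from the additive expectation.

    All inputs must be on an additive scale, for example free energies
    or log-transformed activities.
    """
    expected_ab = f_wt + (f_a - f_wt) + (f_b - f_wt)
    return f_ab - expected_ab


def classify(eps, tol=0.1):
    """Label the interaction; tol is a hypothetical noise threshold."""
    if abs(eps) <= tol:
        return "additive"
    return "cooperative" if eps > 0 else "antagonistic"


# Made-up values for wild type, single mutants A and B, and double mutant AB.
for f_ab in (1.5, 2.5, 0.5):
    eps = epistasis(f_wt=0.0, f_a=1.0, f_b=0.5, f_ab=f_ab)
    print(f_ab, eps, classify(eps))
```

Complete deconvolution of a variant with n mutations applies the same comparison to every subset of the mutations, which requires measuring all 2^n combinations.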
Affiliation(s)
- Frank Hollmann
  - Department of Biotechnology, Delft University of Technology, Van der Maasweg 9, 2629HZ, Delft, Netherlands
- Joaquin Sanchis
  - Monash Institute of Pharmaceutical Sciences, Monash University, Parkville, Victoria, 3052, Australia
- Manfred T Reetz
  - Max-Planck-Institut für Kohlenforschung, Kaiser-Wilhelm-Platz 1, 45481, Mülheim, Germany
  - Tianjin Institute of Industrial Biotechnology, Chinese Academy of Sciences, Tianjin, 300308, China
18
Yadalam PK, Ramadoss R, Anegundi RV. HyperAttention and Linformer-Based β-catenin Sequence Prediction For Bone Formation. Cureus 2024; 16:e68849. [PMID: 39376879 PMCID: PMC11456985 DOI: 10.7759/cureus.68849] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2024] [Accepted: 09/07/2024] [Indexed: 10/09/2024] [Open Access]
Abstract
Introduction: Beta (β)-catenin, a pivotal protein in bone development and homeostasis, is implicated in various bone disorders. Peptide-based therapeutics offer a promising approach due to their specificity and potential for reduced side effects. Attention networks, specifically sequence-to-sequence models, are widely used for peptide sequence prediction. Hence, the current study aims to develop a HyperAttention and informatics-based β-catenin sequence prediction for bone formation. Methods: β-catenin protein sequences were downloaded and quality-checked using UniProt and FASTA sequences using DeepBio (Deep Bio Inc., Seoul, South Korea) for predictive analysis. Data were analyzed for duplicates, outliers, and missing values. The data were then split into training and testing sets, with 80% used for training and 20% for testing, and peptide sequences were encoded and subjected to the algorithms. Results: The HyperAttention and Linformer models perform well in sequence prediction, with HyperAttention correctly predicting 87% of instances and Linformer 89%. Both models have high sensitivity and specificity, with Linformer correctly identifying 91% of negative instances and showing slightly better sensitivity. Conclusion: The HyperAttention and Linformer models effectively predict peptide sequences with high specificity and sensitivity. Further optimization and development are needed for optimal application and balance between positive and negative instances.
Affiliation(s)
- Pradeep Kumar Yadalam
  - Periodontics, Saveetha Dental College, Saveetha Institute of Medical and Technical Sciences (SIMATS) Deemed University, Chennai, IND
- Ramya Ramadoss
  - Oral Pathology and Oral Biology, Saveetha Dental College, Saveetha Institute of Medical and Technical Sciences (SIMATS) Deemed University, Chennai, IND
- Raghavendra Vamsi Anegundi
  - Periodontics, Saveetha Dental College, Saveetha Institute of Medical and Technical Sciences (SIMATS) Deemed University, Chennai, IND
19
Lian X, Praljak N, Subramanian SK, Wasinger S, Ranganathan R, Ferguson AL. Deep-learning-based design of synthetic orthologs of SH3 signaling domains. Cell Syst 2024; 15:725-737.e7. [PMID: 39106868 PMCID: PMC11879475 DOI: 10.1016/j.cels.2024.07.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Revised: 11/12/2023] [Accepted: 07/22/2024] [Indexed: 08/09/2024]
Abstract
Evolution-based deep generative models represent an exciting direction in understanding and designing proteins. An open question is whether such models can learn specialized functional constraints that control fitness in specific biological contexts. Here, we examine the ability of generative models to produce synthetic versions of Src-homology 3 (SH3) domains that mediate signaling in the Sho1 osmotic stress response pathway of yeast. We show that a variational autoencoder (VAE) model produces artificial sequences that experimentally recapitulate the function of natural SH3 domains. More generally, the model organizes all fungal SH3 domains such that locality in the model latent space (but not simply locality in sequence space) enriches the design of synthetic orthologs and exposes non-obvious amino acid constraints distributed near and far from the SH3 ligand-binding site. The ability of generative models to design ortholog-like functions in vivo opens new avenues for engineering protein function in specific cellular contexts and environments.
Affiliation(s)
- Xinran Lian
  - Department of Chemistry, University of Chicago, Chicago, IL 60637, USA
- Nikša Praljak
  - Graduate Program in Biophysical Sciences, University of Chicago, Chicago, IL 60637, USA
- Subu K Subramanian
  - Department of Molecular and Cell Biology, California Institute for Quantitative Biosciences (QB3), and Howard Hughes Medical Institute, University of California, Berkeley, Berkeley, CA 94720, USA
- Sarah Wasinger
  - Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA
- Rama Ranganathan
  - Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA
  - Center for Physics of Evolving Systems and Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, IL 60637, USA
- Andrew L Ferguson
  - Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA
20
Lewis JC. Identifying and Engineering Flavin Dependent Halogenases for Selective Biocatalysis. Acc Chem Res 2024; 57:2067-2079. [PMID: 39038085 PMCID: PMC11309780 DOI: 10.1021/acs.accounts.4c00172] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/24/2024]
Abstract
Organohalogen compounds are extensively used as building blocks, intermediates, pharmaceuticals, and agrochemicals due to their unique chemical and biological properties. Installing halogen substituents, however, frequently requires functionalized starting materials and multistep functional group interconversion. Several classes of halogenases evolved in nature to enable halogenation of different classes of substrates; for example, site-selective halogenation of electron rich aromatic compounds is catalyzed by flavin-dependent halogenases (FDHs). Mechanistic studies have shown that these enzymes use FADH2 to reduce O2 to water with concomitant oxidation of X- to HOX (X = Cl, Br, I). This species travels through a tunnel within the enzyme to access the FDH active site. Here, it is believed to interact with an active site lysine proximal to bound substrate, enabling electrophilic halogenation with selectivity imparted via molecular recognition, rather than directing groups or strong electronic activation. Although the unique selectivity of FDHs led to several early biocatalysis efforts, preparative halogenation was rare, and the hallmark catalyst-controlled selectivity of FDHs did not translate to non-native substrates. FDH engineering was limited to site-directed mutagenesis, which resulted in modest changes in site-selectivity or substrate preference. To address these limitations, we optimized expression conditions for the FDH RebH and its cognate flavin reductase (FRed), RebF. We then showed that RebH could be used for preparative halogenation of non-native substrates with catalyst-controlled selectivity. We reported the first examples in which the stability, substrate scope, and site selectivity of an FDH were improved to synthetically useful levels via directed evolution. X-ray crystal structures of evolved FDHs and reversion mutations showed that random mutations throughout the RebH structure were critical to achieving high levels of activity and selectivity on diverse aromatic substrates, and these data were used in combination with molecular dynamics simulations to develop a predictive model for FDH selectivity. Finally, we used family-wide genome mining to identify a diverse set of FDHs with novel substrate scope and complementary regioselectivity on large, three-dimensionally complex compounds. The diversity of our evolved and mined FDHs allowed us to pursue synthetic applications beyond simple aromatic halogenation. For example, we established that FDHs catalyze enantioselective reactions involving desymmetrization, atroposelective halogenation, and halocyclization. These results highlight the ability of FDH active sites to tolerate different substrate topologies. This utility was further expanded by our recent studies on the single component FDH/FRed, AetF. While we were initially drawn to AetF because it does not require a separate FRed, we found that it halogenates substrates that are not halogenated efficiently or at all by other FDHs and provides high enantioselectivity for reactions that could only be achieved using RebH variants after extensive mutagenesis. Perhaps most notably, AetF catalyzes site-selective aromatic iodination and enantioselective iodoetherification. Together, these studies highlight the origins of FDH engineering, the utility and limitations of the enzymes developed to date, and the promise of FDHs for an ever-expanding range of biocatalytic halogenation reactions.
Affiliation(s)
- Jared C Lewis
  - Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States
21
Freschlin CR, Fahlberg SA, Heinzelman P, Romero PA. Neural network extrapolation to distant regions of the protein fitness landscape. Nat Commun 2024; 15:6405. [PMID: 39080282 PMCID: PMC11289474 DOI: 10.1038/s41467-024-50712-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 07/13/2024] [Indexed: 08/02/2024] [Open Access]
Abstract
Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. We also find that implementing a simple ensemble of convolutional neural networks enables robust design of high-performing variants in the local landscape. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape and how a simple ensembling approach makes protein engineering more robust.
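The ensembling idea mentioned in this abstract can be sketched as follows. This is an assumption-laden illustration, not the authors' pipeline: the "models" are hypothetical additive scorers standing in for independently trained convolutional networks, and aggregating by median (with the minimum as a more conservative alternative) is one simple choice among several.

```python
from statistics import median


def make_toy_model(weights):
    """Return a stand-in model: an additive per-residue scorer (hypothetical)."""
    def predict(seq):
        return sum(weights.get(aa, 0.0) for aa in seq)
    return predict


# Three stand-in models that agree about A but disagree about V.
models = [
    make_toy_model({"A": 1.0, "V": 0.5}),
    make_toy_model({"A": 0.8, "V": 0.7}),
    make_toy_model({"A": 1.2, "V": -2.0}),
]


def ensemble_scores(models, seq):
    """Summarise the individual predictions for one candidate sequence."""
    preds = [m(seq) for m in models]
    return {"median": median(preds), "min": min(preds), "max": max(preds)}


def rank_designs(models, candidates, key="median"):
    """Order candidates from best to worst by the chosen ensemble statistic."""
    return sorted(candidates,
                  key=lambda s: ensemble_scores(models, s)[key],
                  reverse=True)


candidates = ["AVV", "VVV", "AAA"]
print(rank_designs(models, candidates))
print(ensemble_scores(models, "AVV"))
```

A wide gap between the minimum and maximum prediction for a candidate signals that the models disagree, which is typical far from the training data; ranking by a conservative statistic penalises such candidates.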
Affiliation(s)
- Chase R Freschlin
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Sarah A Fahlberg
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Pete Heinzelman
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Philip A Romero
  - Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
22
Vornholt T, Mutný M, Schmidt GW, Schellhaas C, Tachibana R, Panke S, Ward TR, Krause A, Jeschek M. Enhanced Sequence-Activity Mapping and Evolution of Artificial Metalloenzymes by Active Learning. ACS Cent Sci 2024; 10:1357-1370. [PMID: 39071060 PMCID: PMC11273458 DOI: 10.1021/acscentsci.4c00258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Revised: 04/22/2024] [Accepted: 05/02/2024] [Indexed: 07/30/2024]
Abstract
Tailored enzymes are crucial for the transition to a sustainable bioeconomy. However, enzyme engineering is laborious and failure-prone due to its reliance on serendipity. The efficiency and success rates of engineering campaigns may be improved by applying machine learning to map the sequence-activity landscape based on small experimental data sets. Yet, it often proves challenging to reliably model large sequence spaces while keeping the experimental effort tractable. To address this challenge, we present an integrated pipeline combining large-scale screening with active machine learning, which we applied to engineer an artificial metalloenzyme (ArM) catalyzing a new-to-nature hydroamination reaction. Combining lab automation and next-generation sequencing, we acquired sequence-activity data for several thousand ArM variants. We then used Gaussian process regression to model the activity landscape and guide further screening rounds. Critical characteristics of our pipeline include the cost-effective generation of information-rich data sets, the integration of an explorative round to improve the model's performance, and the inclusion of experimental noise. Our approach led to an order-of-magnitude boost in the hit rate while making efficient use of experimental resources. Search strategies like this should find broad utility in enzyme engineering and accelerate the development of novel biocatalysts.
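One round of the model-guided screening loop described in this abstract can be sketched with a small Gaussian process. Everything specific here is an assumption for illustration: a zero-mean prior, a kernel equal to a squared-exponential kernel on one-hot encodings (which depends only on Hamming distance), a fixed noise variance, and an upper-confidence-bound rule for choosing the next variants. The abstract does not state which kernel, encoding or acquisition rule the authors used.

```python
import math


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def kernel(a, b, length_scale=1.0):
    """RBF kernel on one-hot encodings, written via the Hamming distance."""
    return math.exp(-hamming(a, b) / length_scale ** 2)


def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - tail) / M[i][i]
    return x


def gp_posterior(train_seqs, train_y, query, noise=0.1):
    """Posterior mean and standard deviation at one query sequence.

    noise is the assumed variance of the experimental measurement error.
    """
    K = [[kernel(a, b) + (noise if i == j else 0.0)
          for j, b in enumerate(train_seqs)]
         for i, a in enumerate(train_seqs)]
    k_star = [kernel(query, a) for a in train_seqs]
    alpha = solve(K, train_y)
    mean = sum(k * a for k, a in zip(k_star, alpha))
    v = solve(K, k_star)
    var = kernel(query, query) - sum(k * vi for k, vi in zip(k_star, v))
    return mean, math.sqrt(max(var, 0.0))


def select_batch(train_seqs, train_y, candidates, n, beta=2.0):
    """Pick the n candidates with the highest upper confidence bound."""
    def ucb(seq):
        mean, std = gp_posterior(train_seqs, train_y, seq)
        return mean + beta * std
    return sorted(candidates, key=ucb, reverse=True)[:n]


# Made-up screening data for three variants and three unscreened candidates.
train_seqs = ["AAA", "AAV", "VAA"]
train_y = [1.0, 2.0, 0.5]
candidates = ["AVA", "AVV", "VVV"]
batch = select_batch(train_seqs, train_y, candidates, n=2)
print(batch)
```

A larger `beta` favours exploration of poorly sampled regions, while `beta = 0` reduces the rule to picking the highest predicted activities. After the selected batch is measured, the new data are appended to the training set and the loop repeats.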
Affiliation(s)
- Tobias Vornholt
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
  - National Centre of Competence in Research (NCCR) Molecular Systems Engineering, 4056 Basel, Switzerland
- Mojmír Mutný
  - Department of Computer Science, ETH Zurich, Andreasstrasse 5, 8092 Zurich, Switzerland
- Gregor W. Schmidt
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
- Christian Schellhaas
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
- Ryo Tachibana
  - Department of Chemistry, University of Basel, Mattenstrasse 24a, 4058 Basel, Switzerland
- Sven Panke
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
  - National Centre of Competence in Research (NCCR) Molecular Systems Engineering, 4056 Basel, Switzerland
- Thomas R. Ward
  - National Centre of Competence in Research (NCCR) Molecular Systems Engineering, 4056 Basel, Switzerland
  - Department of Chemistry, University of Basel, Mattenstrasse 24a, 4058 Basel, Switzerland
- Andreas Krause
  - Department of Computer Science, ETH Zurich, Andreasstrasse 5, 8092 Zurich, Switzerland
- Markus Jeschek
  - Department of Biosystems Science and Engineering, ETH Zurich, Mattenstrasse 26, 4058 Basel, Switzerland
  - Institute of Microbiology, University of Regensburg, Universitätsstraße 31, 93053 Regensburg, Germany
23
Guan F, Tian X, Zhang R, Zhang Y, Wu N, Sun J, Zhang H, Tu T, Luo H, Yao B, Tian J, Huang H. Enhancing the endo-activity of the thermophilic chitinase to yield chitooligosaccharides with high degrees of polymerization. Bioresour Bioprocess 2024; 11:29. [PMID: 38647930 PMCID: PMC10991111 DOI: 10.1186/s40643-024-00735-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/11/2023] [Accepted: 01/21/2024] [Indexed: 04/25/2024] [Open Access]
Abstract
Thermophilic endo-chitinases are essential for production of highly polymerized chitooligosaccharides, which are advantageous for plant immunity, animal nutrition and health. However, thermophilic endo-chitinases are scarce and the transformation from exo- to endo-activity of chitinases is still a challenging problem. In this study, to enhance the endo-activity of the thermophilic chitinase Chi304, we proposed two approaches for rational design based on comprehensive structural and evolutionary analyses. Four effective single-point mutants were identified among 28 designed mutations. The ratios of (GlcNAc)3 to (GlcNAc)2 quantity (DP3/2) in the hydrolysates of the four single-point mutants during colloidal chitin degradation were 1.89, 1.65, 1.24, and 1.38 times that of Chi304, respectively. When the mutations were combined into double-point mutants, the DP3/2 ratios produced by F79A/W140R, F79A/M264L, F79A/W272R, and M264L/W272R were 2.06, 1.67, 1.82, and 1.86 times that of Chi304, respectively, and all four double-point mutants exhibited enhanced endo-activity. When applied to produce chitooligosaccharides (DP ≥ 3), F79A/W140R accumulated the most (GlcNAc)4, while M264L/W272R was the best at producing (GlcNAc)3, yielding 2.28 times as much as Chi304. The two mutants had exposed shallower substrate-binding pockets and stronger binding abilities to shape the substrate. Overall, this research offers a practical approach to altering the cutting pattern of a chitinase to generate functional chitooligosaccharides.
Affiliation(s)
- Feifei Guan
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Xiaoqian Tian
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
  - College of Food Science and Technology, Hebei Agricultural University, Hebei Baoding, 071000, China
- Ruohan Zhang
  - Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
- Yan Zhang
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
  - College of Food Science and Technology, Hebei Agricultural University, Hebei Baoding, 071000, China
- Ningfeng Wu
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Jilu Sun
  - College of Food Science and Technology, Hebei Agricultural University, Hebei Baoding, 071000, China
- Honglian Zhang
  - Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
- Tao Tu
  - Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
- Huiying Luo
  - Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
- Bin Yao
  - Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
- Jian Tian
  - Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, 100081, China
- Huoqing Huang
  - Institute of Animal Science, Chinese Academy of Agricultural Sciences, Beijing, 100193, China
24
Yang J, Li FZ, Arnold FH. Opportunities and Challenges for Machine Learning-Assisted Enzyme Engineering. ACS Cent Sci 2024; 10:226-241. [PMID: 38435522 PMCID: PMC10906252 DOI: 10.1021/acscentsci.3c01275] [Citation(s) in RCA: 25] [Impact Index Per Article: 25.0] [Received: 10/17/2023] [Revised: 12/26/2023] [Accepted: 01/16/2024] [Indexed: 03/05/2024]
Abstract
Enzymes can be engineered at the level of their amino acid sequences to optimize key properties such as expression, stability, substrate range, and catalytic efficiency, or even to unlock new catalytic activities not found in nature. Because the search space of possible proteins is vast, enzyme engineering usually involves discovering an enzyme starting point that has some level of the desired activity followed by directed evolution to improve its "fitness" for a desired application. Recently, machine learning (ML) has emerged as a powerful tool to complement this empirical process. ML models can contribute to (1) starting point discovery by functional annotation of known protein sequences or generating novel protein sequences with desired functions and (2) navigating protein fitness landscapes for fitness optimization by learning mappings between protein sequences and their associated fitness values. In this Outlook, we explain how ML complements enzyme engineering and discuss its future potential to unlock improved engineering outcomes.
Affiliation(s)
- Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Francesca-Zhoufan Li
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Frances H. Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| |
|
25
|
Hassan J, Saeed SM, Deka L, Uddin MJ, Das DB. Applications of Machine Learning (ML) and Mathematical Modeling (MM) in Healthcare with Special Focus on Cancer Prognosis and Anticancer Therapy: Current Status and Challenges. Pharmaceutics 2024; 16:260. [PMID: 38399314 PMCID: PMC10892549 DOI: 10.3390/pharmaceutics16020260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/08/2023] [Revised: 01/29/2024] [Accepted: 02/07/2024] [Indexed: 02/25/2024] Open
Abstract
The use of data-driven high-throughput analytical techniques, which has given rise to computational oncology, is undisputed. The widespread use of machine learning (ML) and mathematical modeling (MM)-based techniques is widely acknowledged. These two approaches have fueled the advancement in cancer research and eventually led to the uptake of telemedicine in cancer care. For diagnostic, prognostic, and treatment purposes concerning different types of cancer research, vast databases of varied information with manifold dimensions are required, and indeed, all this information can only be managed by an automated system developed utilizing ML and MM. In addition, MM is being used to probe the relationship between the pharmacokinetics and pharmacodynamics (PK/PD interactions) of anti-cancer substances to improve cancer treatment, and also to refine the quality of existing treatment models by being incorporated at all steps of research and development related to cancer and in routine patient care. This review will serve as a consolidation of the advancement and benefits of ML and MM techniques with a special focus on the area of cancer prognosis and anticancer therapy, leading to the identification of challenges (data quantity, ethical consideration, and data privacy) which are yet to be fully addressed in current studies.
Affiliation(s)
- Jasmin Hassan
- Drug Delivery & Therapeutics Lab, Dhaka 1212, Bangladesh; (J.H.); (S.M.S.)
| | | | - Lipika Deka
- Faculty of Computing, Engineering and Media, De Montfort University, Leicester LE1 9BH, UK
| | - Md Jasim Uddin
- Department of Pharmaceutical Technology, Faculty of Pharmacy, Universiti Malaya, Kuala Lumpur 50603, Malaysia
| | - Diganta B. Das
- Department of Chemical Engineering, Loughborough University, Loughborough LE11 3TU, UK
| |
|
26
|
Praljak N, Lian X, Ranganathan R, Ferguson AL. ProtWave-VAE: Integrating Autoregressive Sampling with Latent-Based Inference for Data-Driven Protein Design. ACS Synth Biol 2023; 12:3544-3561. [PMID: 37988083 PMCID: PMC10911954 DOI: 10.1021/acssynbio.3c00261] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/22/2023]
Abstract
Deep generative models (DGMs) have shown great success in the understanding and data-driven design of proteins. Variational autoencoders (VAEs) are a popular DGM approach that can learn the correlated patterns of amino acid mutations within a multiple sequence alignment (MSA) of protein sequences and distill this information into a low-dimensional latent space to expose phylogenetic and functional relationships and guide generative protein design. Autoregressive (AR) models are another popular DGM approach that typically lacks a low-dimensional latent embedding but does not require training sequences to be aligned into an MSA and enables the design of variable length proteins. In this work, we propose ProtWave-VAE as a novel and lightweight DGM, employing an information maximizing VAE with a dilated convolution encoder and an autoregressive WaveNet decoder. This architecture blends the strengths of the VAE and AR paradigms in enabling training over unaligned sequence data and the conditional generative design of variable length sequences from an interpretable, low-dimensional learned latent space. We evaluated the model's ability to infer patterns and design rules within alignment-free homologous protein family sequences and to design novel synthetic proteins in four diverse protein families. We show that our model can infer meaningful functional and phylogenetic embeddings within latent spaces and make highly accurate predictions within semisupervised downstream fitness prediction tasks. In an application to the C-terminal SH3 domain in the Sho1 transmembrane osmosensing receptor in baker's yeast, we subject ProtWave-VAE-designed sequences to experimental gene synthesis and select-seq assays for the osmosensing function to show that the model enables synthetic protein design, conditional C-terminus diversification, and engineering of the osmosensing function into SH3 paralogues.
Affiliation(s)
- Nikša Praljak
- Graduate Program in Biophysical Sciences, University of Chicago, Chicago, Illinois 60637, United States
| | - Xinran Lian
- Department of Chemistry, University of Chicago, Chicago, Illinois 60637, United States
| | - Rama Ranganathan
- Center for Physics of Evolving Systems and Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, Illinois 60637, United States
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
| | - Andrew L Ferguson
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
| |
|
27
|
Notin P, Marks DS, Weitzman R, Gal Y. ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers. bioRxiv 2023:2023.12.06.570473. [PMID: 38106034 PMCID: PMC10723423 DOI: 10.1101/2023.12.06.570473] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 12/19/2023]
Abstract
Protein design holds immense potential for optimizing naturally occurring proteins, with broad applications in drug discovery, material design, and sustainability. However, computational methods for protein engineering are confronted with significant challenges, such as an expansive design space, sparse functional regions, and a scarcity of available labels. These issues are further exacerbated in practice by the fact most real-life design scenarios necessitate the simultaneous optimization of multiple properties. In this work, we introduce ProteinNPT, a non-parametric transformer variant tailored to protein sequences and particularly suited to label-scarce and multi-task learning settings. We first focus on the supervised fitness prediction setting and develop several cross-validation schemes which support robust performance assessment. We subsequently reimplement prior top-performing baselines, introduce several extensions of these baselines by integrating diverse branches of the protein engineering literature, and demonstrate that ProteinNPT consistently outperforms all of them across a diverse set of protein property prediction tasks. Finally, we demonstrate the value of our approach for iterative protein design across extensive in silico Bayesian optimization and conditional sampling experiments.
Affiliation(s)
| | | | | | - Yarin Gal
- Computer Science, University of Oxford
| |
|
28
|
Xie WJ, Warshel A. Harnessing generative AI to decode enzyme catalysis and evolution for enhanced engineering. Natl Sci Rev 2023; 10:nwad331. [PMID: 38299119 PMCID: PMC10829072 DOI: 10.1093/nsr/nwad331] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/04/2023] [Revised: 09/27/2023] [Accepted: 10/13/2023] [Indexed: 02/02/2024] Open
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. Generative models could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, catalytic activity and stability, rationalizing the laboratory evolution of de novo enzymes, and decoding protein sequence semantics and their application in enzyme engineering. Notably, the prediction of catalytic activity and stability of enzymes using natural protein sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Affiliation(s)
- Wen Jun Xie
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, Genetics Institute, University of Florida, Gainesville, FL 32610, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA
| |
|
29
|
Fahlberg SA, Freschlin CR, Heinzelman P, Romero PA. Neural network extrapolation to distant regions of the protein fitness landscape. bioRxiv 2023:2023.11.08.566287. [PMID: 37987009 PMCID: PMC10659313 DOI: 10.1101/2023.11.08.566287] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 11/22/2023]
Abstract
Machine learning (ML) has transformed protein engineering by constructing models of the underlying sequence-function landscape to accelerate the discovery of new biomolecules. ML-guided protein design requires models, trained on local sequence-function information, to accurately predict distant fitness peaks. In this work, we evaluate neural networks' capacity to extrapolate beyond their training data. We perform model-guided design using a panel of neural network architectures trained on protein G (GB1)-Immunoglobulin G (IgG) binding data and experimentally test thousands of GB1 designs to systematically evaluate the models' extrapolation. We find each model architecture infers markedly different landscapes from the same data, which give rise to unique design preferences. We find simpler models excel in local extrapolation to design high fitness proteins, while more sophisticated convolutional models can venture deep into sequence space to design proteins that fold but are no longer functional. Our findings highlight how each architecture's inductive biases prime them to learn different aspects of the protein fitness landscape.
Affiliation(s)
- Sarah A Fahlberg
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Chase R Freschlin
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Pete Heinzelman
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
| | - Philip A Romero
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
| |
|
30
|
Merzbacher C, Oyarzún DA. Applications of artificial intelligence and machine learning in dynamic pathway engineering. Biochem Soc Trans 2023; 51:1871-1879. [PMID: 37656433 PMCID: PMC10657174 DOI: 10.1042/bst20221542] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 05/24/2023] [Revised: 08/07/2023] [Accepted: 08/21/2023] [Indexed: 09/02/2023]
Abstract
Dynamic pathway engineering aims to build metabolic production systems embedded with intracellular control mechanisms for improved performance. These control systems enable host cells to self-regulate the temporal activity of a production pathway in response to perturbations, using a combination of biosensors and feedback circuits for controlling expression of heterologous enzymes. Pathway design, however, requires assembling together multiple biological parts into suitable circuit architectures, as well as careful calibration of the function of each component. This results in a large design space that is costly to navigate through experimentation alone. Methods from artificial intelligence (AI) and machine learning are gaining increasing attention as tools to accelerate the design cycle, owing to their ability to identify hidden patterns in data and rapidly screen through large collections of designs. In this review, we discuss recent developments in the application of machine learning methods to the design of dynamic pathways and their components. We cover recent successes and offer perspectives for future developments in the field. The integration of AI into metabolic engineering pipelines offers great opportunities to streamline design and discover control systems for improved production of high-value chemicals.
Affiliation(s)
| | - Diego A. Oyarzún
- School of Informatics, University of Edinburgh, Edinburgh, U.K.
- The Alan Turing Institute, London, U.K.
- School of Biological Sciences, University of Edinburgh, Edinburgh, U.K.
| |
|
31
|
Xie WJ, Warshel A. Harnessing Generative AI to Decode Enzyme Catalysis and Evolution for Enhanced Engineering. bioRxiv 2023:2023.10.10.561808. [PMID: 37873334 PMCID: PMC10592750 DOI: 10.1101/2023.10.10.561808] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/25/2023]
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. By applying generative models, we could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, activity, and stability, rationalizing the laboratory evolution of de novo enzymes, decoding protein sequence semantics, and its applications in enzyme engineering. Notably, the prediction of enzyme activity and stability using natural enzyme sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Affiliation(s)
- Wen Jun Xie
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), Genetics Institute, University of Florida, Gainesville, FL, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
| |
|
32
|
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. Brief Bioinform 2023; 24:bbad289. [PMID: 37580175 PMCID: PMC10516362 DOI: 10.1093/bib/bbad289] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 05/08/2023] [Revised: 07/14/2023] [Accepted: 07/26/2023] [Indexed: 08/16/2023] Open
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824 MI, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824 MI, USA
| |
|
33
|
Abstract
The ability to site-selectively modify equivalent functional groups in a molecule has the potential to streamline syntheses and increase product yields by lowering step counts. Enzymes catalyze site-selective transformations throughout primary and secondary metabolism, but leveraging this capability for non-native substrates and reactions requires a detailed understanding of the potential and limitations of enzyme catalysis and how these bounds can be extended by protein engineering. In this review, we discuss representative examples of site-selective enzyme catalysis involving functional group manipulation and C-H bond functionalization. We include illustrative examples of native catalysis, but our focus is on cases involving non-native substrates and reactions often using engineered enzymes. We then discuss the use of these enzymes for chemoenzymatic transformations and target-oriented synthesis and conclude with a survey of tools and techniques that could expand the scope of non-native site-selective enzyme catalysis.
Affiliation(s)
- Dibyendu Mondal
- Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States
| | - Harrison M Snodgrass
- Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States
| | - Christian A Gomez
- Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States
| | - Jared C Lewis
- Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States
| |
|
34
|
Yang J, Ducharme J, Johnston KE, Li FZ, Yue Y, Arnold FH. DeCOIL: Optimization of Degenerate Codon Libraries for Machine Learning-Assisted Protein Engineering. ACS Synth Biol 2023; 12:2444-2454. [PMID: 37524064 DOI: 10.1021/acssynbio.3c00301] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 08/02/2023]
Abstract
With advances in machine learning (ML)-assisted protein engineering, models based on data, biophysics, and natural evolution are being used to propose informed libraries of protein variants to explore. Synthesizing these libraries for experimental screens is a major bottleneck, as the cost of obtaining large numbers of exact gene sequences is often prohibitive. Degenerate codon (DC) libraries are a cost-effective alternative for generating combinatorial mutagenesis libraries where mutations are targeted to a handful of amino acid sites. However, existing computational methods to optimize DC libraries to include desired protein variants are not well suited to design libraries for ML-assisted protein engineering. To address these drawbacks, we present DEgenerate Codon Optimization for Informed Libraries (DeCOIL), a generalized method that directly optimizes DC libraries to be useful for protein engineering: to sample protein variants that are likely to have both high fitness and high diversity in the sequence search space. Using computational simulations and wet-lab experiments, we demonstrate that DeCOIL is effective across two specific case studies, with the potential to be applied to many other use cases. DeCOIL offers several advantages over existing methods, as it is direct, easy to use, generalizable, and scalable. With accompanying software (https://github.com/jsunn-y/DeCOIL), DeCOIL can be readily implemented to generate desired informed libraries.
Affiliation(s)
- Jason Yang
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Julie Ducharme
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Kadina E Johnston
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Francesca-Zhoufan Li
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| | - Yisong Yue
- Division of Engineering and Applied Sciences, California Institute of Technology, Pasadena, California 91125, United States
| | - Frances H Arnold
- Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California 91125, United States
- Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California 91125, United States
| |
|
35
|
McConnell A, Hackel BJ. Protein engineering via sequence-performance mapping. Cell Syst 2023; 14:656-666. [PMID: 37494931 PMCID: PMC10527434 DOI: 10.1016/j.cels.2023.06.009] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 02/27/2023] [Revised: 05/10/2023] [Accepted: 06/21/2023] [Indexed: 07/28/2023]
Abstract
Discovery and evolution of new and improved proteins has empowered molecular therapeutics, diagnostics, and industrial biotechnology. Discovery and evolution both require efficient screens and effective libraries, although they differ in their challenges because of the absence or presence, respectively, of an initial protein variant with the desired function. A host of high-throughput technologies, both experimental and computational, enable efficient screens to identify performant protein variants. In partnership, an informed search of sequence space is needed to overcome the immensity, sparsity, and complexity of the sequence-performance landscape. Early in the historical trajectory of protein engineering, these elements aligned with distinct approaches to identify the most performant sequence: selection from large, randomized combinatorial libraries versus rational computational design. Substantial advances have now emerged from the synergy of these perspectives. Rational design of combinatorial libraries aids the experimental search of sequence space, and high-throughput, high-integrity experimental data inform computational design. At the core of the collaborative interface, efficient protein characterization (rather than mere selection of optimal variants) maps sequence-performance landscapes. Such quantitative maps elucidate the complex relationships between protein sequence and performance (e.g., binding, catalytic efficiency, biological activity, and developability), thereby advancing fundamental protein science and facilitating protein discovery and evolution.
Affiliation(s)
- Adam McConnell
- Department of Biomedical Engineering, University of Minnesota - Twin Cities, 421 Washington Avenue SE, Minneapolis, MN 55455, USA
| | - Benjamin J Hackel
- Department of Biomedical Engineering, University of Minnesota - Twin Cities, 421 Washington Avenue SE, Minneapolis, MN 55455, USA; Department of Chemical Engineering and Materials Science, University of Minnesota - Twin Cities, 421 Washington Avenue SE, Minneapolis, MN 55455, USA.
| |
|
36
|
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. arXiv 2023:arXiv:2307.14587v1. [PMID: 37547662 PMCID: PMC10402185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, 48824, MI, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, 48824, MI, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824, MI, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824, MI, USA
| |
|
37
|
Boukid F, Ganeshan S, Wang Y, Tülbek MÇ, Nickerson MT. Bioengineered Enzymes and Precision Fermentation in the Food Industry. Int J Mol Sci 2023; 24:10156. [PMID: 37373305 DOI: 10.3390/ijms241210156] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 05/20/2023] [Revised: 06/06/2023] [Accepted: 06/13/2023] [Indexed: 06/29/2023] Open
Abstract
Enzymes have been used in the food processing industry for many years. However, the use of native enzymes is not conducive to high activity, efficiency, range of substrates, and adaptability to harsh food processing conditions. The advent of enzyme engineering approaches such as rational design, directed evolution, and semi-rational design provided much-needed impetus for tailor-made enzymes with improved or novel catalytic properties. Production of designer enzymes became further refined with the emergence of synthetic biology and gene editing techniques and a plethora of other tools such as artificial intelligence, and computational and bioinformatics analyses which have paved the way for what is referred to as precision fermentation for the production of these designer enzymes more efficiently. With all the technologies available, the bottleneck is now in the scale-up production of these enzymes. There is generally a lack of accessibility thereof of large-scale capabilities and know-how. This review is aimed at highlighting these various enzyme-engineering strategies and the associated scale-up challenges, including safety concerns surrounding genetically modified microorganisms and the use of cell-free systems to circumvent this issue. The use of solid-state fermentation (SSF) is also addressed as a potentially low-cost production system, amenable to customization and employing inexpensive feedstocks as substrate.
Affiliation(s)
- Fatma Boukid
- ClonBio Group Ltd., 6 Fitzwilliam Pl, D02 XE61 Dublin, Ireland
| | | | - Yingxin Wang
- Saskatchewan Food Industry Development Centre, Saskatoon, SK S7M 5V1, Canada
| | | | - Michael T Nickerson
- Department of Food and Bioproduct Sciences, University of Saskatchewan, Saskatoon, SK S7N 5A8, Canada
| |
|
38
|
Weinstein JY, Martí-Gómez C, Lipsh-Sokolik R, Hoch SY, Liebermann D, Nevo R, Weissman H, Petrovich-Kopitman E, Margulies D, Ivankov D, McCandlish DM, Fleishman SJ. Designed active-site library reveals thousands of functional GFP variants. Nat Commun 2023; 14:2890. [PMID: 37210560 PMCID: PMC10199939 DOI: 10.1038/s41467-023-38099-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 11/02/2022] [Accepted: 04/13/2023] [Indexed: 05/22/2023] Open
Abstract
Mutations in a protein active site can lead to dramatic and useful changes in protein activity. The active site, however, is sensitive to mutations due to a high density of molecular interactions, substantially reducing the likelihood of obtaining functional multipoint mutants. We introduce an atomistic and machine-learning-based approach, called high-throughput Functional Libraries (htFuncLib), that designs a sequence space in which mutations form low-energy combinations that mitigate the risk of incompatible interactions. We apply htFuncLib to the GFP chromophore-binding pocket, and, using fluorescence readout, recover >16,000 unique designs encoding as many as eight active-site mutations. Many designs exhibit substantial and useful diversity in functional thermostability (up to 96 °C), fluorescence lifetime, and quantum yield. By eliminating incompatible active-site mutations, htFuncLib generates a large diversity of functional sequences. We envision that htFuncLib will be used in one-shot optimization of activity in enzymes, binders, and other proteins.
Affiliation(s)
| | - Carlos Martí-Gómez
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
| | - Rosalie Lipsh-Sokolik
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, 7610001, Israel
| | - Shlomo Yakir Hoch
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, 7610001, Israel
| | - Demian Liebermann
- Department of Chemical and Biological Physics, Weizmann Institute of Science, Rehovot, 7610001, Israel
| | - Reinat Nevo
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, 7610001, Israel
| | - Haim Weissman
- Department of Molecular Chemistry and Materials Science, Weizmann Institute of Science, Rehovot, 7610001, Israel
| | | | - David Margulies
- Department of Chemical and Structural Biology, Weizmann Institute of Science, Rehovot, 7610001, Israel
| | - Dmitry Ivankov
- Center of Life Sciences, Skolkovo Institute of Science and Technology, Moscow, Russia
| | - David M McCandlish
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, 11724, USA
| | - Sarel J Fleishman
- Department of Biomolecular Sciences, Weizmann Institute of Science, Rehovot, 7610001, Israel.
| |
|
39
|
Gantz M, Neun S, Medcalf EJ, van Vliet LD, Hollfelder F. Ultrahigh-Throughput Enzyme Engineering and Discovery in In Vitro Compartments. Chem Rev 2023; 123:5571-5611. [PMID: 37126602 PMCID: PMC10176489 DOI: 10.1021/acs.chemrev.2c00910] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 12/30/2022] [Indexed: 05/03/2023]
Abstract
Novel and improved biocatalysts are increasingly sourced from libraries via experimental screening. The success of such campaigns is crucially dependent on the number of candidates tested. Water-in-oil emulsion droplets can replace the classical test tube, to provide in vitro compartments as an alternative screening format, containing genotype and phenotype and enabling a readout of function. The scale-down to micrometer droplet diameters and picoliter volumes brings about a >10⁷-fold volume reduction compared to 96-well-plate screening. Droplets made in automated microfluidic devices can be integrated into modular workflows to set up multistep screening protocols involving various detection modes to sort >10⁷ variants a day with kHz frequencies. The repertoire of assays available for droplet screening covers all seven enzyme commission (EC) number classes, setting the stage for widespread use of droplet microfluidics in everyday biochemical experiments. We review the practicalities of adapting droplet screening for enzyme discovery and for detailed kinetic characterization. These new ways of working will not just accelerate discovery experiments currently limited by screening capacity but profoundly change the paradigms we can probe. By interfacing the results of ultrahigh-throughput droplet screening with next-generation sequencing and deep learning, strategies for directed evolution can be implemented, examined, and evaluated.
Affiliation(s)
| | | | | | | | - Florian Hollfelder
- Department of Biochemistry, University of Cambridge, 80 Tennis Court Rd, Cambridge CB2 1GA, U.K.
| |
|
40
|
Rabitz H, Russell B, Ho TS. The Surprising Ease of Finding Optimal Solutions for Controlling Nonlinear Phenomena in Quantum and Classical Complex Systems. J Phys Chem A 2023; 127:4224-4236. [PMID: 37142303 DOI: 10.1021/acs.jpca.3c01896] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/06/2023]
Abstract
This Perspective addresses the often observed surprising ease of achieving optimal control of nonlinear phenomena in quantum and classical complex systems. The circumstances involved are wide-ranging, with scenarios including manipulation of atomic scale processes, maximization of chemical and material properties or synthesis yields, Nature's optimization of species' populations by natural selection, and directed evolution. Natural evolution will mainly be discussed in terms of laboratory experiments with microorganisms, and the field is also distinct from the other domains where a scientist specifies the goal(s) and oversees the control process. We use the word "control" in reference to all of the available variables, regardless of the circumstance. The empirical observations on the ease of achieving at least good, if not excellent, control in diverse domains of science raise the question of why this occurs despite the generally inherent complexity of the systems in each scenario. The key to addressing the question lies in examining the associated control landscape, which is defined as the optimization objective as a function of the control variables that can be as diverse as the phenomena under consideration. Controls may range from laser pulses, chemical reagents, chemical processing conditions, out to nucleic acids in the genome and more. This Perspective presents a conjecture, based on present findings, that the systematics of readily finding good outcomes from controlled phenomena may be unified through consideration of control landscapes with the same common set of three underlying assumptions (the existence of an optimal solution, the ability for local movement on the landscape, and the availability of sufficient control resources), whose validity needs assessment in each scenario. In practice, many cases permit using myopic gradient-like algorithms while other circumstances utilize algorithms having some elements of stochasticity or introduced noise, depending on whether the landscape is locally smooth or rough. The overarching observation is that only relatively short searches are required despite the common high dimensionality of the available controls in typical scenarios.
Affiliation(s)
- Herschel Rabitz
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
| | - Benjamin Russell
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
| | - Tak-San Ho
- Department of Chemistry, Princeton University, Princeton, New Jersey 08544, United States
| |
|
41
|
Verkhivker G, Alshahrani M, Gupta G, Xiao S, Tao P. From Deep Mutational Mapping of Allosteric Protein Landscapes to Deep Learning of Allostery and Hidden Allosteric Sites: Zooming in on "Allosteric Intersection" of Biochemical and Big Data Approaches. Int J Mol Sci 2023; 24:7747. [PMID: 37175454 PMCID: PMC10178073 DOI: 10.3390/ijms24097747] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/06/2023] [Revised: 04/22/2023] [Accepted: 04/23/2023] [Indexed: 05/15/2023] Open
Abstract
The recent advances in artificial intelligence (AI) and machine learning have driven the design of new expert systems and automated workflows that are able to model complex chemical and biological phenomena. In recent years, machine learning approaches have been developed and actively deployed to facilitate computational and experimental studies of protein dynamics and allosteric mechanisms. In this review, we discuss in detail new developments along two major directions of allosteric research through the lens of data-intensive biochemical approaches and AI-based computational methods. Despite considerable progress in applications of AI methods for protein structure and dynamics studies, the intersection between allosteric regulation, the emerging structural biology technologies and AI approaches remains largely unexplored, calling for the development of AI-augmented integrative structural biology. In this review, we focus on the latest remarkable progress in deep high-throughput mining and comprehensive mapping of allosteric protein landscapes and allosteric regulatory mechanisms as well as on the new developments in AI methods for prediction and characterization of allosteric binding sites on the proteome level. We also discuss new AI-augmented structural biology approaches that expand our knowledge of the universe of protein dynamics and allostery. We conclude with an outlook and highlight the importance of developing an open science infrastructure for machine learning studies of allosteric regulation and validation of computational approaches using integrative studies of allosteric mechanisms. The development of community-accessible tools that uniquely leverage the existing experimental and simulation knowledgebase to enable interrogation of the allosteric functions can provide a much-needed boost to further innovation and integration of experimental and computational technologies empowered by booming AI field.
Affiliation(s)
- Gennady Verkhivker
- Keck Center for Science and Engineering, Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, CA 92866, USA
- Department of Biomedical and Pharmaceutical Sciences, Chapman University School of Pharmacy, Irvine, CA 92618, USA
| | - Mohammed Alshahrani
- Keck Center for Science and Engineering, Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, CA 92866, USA
| | - Grace Gupta
- Keck Center for Science and Engineering, Graduate Program in Computational and Data Sciences, Schmid College of Science and Technology, Chapman University, Orange, CA 92866, USA
| | - Sian Xiao
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, TX 75275, USA
| | - Peng Tao
- Department of Chemistry, Center for Research Computing, Center for Drug Discovery, Design, and Delivery (CD4), Southern Methodist University, Dallas, TX 75275, USA
| |
|
42
|
Guo B, Zheng H, Jiang H, Li X, Guan N, Zuo Y, Zhang Y, Yang H, Wang X. Enhanced compound-protein binding affinity prediction by representing protein multimodal information via a coevolutionary strategy. Brief Bioinform 2023; 24:6995409. [PMID: 36682005 DOI: 10.1093/bib/bbac628] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 09/15/2022] [Revised: 12/12/2022] [Accepted: 12/25/2022] [Indexed: 01/23/2023] Open
Abstract
Due to the lack of a method to efficiently represent the multimodal information of a protein, including its structure and sequence information, predicting compound-protein binding affinity (CPA) still suffers from low accuracy when applying machine-learning methods. To overcome this limitation, in a novel end-to-end architecture (named FeatNN), we develop a coevolutionary strategy to jointly represent the structure and sequence features of proteins and ultimately optimize the mathematical models for predicting CPA. Furthermore, from the perspective of data-driven approach, we proposed a rational method that can utilize both high- and low-quality databases to optimize the accuracy and generalization ability of FeatNN in CPA prediction tasks. Notably, we visually interpret the feature interaction process between sequence and structure in the rationally designed architecture. As a result, FeatNN considerably outperforms the state-of-the-art (SOTA) baseline in virtual drug evaluation tasks, indicating the feasibility of this approach for practical use. FeatNN provides an outstanding method for higher CPA prediction accuracy and better generalization ability by efficiently representing multimodal information of proteins via a coevolutionary strategy.
Affiliation(s)
- Binjie Guo
- Department of Neurobiology and Department of Rehabilitation Medicine, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang Province 310058, China
- Liangzhu Laboratory, MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, 1369 West Wenyi Road, Hangzhou 311121, China
- NHC and CAMS Key Laboratory of Medical Neurobiology, Zhejiang University, Hangzhou 310058, China
| | - Hanyu Zheng
- Department of Neurobiology and Department of Rehabilitation Medicine, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang Province 310058, China
- Liangzhu Laboratory, MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, 1369 West Wenyi Road, Hangzhou 311121, China
- NHC and CAMS Key Laboratory of Medical Neurobiology, Zhejiang University, Hangzhou 310058, China
| | - Haohan Jiang
- Department of Neurobiology and Department of Rehabilitation Medicine, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang Province 310058, China
- Liangzhu Laboratory, MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, 1369 West Wenyi Road, Hangzhou 311121, China
- NHC and CAMS Key Laboratory of Medical Neurobiology, Zhejiang University, Hangzhou 310058, China
| | - Xiaodan Li
- Department of Neurobiology and Department of Rehabilitation Medicine, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang Province 310058, China
- Liangzhu Laboratory, MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, 1369 West Wenyi Road, Hangzhou 311121, China
- NHC and CAMS Key Laboratory of Medical Neurobiology, Zhejiang University, Hangzhou 310058, China
| | - Naiyu Guan
- Department of Neurobiology and Department of Rehabilitation Medicine, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang Province 310058, China
- Liangzhu Laboratory, MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, 1369 West Wenyi Road, Hangzhou 311121, China
- NHC and CAMS Key Laboratory of Medical Neurobiology, Zhejiang University, Hangzhou 310058, China
| | - Yanming Zuo
- Liangzhu Laboratory, MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, 1369 West Wenyi Road, Hangzhou 311121, China
- NHC and CAMS Key Laboratory of Medical Neurobiology, Zhejiang University, Hangzhou 310058, China
| | - Yicheng Zhang
- Department of Neurobiology and Department of Rehabilitation Medicine, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang Province 310058, China
- Liangzhu Laboratory, MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, 1369 West Wenyi Road, Hangzhou 311121, China
- NHC and CAMS Key Laboratory of Medical Neurobiology, Zhejiang University, Hangzhou 310058, China
| | - Hengfu Yang
- School of Computer Science, Hunan First Normal University, Changsha, 410205 Hunan, China
| | - Xuhua Wang
- Department of Neurobiology and Department of Rehabilitation Medicine, First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, Zhejiang Province 310058, China
- Liangzhu Laboratory, MOE Frontier Science Center for Brain Science and Brain-machine Integration, State Key Laboratory of Brain-machine Intelligence, Zhejiang University, 1369 West Wenyi Road, Hangzhou 311121, China
- NHC and CAMS Key Laboratory of Medical Neurobiology, Zhejiang University, Hangzhou 310058, China
- Co-innovation Center of Neuroregeneration, Nantong University, Nantong, 226001 Jiangsu, China
| |
|
43
|
Inferring protein fitness landscapes from laboratory evolution experiments. PLoS Comput Biol 2023; 19:e1010956. [PMID: 36857380 PMCID: PMC10010530 DOI: 10.1371/journal.pcbi.1010956] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/05/2022] [Revised: 03/13/2023] [Accepted: 02/16/2023] [Indexed: 03/02/2023] Open
Abstract
Directed laboratory evolution applies iterative rounds of mutation and selection to explore the protein fitness landscape and provides rich information regarding the underlying relationships between protein sequence, structure, and function. Laboratory evolution data consist of protein sequences sampled from evolving populations over multiple generations and this data type does not fit into established supervised and unsupervised machine learning approaches. We develop a statistical learning framework that models the evolutionary process and can infer the protein fitness landscape from multiple snapshots along an evolutionary trajectory. We apply our modeling approach to dihydrofolate reductase (DHFR) laboratory evolution data and the resulting landscape parameters capture important aspects of DHFR structure and function. We use the resulting model to understand the structure of the fitness landscape and find numerous examples of epistasis but an overall global peak that is evolutionarily accessible from most starting sequences. Finally, we use the model to perform an in silico extrapolation of the DHFR laboratory evolution trajectory and computationally design proteins from future evolutionary rounds.
|
44
|
Fernández-Quintero ML, Ljungars A, Waibl F, Greiff V, Andersen JT, Gjølberg TT, Jenkins TP, Voldborg BG, Grav LM, Kumar S, Georges G, Kettenberger H, Liedl KR, Tessier PM, McCafferty J, Laustsen AH. Assessing developability early in the discovery process for novel biologics. MAbs 2023; 15:2171248. [PMID: 36823021 PMCID: PMC9980699 DOI: 10.1080/19420862.2023.2171248] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 10/27/2022] [Accepted: 01/18/2023] [Indexed: 02/25/2023] Open
Abstract
Beyond potency, a good developability profile is a key attribute of a biological drug. Selecting and screening for such attributes early in the drug development process can save resources and avoid costly late-stage failures. Here, we review some of the most important developability properties that can be assessed early on for biologics. These include the influence of the source of the biologic, its biophysical and pharmacokinetic properties, and how well it can be expressed recombinantly. We furthermore present in silico, in vitro, and in vivo methods and techniques that can be exploited at different stages of the discovery process to identify molecules with liabilities and thereby facilitate the selection of the most optimal drug leads. Finally, we reflect on the most relevant developability parameters for injectable versus orally delivered biologics and provide an outlook toward what general trends are expected to rise in the development of biologics.
Affiliation(s)
- Monica L. Fernández-Quintero
- Center for Molecular Biosciences Innsbruck (CMBI), Department of General, Inorganic and Theoretical Chemistry, University of Innsbruck, Innsbruck, Austria
| | - Anne Ljungars
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Franz Waibl
- Center for Molecular Biosciences Innsbruck (CMBI), Department of General, Inorganic and Theoretical Chemistry, University of Innsbruck, Innsbruck, Austria
| | - Victor Greiff
- Department of Immunology, University of Oslo, Oslo, Norway
| | - Jan Terje Andersen
- Department of Immunology, University of Oslo, Oslo University Hospital Rikshospitalet, Oslo, Norway
- Institute of Clinical Medicine and Department of Pharmacology, University of Oslo, Oslo, Norway
| | | | - Timothy P. Jenkins
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Bjørn Gunnar Voldborg
- National Biologics Facility, Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Lise Marie Grav
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Sandeep Kumar
- Biotherapeutics Discovery, Boehringer Ingelheim Pharmaceuticals Inc, Ridgefield, CT, USA
| | - Guy Georges
- Roche Pharma Research and Early Development, Large Molecule Research, Roche Innovation Center Munich, Penzberg, Germany
| | - Hubert Kettenberger
- Roche Pharma Research and Early Development, Large Molecule Research, Roche Innovation Center Munich, Penzberg, Germany
| | - Klaus R. Liedl
- Center for Molecular Biosciences Innsbruck (CMBI), Department of General, Inorganic and Theoretical Chemistry, University of Innsbruck, Innsbruck, Austria
| | - Peter M. Tessier
- Department of Chemical Engineering, Pharmaceutical Sciences and Biomedical Engineering, Biointerfaces Institute, University of Michigan, Ann Arbor, Michigan, USA
| | - John McCafferty
- Department of Medicine, Cambridge Institute of Therapeutic Immunology and Infectious Disease, University of Cambridge, Cambridge, UK
- Maxion Therapeutics, Babraham Research Campus, Cambridge, UK
| | - Andreas H. Laustsen
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark
| |
|
45
|
Leander M, Liu Z, Cui Q, Raman S. Deep mutational scanning and machine learning reveal structural and molecular rules governing allosteric hotspots in homologous proteins. eLife 2022; 11:e79932. [PMID: 36226916 PMCID: PMC9662819 DOI: 10.7554/elife.79932] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 05/03/2022] [Accepted: 10/13/2022] [Indexed: 01/29/2023] Open
Abstract
A fundamental question in protein science is where allosteric hotspots - residues critical for allosteric signaling - are located, and what properties differentiate them. We carried out deep mutational scanning (DMS) of four homologous bacterial allosteric transcription factors (aTFs) to identify hotspots and built a machine learning model with this data to glean the structural and molecular properties of allosteric hotspots. We found hotspots to be distributed protein-wide rather than being restricted to 'pathways' linking allosteric and active sites as is commonly assumed. Despite structural homology, the location of hotspots was not superimposable across the aTFs. However, common signatures emerged when comparing hotspots coincident with long-range interactions, suggesting that the allosteric mechanism is conserved among the homologs despite differences in molecular details. Machine learning with our large DMS datasets revealed global structural and dynamic properties to be stronger predictors of whether a residue is a hotspot than local and physicochemical properties. Furthermore, a model trained on one protein can predict hotspots in a homolog. In summary, the overall allosteric mechanism is embedded in the structural fold of the aTF family, but the finer, molecular details are sequence-specific.
Affiliation(s)
- Megan Leander
- Department of Biochemistry, University of Wisconsin-Madison, Madison, United States
| | - Zhuang Liu
- Department of Physics, Boston University, Boston, United States
| | - Qiang Cui
- Department of Physics, Boston University, Boston, United States
- Department of Chemistry, Boston University, Boston, United States
| | - Srivatsan Raman
- Department of Biochemistry, University of Wisconsin-Madison, Madison, United States
- Department of Bacteriology, University of Wisconsin-Madison, Madison, United States
- Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, United States
| |
|
46
|
Xu P, Zhou K. Editorial overview: Analytical biotechnology for healthcare, strain engineering, biosensing and synthetic biology. Curr Opin Biotechnol 2022; 77:102765. [PMID: 35988531 DOI: 10.1016/j.copbio.2022.102765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022]
Affiliation(s)
- Peng Xu
- Department of Chemical Engineering, Guangdong - Technion, Israel Institute of Technology, Shantou 515063, China.
| | - Kang Zhou
- Department of Chemical and Biomolecular Engineering, National University of Singapore, Singapore 117585, Singapore.
| |
|