1. Mei H, Wang Z, Yang H, Li X, Xu Y. Network analysis of multivariate time series data in biological systems: methods and applications. Brief Bioinform 2025; 26:bbaf223. [PMID: 40401349 PMCID: PMC12096012 DOI: 10.1093/bib/bbaf223] [Received: 03/13/2025] [Revised: 04/17/2025] [Accepted: 04/30/2025] [Indexed: 05/23/2025]
Abstract
Network analysis has become an essential tool in biological and biomedical research, providing insights into complex biological mechanisms. Since biological systems are inherently time-dependent, incorporating time-varying methods is crucial for capturing temporal changes, adaptive interactions, and evolving dependencies within networks. Our study explores key time-varying methodologies for network structure estimation and network inference based on observed structures. We begin by discussing approaches for estimating network structures from data, focusing on the time-varying Gaussian graphical model, dynamic Bayesian network, and vector autoregression-based causal analysis. Next, we examine analytical techniques that leverage pre-specified or observed networks, including other autoregression-based methods and latent variable models. Furthermore, we explore practical applications and computational tools designed for these methods. By synthesizing these approaches, our study provides a comprehensive evaluation of their strengths and limitations in the context of biological data analysis.
Affiliation(s)
- Hao Mei
  - Center for Applied Statistics, School of Statistics, Institute of Health Data Science, Renmin University of China, 59 Zhongguancun Street, 100872 Beijing, China
- Zhiyuan Wang
  - Center for Applied Statistics, School of Statistics, Institute of Health Data Science, Renmin University of China, 59 Zhongguancun Street, 100872 Beijing, China
- Hang Yang
  - Center for Applied Statistics, School of Statistics, Institute of Health Data Science, Renmin University of China, 59 Zhongguancun Street, 100872 Beijing, China
- Xiaoke Li
  - Center for Applied Statistics, School of Statistics, Institute of Health Data Science, Renmin University of China, 59 Zhongguancun Street, 100872 Beijing, China
- Yaqing Xu
  - Department of Epidemiology and Biostatistics, School of Public Health, Shanghai Jiao Tong University School of Medicine, 227 South Chongqing Road, 200025 Shanghai, China
2. Ortega FM, Hossain F, Volobouev VV, Meloni G, Torabifard H, Morcos F. Generative Landscapes and Dynamics to Design Multidomain Artificial Transmembrane Transporters. bioRxiv 2025:2025.03.28.645293. [PMID: 40236216 PMCID: PMC11996383 DOI: 10.1101/2025.03.28.645293] [Indexed: 04/17/2025]
Abstract
Protein design is challenging as it requires simultaneous consideration of interconnected factors, such as fold, dynamics, and function. These evolutionary constraints are encoded in protein sequences and can be learned through the latent generative landscape (LGL) framework to predict functional sequences by leveraging evolutionary patterns, enabling exploration of uncharted sequence space. By simulating designed proteins through molecular dynamics (MD), we gain deeper insights into the interdependencies governing structure and dynamics. We present a synergized workflow combining LGL with MD and biochemical characterization, allowing us to explore the sequence space effectively. This approach has been applied to design and characterize two artificial multidomain ATP-driven transmembrane copper transporters, with native-like functionality. This integrative approach proved effective in unraveling the intricate relationships between sequence, structure, and function.
3. Kohout P, Vasina M, Majerova M, Novakova V, Damborsky J, Bednar D, Marek M, Prokop Z, Mazurenko S. Engineering Dehalogenase Enzymes Using Variational Autoencoder-Generated Latent Spaces and Microfluidics. JACS Au 2025; 5:838-850. [PMID: 40017771 PMCID: PMC11862945 DOI: 10.1021/jacsau.4c01101] [Received: 11/18/2024] [Revised: 01/23/2025] [Accepted: 01/30/2025] [Indexed: 03/01/2025]
Abstract
Enzymes play a crucial role in sustainable industrial applications, with their optimization posing a formidable challenge due to the intricate interplay among residues. Computational methodologies predominantly rely on evolutionary insights of homologous sequences. However, deciphering the evolutionary variability and complex dependencies among residues presents substantial hurdles. Here, we present a new machine-learning method based on variational autoencoders and evolutionary sampling strategy to address those limitations. We customized our method to generate novel sequences of model enzymes, haloalkane dehalogenases. Three design-build-test cycles improved the solubility of variants from 11% to 75%. Thorough experimental validation including the microfluidic device MicroPEX resulted in 20 multiple-point variants. Nine of them, sharing as little as 67% sequence similarity with the template, showed a melting temperature increase of up to 9 °C and an average improvement of 3 °C. The most stable variant demonstrated a 3.5-fold increase in activity compared to the template. High-quality experimental data collected with 20 variants represent a valuable data set for the critical validation of novel protein design approaches. Python scripts, jupyter notebooks, and data sets are available on GitHub (https://github.com/loschmidt/vae-dehalogenases), and interactive calculations will be possible via https://loschmidt.chemi.muni.cz/fireprotasr/.
Affiliation(s)
- Pavel Kohout
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Michal Vasina
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Marika Majerova
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Veronika Novakova
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Jiri Damborsky
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- David Bednar
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Martin Marek
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Zbynek Prokop
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Stanislav Mazurenko
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
4. Yang W, Ji J, Fang G. A metric and its derived protein network for evaluation of ortholog database inconsistency. BMC Bioinformatics 2025; 26:6. [PMID: 39773281 PMCID: PMC11707888 DOI: 10.1186/s12859-024-06023-x] [Received: 06/15/2023] [Accepted: 12/24/2024] [Indexed: 01/11/2025]
Abstract
BACKGROUND: Ortholog prediction, essential for various genomic research areas, faces growing inconsistencies amidst the expanding array of ortholog databases. The common strategy of computing consensus orthologs introduces additional arbitrariness, emphasizing the need to examine the causes of such inconsistencies and identify proteins susceptible to prediction errors. RESULTS: We introduce the Signal Jaccard Index (SJI), a novel metric rooted in unsupervised genome context clustering, designed to assess protein similarity. Leveraging SJI, we construct a protein network and reveal that peripheral proteins within the network are the primary contributors to inconsistencies in orthology predictions. Furthermore, we show that a protein's degree centrality in the network serves as a strong predictor of its reliability in consensus sets. CONCLUSIONS: We present an objective, unsupervised SJI-based network encompassing all proteins, in which its topological features elucidate ortholog prediction inconsistencies. The degree centrality (DC) effectively identifies error-prone orthology assignments without relying on arbitrary parameters. Notably, DC is stable, unaffected by species selection, and well-suited for ortholog benchmarking. This approach transcends the limitations of universal thresholds, offering a robust and quantitative framework to explore protein evolution and functional relationships.
Affiliation(s)
- Weijie Yang
  - NYU-Shanghai, Shanghai, 200120, China
  - Software Engineering Institute, East China Normal University, Shanghai, 200062, China
- Jingsi Ji
  - NYU-Shanghai, Shanghai, 200120, China
  - Software Engineering Institute, East China Normal University, Shanghai, 200062, China
- Gang Fang
  - NYU-Shanghai, Shanghai, 200120, China
  - Department of Biology, New York University, New York, NY, 10003, USA
  - Software Engineering Institute, East China Normal University, Shanghai, 200062, China
5. Ma Z, Li W, Shen Y, Xu Y, Liu G, Chang J, Li Z, Qin H, Tian B, Gong H, Liu DR, Thuronyi BW, Voigt CA, Zhang S. EvoAI enables extreme compression and reconstruction of the protein sequence space. Nat Methods 2025; 22:102-112. [PMID: 39528677 DOI: 10.1038/s41592-024-02504-2] [Received: 02/05/2024] [Accepted: 10/10/2024] [Indexed: 11/16/2024]
Abstract
Designing proteins with improved functions requires a deep understanding of how sequence and function are related, a vast space that is hard to explore. The ability to efficiently compress this space by identifying functionally important features is extremely valuable. Here we establish a method called EvoScan to comprehensively segment and scan the high-fitness sequence space to obtain anchor points that capture its essential features, especially in high dimensions. Our approach is compatible with any biomolecular function that can be coupled to a transcriptional output. We then develop deep learning and large language models to accurately reconstruct the space from these anchors, allowing computational prediction of novel, highly fit sequences without prior homology-derived or structural information. We apply this hybrid experimental-computational method, which we call EvoAI, to a repressor protein and find that only 82 anchors are sufficient to compress the high-fitness sequence space with a compression ratio of 10^48. The extreme compressibility of the space informs both applied biomolecular design and understanding of natural evolution.
Affiliation(s)
- Ziyuan Ma
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Wenjie Li
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Yunhao Shen
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Yunxin Xu
  - School of Life Sciences, Tsinghua University, Beijing, China
- Gengjiang Liu
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Jiamin Chang
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Zeju Li
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Hong Qin
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Boxue Tian
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
  - State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
- Haipeng Gong
  - School of Life Sciences, Tsinghua University, Beijing, China
- David R Liu
  - Merkin Institute of Transformative Technologies in Healthcare, Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Department of Chemistry and Chemical Biology, Harvard University, Cambridge, MA, USA
  - Howard Hughes Medical Institute, Harvard University, Cambridge, MA, USA
- B W Thuronyi
  - Department of Chemistry, Williams College, Williamstown, MA, USA
- Christopher A Voigt
  - Synthetic Biology Center, Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Shuyi Zhang
  - School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
  - State Key Laboratory of Molecular Oncology, School of Pharmaceutical Sciences, Tsinghua University, Beijing, China
  - Center for Synthetic and Systems Biology, Tsinghua University, Beijing, China
  - Beijing Frontier Research Center for Biological Structure, Tsinghua University, Beijing, China
6. Harding-Larsen D, Funk J, Madsen NG, Gharabli H, Acevedo-Rocha CG, Mazurenko S, Welner DH. Protein representations: Encoding biological information for machine learning in biocatalysis. Biotechnol Adv 2024; 77:108459. [PMID: 39366493 DOI: 10.1016/j.biotechadv.2024.108459] [Received: 04/18/2024] [Revised: 09/19/2024] [Accepted: 09/29/2024] [Indexed: 10/06/2024]
Abstract
Enzymes offer a more environmentally friendly and low-impact solution to conventional chemistry, but they often require additional engineering for their application in industrial settings, an endeavour that is challenging and laborious. To address this issue, the power of machine learning can be harnessed to produce predictive models that enable the in silico study and engineering of improved enzymatic properties. Such machine learning models, however, require the conversion of the complex biological information to a numerical input, also called protein representations. These inputs demand special attention to ensure the training of accurate and precise models, and, in this review, we therefore examine the critical step of encoding protein information to numeric representations for use in machine learning. We selected the most important approaches for encoding the three distinct biological protein representations - primary sequence, 3D structure, and dynamics - to explore their requirements for employment and inductive biases. Combined representations of proteins and substrates are also introduced as emergent tools in biocatalysis. We propose the division of fixed representations, a collection of rule-based encoding strategies, and learned representations extracted from the latent spaces of large neural networks. To select the most suitable protein representation, we propose two main factors to consider. The first one is the model setup, which is influenced by the size of the training dataset and the choice of architecture. The second factor is the model objectives such as consideration about the assayed property, the difference between wild-type models and mutant predictors, and requirements for explainability. This review is aimed at serving as a source of information and guidance for properly representing enzymes in future machine learning models for biocatalysis.
Affiliation(s)
- David Harding-Larsen
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Jonathan Funk
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Niklas Gesmar Madsen
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Hani Gharabli
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Carlos G Acevedo-Rocha
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Stanislav Mazurenko
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - International Clinical Research Center, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Ditte Hededam Welner
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
7. Lian X, Praljak N, Subramanian SK, Wasinger S, Ranganathan R, Ferguson AL. Deep-learning-based design of synthetic orthologs of SH3 signaling domains. Cell Syst 2024; 15:725-737.e7. [PMID: 39106868 PMCID: PMC11879475 DOI: 10.1016/j.cels.2024.07.005] [Received: 10/17/2023] [Revised: 11/12/2023] [Accepted: 07/22/2024] [Indexed: 08/09/2024]
Abstract
Evolution-based deep generative models represent an exciting direction in understanding and designing proteins. An open question is whether such models can learn specialized functional constraints that control fitness in specific biological contexts. Here, we examine the ability of generative models to produce synthetic versions of Src-homology 3 (SH3) domains that mediate signaling in the Sho1 osmotic stress response pathway of yeast. We show that a variational autoencoder (VAE) model produces artificial sequences that experimentally recapitulate the function of natural SH3 domains. More generally, the model organizes all fungal SH3 domains such that locality in the model latent space (but not simply locality in sequence space) enriches the design of synthetic orthologs and exposes non-obvious amino acid constraints distributed near and far from the SH3 ligand-binding site. The ability of generative models to design ortholog-like functions in vivo opens new avenues for engineering protein function in specific cellular contexts and environments.
Affiliation(s)
- Xinran Lian
  - Department of Chemistry, University of Chicago, Chicago, IL 60637, USA
- Nikša Praljak
  - Graduate Program in Biophysical Sciences, University of Chicago, Chicago, IL 60637, USA
- Subu K Subramanian
  - Department of Molecular and Cell Biology, California Institute for Quantitative Biosciences (QB3), and Howard Hughes Medical Institute, University of California, Berkeley, Berkeley, CA 94720, USA
- Sarah Wasinger
  - Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA
- Rama Ranganathan
  - Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA
  - Center for Physics of Evolving Systems and Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, IL 60637, USA
- Andrew L Ferguson
  - Pritzker School for Molecular Engineering, University of Chicago, Chicago, IL 60637, USA
8. Fu X. How deep can we decipher protein evolution with deep learning models. Patterns (N Y) 2024; 5:101043. [PMID: 39233697 PMCID: PMC11368669 DOI: 10.1016/j.patter.2024.101043] [Indexed: 09/06/2024]
Abstract
Evolutionary-based machine learning models have emerged as a fascinating approach to mapping the landscape for protein evolution. Lian et al. demonstrated that evolution-based deep generative models, specifically variational autoencoders, can organize SH3 homologs in a hierarchical latent space, effectively distinguishing the specific Sho1SH3 domains.
Affiliation(s)
- Xiaozhi Fu
  - Department of Life Sciences, Chalmers University of Technology, Kemivägen 10, SE-412 96 Gothenburg, Sweden
9. Wang H, Chen M, Wei X, Xia R, Pei D, Huang X, Han B. Computational tools for plant genomics and breeding. Sci China Life Sci 2024; 67:1579-1590. [PMID: 38676814 DOI: 10.1007/s11427-024-2578-6] [Received: 02/05/2024] [Accepted: 03/25/2024] [Indexed: 04/29/2024]
Abstract
Plant genomics and crop breeding are at the intersection of biotechnology and information technology. Driven by a combination of high-throughput sequencing, molecular biology and data science, great advances have been made in omics technologies at every step along the central dogma, especially in genome assembling, genome annotation, epigenomic profiling, and transcriptome profiling. These advances further revolutionized three directions of development. One is genetic dissection of complex traits in crops, along with genomic prediction and selection. The second is comparative genomics and evolution, which open up new opportunities to depict the evolutionary constraints of biological sequences for deleterious variant discovery. The third direction is the development of deep learning approaches for the rational design of biological sequences, especially proteins, for synthetic biology. All three directions of development serve as the foundation for a new era of crop breeding where agronomic traits are enhanced by genome design.
Affiliation(s)
- Hai Wang
  - State Key Laboratory of Maize Bio-breeding, Frontiers Science Center for Molecular Design Breeding, Joint International Research Laboratory of Crop Molecular Breeding, National Maize Improvement Center, College of Agronomy and Biotechnology, China Agricultural University, Beijing, 100193, China
  - Sanya Institute of China Agricultural University, Sanya, 572025, China
  - Hainan Yazhou Bay Seed Laboratory, Sanya, 572025, China
- Mengjiao Chen
  - State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of the State Forestry and Grassland Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing, 100091, China
- Xin Wei
  - Shanghai Key Laboratory of Plant Molecular Sciences, College of Life Sciences, Shanghai Normal University, Shanghai, 200234, China
- Rui Xia
  - College of Horticulture, South China Agricultural University, Guangzhou, 510640, China
- Dong Pei
  - State Key Laboratory of Tree Genetics and Breeding, Key Laboratory of Tree Breeding and Cultivation of the State Forestry and Grassland Administration, Research Institute of Forestry, Chinese Academy of Forestry, Beijing, 100091, China
- Xuehui Huang
  - Shanghai Key Laboratory of Plant Molecular Sciences, College of Life Sciences, Shanghai Normal University, Shanghai, 200234, China
- Bin Han
  - National Center for Gene Research, CAS Center for Excellence in Molecular Plant Sciences, Chinese Academy of Sciences, Shanghai, 200233, China
10. Zhou Z, Zhang L, Yu Y, Wu B, Li M, Hong L, Tan P. Enhancing efficiency of protein language models with minimal wet-lab data through few-shot learning. Nat Commun 2024; 15:5566. [PMID: 38956442 PMCID: PMC11219809 DOI: 10.1038/s41467-024-49798-6] [Received: 02/02/2024] [Accepted: 06/11/2024] [Indexed: 07/04/2024]
Abstract
Accurately modeling the protein fitness landscapes holds great importance for protein engineering. Pre-trained protein language models have achieved state-of-the-art performance in predicting protein fitness without wet-lab experimental data, but their accuracy and interpretability remain limited. On the other hand, traditional supervised deep learning models require abundant labeled training examples for performance improvements, posing a practical barrier. In this work, we introduce FSFP, a training strategy that can effectively optimize protein language models under extreme data scarcity for fitness prediction. By combining meta-transfer learning, learning to rank, and parameter-efficient fine-tuning, FSFP can significantly boost the performance of various protein language models using merely tens of labeled single-site mutants from the target protein. In silico benchmarks across 87 deep mutational scanning datasets demonstrate FSFP's superiority over both unsupervised and supervised baselines. Furthermore, we successfully apply FSFP to engineer the Phi29 DNA polymerase through wet-lab experiments, achieving a 25% increase in the positive rate. These results underscore the potential of our approach in aiding AI-guided protein engineering.
Affiliation(s)
- Ziyi Zhou
  - School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
- Liang Zhang
  - School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
- Yuanxi Yu
  - School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
- Banghao Wu
  - School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Mingchen Li
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, 200237, China
- Liang Hong
  - School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
  - Zhang Jiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, 201203, China
- Pan Tan
  - School of Physics and Astronomy, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai National Center for Applied Mathematics (SJTU Center) & Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, 200232, China
11. Greener JG, Jamali K. Fast protein structure searching using structure graph embeddings. Bioinform Adv 2025; 5:vbaf042. [PMID: 40196750 PMCID: PMC11974391 DOI: 10.1093/bioadv/vbaf042] [Received: 01/07/2025] [Revised: 02/11/2025] [Accepted: 03/03/2025] [Indexed: 04/09/2025]
Abstract
Comparing and searching protein structures independent of primary sequence has proved useful for remote homology detection, function annotation, and protein classification. Fast and accurate methods to search with structures will be essential to make use of the vast databases that have recently become available, in the same way that fast protein sequence searching underpins much of bioinformatics. We train a simple graph neural network using supervised contrastive learning to learn a low-dimensional embedding of protein domains. Availability and implementation: The method, called Progres, is available as software at https://github.com/greener-group/progres and as a web server at https://progres.mrc-lmb.cam.ac.uk. It has accuracy comparable to the best current methods and can search the AlphaFold database TED domains in a tenth of a second per query on CPU.
Affiliation(s)
- Joe G Greener
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, CB2 0QH, United Kingdom
- Kiarash Jamali
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, CB2 0QH, United Kingdom
12. Zhang S, Ma Z, Li W, Shen Y, Xu Y, Liu G, Chang J, Li Z, Qin H, Tian B, Gong H, Liu D, Thuronyi B, Voigt C. EvoAI enables extreme compression and reconstruction of the protein sequence space. Research Square 2024:rs.3.rs-3930833. [PMID: 38464127 PMCID: PMC10925456 DOI: 10.21203/rs.3.rs-3930833/v1] [Indexed: 03/12/2024]
Abstract
Designing proteins with improved functions requires a deep understanding of how sequence and function are related, a vast space that is hard to explore. The ability to efficiently compress this space by identifying functionally important features is extremely valuable. Here, we first establish a method called EvoScan to comprehensively segment and scan the high-fitness sequence space to obtain anchor points that capture its essential features, especially in high dimensions. Our approach is compatible with any biomolecular function that can be coupled to a transcriptional output. We then develop deep learning and large language models to accurately reconstruct the space from these anchors, allowing computational prediction of novel, highly fit sequences without prior homology-derived or structural information. We apply this hybrid experimental-computational method, which we call EvoAI, to a repressor protein and find that only 82 anchors are sufficient to compress the high-fitness sequence space with a compression ratio of 10^48. The extreme compressibility of the space informs both applied biomolecular design and understanding of natural evolution.
13. Cui Y, Chen Y, Sun J, Zhu T, Pang H, Li C, Geng WC, Wu B. Computational redesign of a hydrolase for nearly complete PET depolymerization at industrially relevant high-solids loading. Nat Commun 2024; 15:1417. [PMID: 38360963 PMCID: PMC10869840 DOI: 10.1038/s41467-024-45662-9] [Received: 06/02/2023] [Accepted: 01/30/2024] [Indexed: 02/17/2024]
Abstract
Biotechnological plastic recycling has emerged as a suitable option for addressing the pollution crisis. A major breakthrough in the biodegradation of poly(ethylene terephthalate) (PET) is achieved by using an LCC variant, which permits 90% conversion at an industrial level. Despite the achievements, its applications have been hampered by the remaining 10% of nonbiodegradable PET. Herein, we address current challenges by employing a computational strategy to engineer a hydrolase from the bacterium HR29. The redesigned variant, TurboPETase, outperforms other well-known PET hydrolases. Nearly complete depolymerization is accomplished in 8 h at a solids loading of 200 g kg⁻¹. Kinetic and structural analyses suggest that the improved performance may be attributed to a more flexible PET-binding groove that facilitates the targeting of more specific attack sites. Collectively, our results constitute a significant advance in understanding and engineering of industrially applicable polyester hydrolases, and provide guidance for further efforts on other polymer types.
Affiliation(s)
- Yinglu Cui
  - AIM Center, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- Yanchun Chen
  - AIM Center, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- Jinyuan Sun
  - AIM Center, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
  - University of Chinese Academy of Sciences, Beijing, China
- Tong Zhu
  - AIM Center, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- Hua Pang
  - AIM Center, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- Chunli Li
  - AIM Center, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- Wen-Chao Geng
  - AIM Center, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
  - College of Chemistry, Nankai University, Tianjin, China
- Bian Wu
  - AIM Center, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
|
14
|
Champagne SE, Chiang CH, Gemmel PM, Brooks CL, Narayan ARH. Biocatalytic Stereoselective Oxidation of 2-Arylindoles. J Am Chem Soc 2024; 146:2728-2735. [PMID: 38237569 PMCID: PMC11214688 DOI: 10.1021/jacs.3c12393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/01/2024]
Abstract
3-Hydroxyindolenines can be used to access several structural motifs that are featured in natural products and pharmaceutical compounds, yet the chemical synthesis of 3-hydroxyindolenines is complicated by overoxidation, rearrangements, and complex product mixtures. The selectivity possible in enzymatic reactions can overcome these challenges and deliver enantioenriched products. Herein, we present the development of an asymmetric biocatalytic oxidation of 2-arylindole substrates aided by a curated library of flavin-dependent monooxygenases (FDMOs) sampled from an ancestral sequence space, a sequence similarity network, and a deep-learning-based latent space model. From this library of FDMOs, a previously uncharacterized enzyme, Champase, from the Valley fever fungus, Coccidioides immitis strain RS, was found to stereoselectively catalyze the oxidation of a variety of substituted indole substrates. The promiscuity of this enzyme is showcased by the oxidation of a wide variety of substituted 2-arylindoles to afford the respective 3-hydroxyindolenine products in moderate to excellent yields and up to 95:5 er.
Affiliation(s)
- Sarah E. Champagne
  - Life Sciences Institute, University of Michigan, Ann Arbor, Michigan 48109, USA
  - Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, USA
- Chang-Hwa Chiang
  - Life Sciences Institute, University of Michigan, Ann Arbor, Michigan 48109, USA
  - Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, USA
- Philipp M. Gemmel
  - Life Sciences Institute, University of Michigan, Ann Arbor, Michigan 48109, USA
- Charles L. Brooks
  - Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, USA
  - Enhanced Program in Biophysics, University of Michigan, Ann Arbor, Michigan 48109, USA
  - Program in Chemical Biology, University of Michigan, Ann Arbor, Michigan 48109, USA
- Alison R. H. Narayan
  - Life Sciences Institute, University of Michigan, Ann Arbor, Michigan 48109, USA
  - Department of Chemistry, University of Michigan, Ann Arbor, Michigan 48109, USA
  - Program in Chemical Biology, University of Michigan, Ann Arbor, Michigan 48109, USA
|
15
|
Qin S, Sun S, Wang Y, Li C, Fu L, Wu M, Yan J, Li W, Lv J, Chen L. Immune, metabolic landscapes of prognostic signatures for lung adenocarcinoma based on a novel deep learning framework. Sci Rep 2024; 14:527. [PMID: 38177198 PMCID: PMC10767103 DOI: 10.1038/s41598-023-51108-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Accepted: 12/30/2023] [Indexed: 01/06/2024] Open
Abstract
Lung adenocarcinoma (LUAD) is a malignant tumor with high lethality, and the aim of this study was to identify promising biomarkers for LUAD. Using the TCGA-LUAD dataset as a discovery cohort, a novel joint framework, VAEjMLP, based on a variational autoencoder (VAE) and a multilayer perceptron (MLP), was proposed. The Shapley Additive Explanations (SHAP) method was introduced to evaluate the contribution of feature genes to the classification decision, which helped us to develop a biologically meaningful biomarker potential scoring algorithm. Nineteen potential biomarkers for LUAD were identified, which were involved in the regulation of immune and metabolic functions in LUAD. A prognostic risk model for LUAD was constructed from the biomarkers HLA-DRB1, SCGB1A1, and HLA-DRB5, which were screened by Cox regression analysis, dividing the patients into high-risk and low-risk groups. The prognostic risk model was validated with external datasets. The low-risk group was characterized by enrichment of immune pathways and higher immune infiltration compared to the high-risk group, whereas the high-risk group was accompanied by an increase in metabolic pathway activity. There were significant differences between the high- and low-risk groups in metabolic reprogramming of aerobic glycolysis, amino acids, and lipids, as well as in angiogenic activity, epithelial-mesenchymal transition, tumorigenic cytokines, and inflammatory response. Furthermore, high-risk patients were more sensitive to Afatinib, Gefitinib, and Gemcitabine as predicted by the pRRophetic algorithm. This study provides prognostic signatures capable of revealing the immune and metabolic landscapes for LUAD, and may shed light on the identification of other cancer biomarkers.
Affiliation(s)
- Shimei Qin
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150000, China
- Shibin Sun
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150000, China
- Yahui Wang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150000, China
- Chao Li
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150000, China
- Lei Fu
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150000, China
- Ming Wu
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150000, China
- Jinxing Yan
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150000, China
- Wan Li
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150000, China
- Junjie Lv
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150000, China
- Lina Chen
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150000, China
|
16
|
Praljak N, Lian X, Ranganathan R, Ferguson AL. ProtWave-VAE: Integrating Autoregressive Sampling with Latent-Based Inference for Data-Driven Protein Design. ACS Synth Biol 2023; 12:3544-3561. [PMID: 37988083 PMCID: PMC10911954 DOI: 10.1021/acssynbio.3c00261] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 11/22/2023]
Abstract
Deep generative models (DGMs) have shown great success in the understanding and data-driven design of proteins. Variational autoencoders (VAEs) are a popular DGM approach that can learn the correlated patterns of amino acid mutations within a multiple sequence alignment (MSA) of protein sequences and distill this information into a low-dimensional latent space to expose phylogenetic and functional relationships and guide generative protein design. Autoregressive (AR) models are another popular DGM approach that typically lacks a low-dimensional latent embedding but does not require training sequences to be aligned into an MSA and enables the design of variable-length proteins. In this work, we propose ProtWave-VAE as a novel and lightweight DGM, employing an information-maximizing VAE with a dilated convolution encoder and an autoregressive WaveNet decoder. This architecture blends the strengths of the VAE and AR paradigms in enabling training over unaligned sequence data and the conditional generative design of variable-length sequences from an interpretable, low-dimensional learned latent space. We evaluated the model's ability to infer patterns and design rules within alignment-free homologous protein family sequences and to design novel synthetic proteins in four diverse protein families. We show that our model can infer meaningful functional and phylogenetic embeddings within latent spaces and make highly accurate predictions within semisupervised downstream fitness prediction tasks. In an application to the C-terminal SH3 domain in the Sho1 transmembrane osmosensing receptor in baker's yeast, we subject ProtWave-VAE-designed sequences to experimental gene synthesis and select-seq assays for the osmosensing function to show that the model enables synthetic protein design, conditional C-terminus diversification, and engineering of the osmosensing function into SH3 paralogues.
Affiliation(s)
- Nikša Praljak
  - Graduate Program in Biophysical Sciences, University of Chicago, Chicago, Illinois 60637, United States
- Xinran Lian
  - Department of Chemistry, University of Chicago, Chicago, Illinois 60637, United States
- Rama Ranganathan
  - Center for Physics of Evolving Systems and Department of Biochemistry and Molecular Biology, University of Chicago, Chicago, Illinois 60637, United States
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
- Andrew L Ferguson
  - Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
|
17
|
James JK, Norland K, Johar AS, Kullo IJ. Deep generative models of LDLR protein structure to predict variant pathogenicity. J Lipid Res 2023; 64:100455. [PMID: 37821076 PMCID: PMC10696256 DOI: 10.1016/j.jlr.2023.100455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Revised: 09/16/2023] [Accepted: 10/05/2023] [Indexed: 10/13/2023] Open
Abstract
The complex structure and function of low density lipoprotein receptor (LDLR) make classification of protein-coding missense variants challenging. Deep generative models, including Evolutionary model of Variant Effect (EVE), Evolutionary Scale Modeling (ESM), and AlphaFold 2 (AF2), have enabled significant progress in the prediction of protein structure and function. ESM and EVE directly estimate the likelihood of a variant sequence but are purely data-driven and challenging to interpret. AF2 predicts LDLR structures, but variant effects are explicitly modeled by estimating changes in stability. We tested the effectiveness of these models for predicting variant pathogenicity compared to established methods. AF2 produced two distinct conformations based on a novel hinge mechanism. Within ESM's hidden space, benign and pathogenic variants had different distributions. In EVE, these distributions were similar. EVE and ESM were comparable to Polyphen-2, SIFT, REVEL, and Primate AI for predicting binary classifications in ClinVar. However, they were more strongly correlated with experimental measures of LDL uptake. AF2 performed poorly in these tasks. Using the UK Biobank to compare association with clinical phenotypes, ESM and EVE were more strongly associated with serum LDL-C than Polyphen-2. ESM was able to identify variants with more extreme LDL-C levels than EVE and had a significantly stronger association with atherosclerotic cardiovascular disease. In conclusion, AF2-predicted LDLR structures do not accurately model variant pathogenicity. ESM and EVE are competitive with prior scoring methods for prediction based on binary classifications in ClinVar but are superior based on correlations with experimental assays and clinical phenotypes.
Affiliation(s)
- Jose K James
  - Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
- Kristjan Norland
  - Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
- Angad S Johar
  - Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
- Iftikhar J Kullo
  - Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
  - Gonda Vascular Center, Mayo Clinic, Rochester, MN, USA
|
18
|
Kouba P, Kohout P, Haddadi F, Bushuiev A, Samusevich R, Sedlar J, Damborsky J, Pluskal T, Sivic J, Mazurenko S. Machine Learning-Guided Protein Engineering. ACS Catal 2023; 13:13863-13895. [PMID: 37942269 PMCID: PMC10629210 DOI: 10.1021/acscatal.3c02743] [Citation(s) in RCA: 45] [Impact Index Per Article: 22.5] [Received: 06/15/2023] [Revised: 09/20/2023] [Indexed: 11/10/2023]
Abstract
Recent progress in engineering highly promising biocatalysts has increasingly involved machine learning methods. These methods leverage existing experimental and simulation data to aid in the discovery and annotation of promising enzymes, as well as in suggesting beneficial mutations for improving known targets. The field of machine learning for protein engineering is gathering steam, driven by recent success stories and notable progress in other areas. It already encompasses ambitious tasks such as understanding and predicting protein structure and function, catalytic efficiency, enantioselectivity, protein dynamics, stability, solubility, aggregation, and more. Nonetheless, the field is still evolving, with many challenges to overcome and questions to address. In this Perspective, we provide an overview of ongoing trends in this domain, highlight recent case studies, and examine the current limitations of machine learning-based methods. We emphasize the crucial importance of thorough experimental validation of emerging models before their use for rational protein design. We present our opinions on the fundamental problems and outline the potential directions for future research.
Affiliation(s)
- Petr Kouba
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
  - Faculty of Electrical Engineering, Czech Technical University in Prague, Technicka 2, 166 27 Prague 6, Czech Republic
- Pavel Kohout
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Faraneh Haddadi
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Anton Bushuiev
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Raman Samusevich
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
  - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
- Jiri Sedlar
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Jiri Damborsky
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Tomas Pluskal
  - Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
- Josef Sivic
  - Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Stanislav Mazurenko
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
  - International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
|
19
|
Michailidou F. The Scent of Change: Sustainable Fragrances Through Industrial Biotechnology. Chembiochem 2023; 24:e202300309. [PMID: 37668275 DOI: 10.1002/cbic.202300309] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/19/2023] [Revised: 05/29/2023] [Indexed: 09/06/2023]
Abstract
Current environmental and safety considerations urge innovation to address the need for sustainable high-value chemicals that are embraced by consumers. This review discusses the concept of sustainable fragrances, as high-value, everyday and everywhere chemicals. Current and emerging technologies represent an opportunity to produce fragrances in an environmentally and socially responsible way. Biotechnology, including fermentation, biocatalysis, and genetic engineering, has the potential to reduce the environmental footprint of fragrance production while maintaining quality and consistency. Computational and in silico methods, including machine learning (ML), are also likely to augment the capabilities of sustainable fragrance production. Continued innovation and collaboration will be crucial to the future of sustainable fragrances, with a focus on developing novel sustainable ingredients, as well as ethical sourcing practices.
Affiliation(s)
- Freideriki Michailidou
  - Department of Health Sciences and Technology, ETH Zurich, Schmelzbergstrasse 9, 8092, Zürich, Switzerland
|
20
|
Braghetto A, Orlandini E, Baiesi M. Interpretable Machine Learning of Amino Acid Patterns in Proteins: A Statistical Ensemble Approach. J Chem Theory Comput 2023; 19:6011-6022. [PMID: 37552831 PMCID: PMC10500975 DOI: 10.1021/acs.jctc.3c00383] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/05/2023] [Indexed: 08/10/2023]
Abstract
Explainable and interpretable unsupervised machine learning helps one to understand the underlying structure of data. We introduce an ensemble analysis of machine learning models to consolidate their interpretation. Its application shows that restricted Boltzmann machines compress consistently into a few bits the information stored in a sequence of five amino acids at the start or end of α-helices or β-sheets. The weights learned by the machines reveal unexpected properties of the amino acids and the secondary structure of proteins: (i) His and Thr have a negligible contribution to the amphiphilic pattern of α-helices; (ii) there is a class of α-helices particularly rich in Ala at their end; (iii) Pro occupies most often slots otherwise occupied by polar or charged amino acids, and its presence at the start of helices is relevant; (iv) Glu and especially Asp on one side and Val, Leu, Iso, and Phe on the other display the strongest tendency to mark amphiphilic patterns, i.e., extreme values of an effective hydrophobicity, though they are not the most powerful (non)hydrophobic amino acids.
Affiliation(s)
- Anna Braghetto
  - Department of Physics and Astronomy, University of Padova, Via Marzolo 8, 35131 Padua, Italy
  - INFN, Sezione di Padova, Via Marzolo 8, 35131 Padua, Italy
- Enzo Orlandini
  - Department of Physics and Astronomy, University of Padova, Via Marzolo 8, 35131 Padua, Italy
  - INFN, Sezione di Padova, Via Marzolo 8, 35131 Padua, Italy
- Marco Baiesi
  - Department of Physics and Astronomy, University of Padova, Via Marzolo 8, 35131 Padua, Italy
  - INFN, Sezione di Padova, Via Marzolo 8, 35131 Padua, Italy
|
21
|
Li Y, Yao Y, Xia Y, Tang M. Searching for protein variants with desired properties using deep generative models. BMC Bioinformatics 2023; 24:297. [PMID: 37480001 PMCID: PMC10362698 DOI: 10.1186/s12859-023-05415-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2022] [Accepted: 07/17/2023] [Indexed: 07/23/2023] Open
Abstract
BACKGROUND: Protein engineering aims to improve the functional properties of existing proteins to meet people's needs. Current deep learning-based models have captured evolutionary, functional, and biochemical features contained in amino acid sequences. However, the existing generative models need to be improved when capturing the relationship between amino acid sites on longer sequences. At the same time, the distribution of protein sequences in the homologous family has a specific positional relationship in the latent space. We want to use this relationship to search for new variants directly from the vicinity of better-performing varieties. RESULTS: To improve the representation learning ability of the model for longer sequences and the similarity between the generated sequences and the original sequences, we propose a temporal variational autoencoder (T-VAE) model. T-VAE consists of an encoder and a decoder. The encoder expands the receptive field of neurons in the network structure by dilated causal convolution, thereby improving the encoding representation ability of longer sequences. The decoder decodes the sampled data into variants closely resembling the original sequence. CONCLUSION: Compared to other models, the Pearson correlation coefficient between the predicted values of protein fitness obtained by T-VAE and the truth values was higher, and the mean absolute deviation was lower. In addition, the T-VAE model has a better representation learning ability for longer sequences when comparing the encoding of protein sequences of different lengths. These results show that our model has more advantages in representation learning for longer sequences. To verify the model's generative effect, we also calculate the sequence identity between the generated data and the input data. The sequence identity obtained by T-VAE improved by 12.9% compared to the baseline model.
Affiliation(s)
- Yan Li
  - School of Information, Yunnan Normal University, Kunming, China
- Yinying Yao
  - National Key Laboratory of Crop Genetic Improvement and National Centre of Plant Gene Research, Huazhong Agricultural University, Wuhan, China
  - Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
- Yu Xia
  - School of Information, Yunnan Normal University, Kunming, China
- Mingjing Tang
  - Engineering Research Center of Sustainable Development and Utilization of Biomass Energy, Ministry of Education, Yunnan Normal University, Kunming, China
  - School of Life Science, Yunnan Normal University, Kunming, China
|
22
|
Johnston KE, Fannjiang C, Wittmann BJ, Hie BL, Yang KK, Wu Z. Machine Learning for Protein Engineering. arXiv 2023:arXiv:2305.16634v1. [PMID: 37292483 PMCID: PMC10246115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/10/2023]
Abstract
Directed evolution of proteins has been the most effective method for protein engineering. However, a new paradigm is emerging, fusing the library generation and screening approaches of traditional directed evolution with computation through the training of machine learning models on protein sequence fitness data. This chapter highlights successful applications of machine learning to protein engineering and directed evolution, organized by the improvements that have been made with respect to each step of the directed evolution cycle. Additionally, we provide an outlook for the future based on the current direction of the field, namely in the development of calibrated models and in incorporating other modalities, such as protein structure.
Affiliation(s)
- Bruce J Wittmann
  - Work done while at California Institute of Technology, now at Microsoft
|
23
|
Chen Z, Wang X, Chen X, Huang J, Wang C, Wang J, Wang Z. Accelerating therapeutic protein design with computational approaches toward the clinical stage. Comput Struct Biotechnol J 2023; 21:2909-2926. [PMID: 38213894 PMCID: PMC10781723 DOI: 10.1016/j.csbj.2023.04.027] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 02/15/2023] [Revised: 04/11/2023] [Accepted: 04/27/2023] [Indexed: 01/13/2024] Open
Abstract
Therapeutic proteins, represented by antibodies, are of increasing interest in human medicine. However, clinical translation of therapeutic proteins is still largely hindered by different aspects of developability, including affinity and selectivity, stability and aggregation prevention, solubility and viscosity reduction, and deimmunization. Conventional optimization of developability with widely used methods, like display technologies and library screening approaches, is a time- and cost-intensive endeavor, and the efficiency in finding suitable solutions is still not enough to meet clinical needs. In recent years, the accelerated advancement of computational methodologies has ushered in a transformative era in the field of therapeutic protein design. Owing to their remarkable capabilities in feature extraction and modeling, the integration of cutting-edge computational strategies with conventional techniques presents a promising avenue to accelerate the progression of therapeutic protein design and optimization toward clinical implementation. Here, we compare therapeutic proteins and small molecules in terms of developability and provide an overview of the computational approaches applicable to the design or optimization of therapeutic proteins for several developability issues.
Affiliation(s)
- Zhidong Chen
  - Department of Pathology, The Eighth Affiliated Hospital, Sun Yat-sen University, Shenzhen 518033, China
  - School of Pharmaceutical Sciences, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
- Xinpei Wang
  - School of Pharmaceutical Sciences, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
- Xu Chen
  - School of Pharmaceutical Sciences, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
- Juyang Huang
  - School of Pharmaceutical Sciences, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
- Chenglin Wang
  - Shenzhen Qiyu Biotechnology Co., Ltd, Shenzhen 518107, China
- Junqing Wang
  - School of Pharmaceutical Sciences, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
- Zhe Wang
  - Department of Pathology, The Eighth Affiliated Hospital, Sun Yat-sen University, Shenzhen 518033, China
|
24
|
Ziegler C, Martin J, Sinner C, Morcos F. Latent generative landscapes as maps of functional diversity in protein sequence space. Nat Commun 2023; 14:2222. [PMID: 37076519 PMCID: PMC10113739 DOI: 10.1038/s41467-023-37958-z] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 08/19/2022] [Accepted: 04/05/2023] [Indexed: 04/21/2023] Open
Abstract
Variational autoencoders are unsupervised learning models with generative capabilities; when applied to protein data, they classify sequences by phylogeny and generate de novo sequences which preserve statistical properties of protein composition. While previous studies focus on clustering and generative features, here we evaluate the underlying latent manifold in which sequence information is embedded. To investigate properties of the latent manifold, we utilize direct coupling analysis and a Potts Hamiltonian model to construct a latent generative landscape. We showcase how this landscape captures phylogenetic groupings and functional and fitness properties of several systems, including globins, β-lactamases, ion channels, and transcription factors. We provide support on how the landscape helps us understand the effects of sequence variability observed in experimental data and provides insights on directed and natural protein evolution. We propose that combining generative properties and functional predictive power of variational autoencoders and coevolutionary analysis could be beneficial in applications for protein engineering and design.
Affiliation(s)
- Cheyenne Ziegler
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
- Jonathan Martin
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
- Claude Sinner
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
- Faruck Morcos
  - Department of Biological Sciences, University of Texas at Dallas, Richardson, TX, 75080, USA
  - Department of Bioengineering, University of Texas at Dallas, Richardson, TX, 75080, USA
  - Center for Systems Biology, University of Texas at Dallas, Richardson, TX, 75080, USA
|
25
|
Dou Z, Sun Y, Jiang X, Wu X, Li Y, Gong B, Wang L. Data-driven strategies for the computational design of enzyme thermal stability: trends, perspectives, and prospects. Acta Biochim Biophys Sin (Shanghai) 2023; 55:343-355. [PMID: 37143326 PMCID: PMC10160227 DOI: 10.3724/abbs.2023033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2022] [Accepted: 11/23/2022] [Indexed: 03/05/2023] Open
Abstract
Thermal stability is one of the most important properties of enzymes, which sustains life and determines the potential for the industrial application of biocatalysts. Although traditional methods such as directed evolution and classical rational design contribute greatly to this field, the enormous sequence space of proteins implies costly and arduous experiments. The development of enzyme engineering focuses on automated and efficient strategies because of the breakthrough of high-throughput DNA sequencing and machine learning models. In this review, we propose a data-driven architecture for enzyme thermostability engineering and summarize some widely adopted datasets, as well as machine learning-driven approaches for designing the thermal stability of enzymes. In addition, we present a series of existing challenges while applying machine learning in enzyme thermostability design, such as the data dilemma, model training, and use of the proposed models. Additionally, a few promising directions for enhancing the performance of the models are discussed. We anticipate that the efficient incorporation of machine learning can provide more insights and solutions for the design of enzyme thermostability in the coming years.
Affiliation(s)
- Zhixin Dou
- State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China
- Yuqing Sun
- School of Software, Shandong University, Jinan 250101, China
- Xukai Jiang
- National Glycoengineering Research Center, Shandong University, Qingdao 266237, China
- Xiuyun Wu
- State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China
- Yingjie Li
- State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China
- Bin Gong
- School of Software, Shandong University, Jinan 250101, China
- Lushan Wang
- State Key Laboratory of Microbial Technology, Shandong University, Qingdao 266237, China

26
Lupo U, Sgarbossa D, Bitbol AF. Protein language models trained on multiple sequence alignments learn phylogenetic relationships. Nat Commun 2022; 13:6298. [PMID: 36273003 PMCID: PMC9588007 DOI: 10.1038/s41467-022-34032-y] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 04/08/2022] [Accepted: 10/07/2022] [Indexed: 12/25/2022] Open
Abstract
Self-supervised neural language models with attention have recently been applied to biological sequence data, advancing structure, function and mutational effect prediction. Some protein language models, including MSA Transformer and AlphaFold's EvoFormer, take multiple sequence alignments (MSAs) of evolutionarily related proteins as inputs. Simple combinations of MSA Transformer's row attentions have led to state-of-the-art unsupervised structural contact prediction. We demonstrate that similarly simple, and universal, combinations of MSA Transformer's column attentions strongly correlate with Hamming distances between sequences in MSAs. Therefore, MSA-based language models encode detailed phylogenetic relationships. We further show that these models can separate coevolutionary signals encoding functional and structural constraints from phylogenetic correlations reflecting historical contingency. To assess this, we generate synthetic MSAs, either without or with phylogeny, from Potts models trained on natural MSAs. We find that unsupervised contact prediction is substantially more resilient to phylogenetic noise when using MSA Transformer versus inferred Potts models.
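As a rough illustration of the quantity the abstract above compares column attentions against, the sketch below computes pairwise Hamming distances between the rows of a small alignment. The function names, the toy alignment, the normalization by alignment length, and the treatment of a gap as an ordinary symbol are all assumptions made for this example; the abstract does not specify them.

```python
from itertools import combinations


def hamming_distance(a: str, b: str) -> float:
    """Fraction of aligned columns at which two equal-length sequences differ.

    Assumption: gaps ('-') are compared like any other symbol.
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    if not a:
        raise ValueError("sequences must be non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_hamming(msa: list[str]) -> dict[tuple[int, int], float]:
    """Normalized Hamming distance for every unordered pair of MSA rows."""
    return {
        (i, j): hamming_distance(msa[i], msa[j])
        for i, j in combinations(range(len(msa)), 2)
    }


# Hypothetical three-row alignment, used only to exercise the functions.
toy_msa = ["MKV-LA", "MKVALA", "MRV-LG"]
distances = pairwise_hamming(toy_msa)
for pair, d in sorted(distances.items()):
    print(pair, round(d, 3))
```

A matrix of such distances is the kind of simple phylogenetic proxy against which a learned representation could be correlated.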
Affiliation(s)
- Umberto Lupo
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland
- Damiano Sgarbossa
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland
- Anne-Florence Bitbol
- Institute of Bioengineering, School of Life Sciences, École Polytechnique Fédérale de Lausanne (EPFL), CH-1015, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, CH-1015, Lausanne, Switzerland

27
Colberg M, Schofield J. Configurational entropy, transition rates, and optimal interactions for rapid folding in coarse-grained model proteins. J Chem Phys 2022; 157:125101. [PMID: 36182418 DOI: 10.1063/5.0098612] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/14/2022] Open
Abstract
Under certain conditions, the dynamics of coarse-grained models of solvated proteins can be described using a Markov state model, which tracks the evolution of populations of configurations. The transition rates among states that appear in the Markov model can be determined by computing the relative entropy of states and their mean first passage times. In this paper, we present an adaptive method to evaluate the configurational entropy and the mean first passage times for linear chain models with discontinuous potentials. The approach is based on event-driven dynamical sampling in a massively parallel architecture. Using the fact that the transition rate matrix can be calculated for any choice of interaction energies at any temperature, it is demonstrated how each state's energy can be chosen such that the average time to transition between any two states is minimized. The methods are used to analyze the optimization of the folding process of two protein systems: the crambin protein and a model with frustration and misfolding. It is shown that the folding pathways for both systems are comprised of two regimes: first, the rapid establishment of local bonds, followed by the subsequent formation of more distant contacts. The state energies that lead to the most rapid folding encourage multiple pathways, and they either penalize folding pathways through kinetic traps by raising the energies of trapping states or establish an escape route from the trapping states by lowering free energy barriers to other states that rapidly reach the native state.
Affiliation(s)
- Margarita Colberg
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada
- Jeremy Schofield
- Chemical Physics Theory Group, Department of Chemistry, University of Toronto, Toronto, Ontario M5S 3H6, Canada

28
Feinauer C, Meynard-Piganeau B, Lucibello C. Interpretable pairwise distillations for generative protein sequence models. PLoS Comput Biol 2022; 18:e1010219. [PMID: 35737722 PMCID: PMC9258900 DOI: 10.1371/journal.pcbi.1010219] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2021] [Revised: 07/06/2022] [Accepted: 05/17/2022] [Indexed: 11/25/2022] Open
Abstract
Many different types of generative models for protein sequences have been proposed in the literature. Their uses include the prediction of mutational effects, protein design and the prediction of structural properties. Neural network (NN) architectures have shown strong performance, commonly attributed to their capacity to extract non-trivial higher-order interactions from the data. In this work, we analyze two different NN models and assess how close they are to simple pairwise distributions, which have been used in the past for similar problems. We present an approach for extracting pairwise models from more complex ones using an energy-based modeling framework. We show that for the tested models the extracted pairwise models can replicate the energies of the original models and are also close in performance in tasks like mutational effect prediction. In addition, we show that even simpler, factorized models often come close in performance to the original models.

Complex neural networks trained on large biological datasets have recently shown powerful capabilities in tasks like the prediction of protein structure, assessing the effect of mutations on the fitness of proteins and even designing completely novel proteins with desired characteristics. The enthralling prospect of leveraging these advances in fields like medicine and synthetic biology has created a large amount of interest in academic research and industry. The connected question of what biological insights these methods actually gain during training has, however, received less attention. In this work, we systematically investigate to what extent neural networks capture information that could not be captured by simpler models. To this end, we develop a method to train simpler models to imitate more complex models, and compare their performance to the original neural network models. Surprisingly, we find that the simpler models thus trained often perform on par with the neural networks, while having a considerably simpler structure. This highlights the importance of finding ways to interpret the predictions of neural networks in these fields, which could inform the creation of better models, improve methods for their assessment and ultimately also increase our understanding of the underlying biology.
Affiliation(s)
- Christoph Feinauer
- Department of Computing Sciences, Bocconi University, Milan, Italy
- Bocconi Institute for Data Science and Analytics (BIDSA), Milan, Italy
- Barthelemy Meynard-Piganeau
- Laboratory of Computational and Quantitative Biology (LCQB) UMR 7238 CNRS, Sorbonne Université, Paris, France
- Department of Applied Science and Technologies (DISAT), Politecnico di Torino, Turin, Italy
- Carlo Lucibello
- Department of Computing Sciences, Bocconi University, Milan, Italy
- Bocconi Institute for Data Science and Analytics (BIDSA), Milan, Italy

29
Freschlin CR, Fahlberg SA, Romero PA. Machine learning to navigate fitness landscapes for protein engineering. Curr Opin Biotechnol 2022; 75:102713. [PMID: 35413604 PMCID: PMC9177649 DOI: 10.1016/j.copbio.2022.102713] [Citation(s) in RCA: 46] [Impact Index Per Article: 15.3] [Received: 10/19/2021] [Revised: 01/05/2022] [Accepted: 02/28/2022] [Indexed: 11/19/2022]
Abstract
Machine learning (ML) is revolutionizing our ability to understand and predict the complex relationships between protein sequence, structure, and function. Predictive sequence-function models are enabling protein engineers to efficiently search the sequence space for useful proteins with broad applications in biotechnology. In this review, we highlight the recent advances in applying ML to protein engineering. We discuss supervised learning methods that infer the sequence-function mapping from experimental data and new sequence representation strategies for data-efficient modeling. We then describe the various ways in which ML can be incorporated into protein engineering workflows, including purely in silico searches, ML-assisted directed evolution, and generative models that can learn the underlying distribution of the protein function in a sequence space. ML-driven protein engineering will become increasingly powerful with continued advances in high-throughput data generation, data science, and deep learning.
Affiliation(s)
- Chase R Freschlin
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Sarah A Fahlberg
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA
- Philip A Romero
- Department of Biochemistry, University of Wisconsin-Madison, Madison, WI, USA; Department of Chemical & Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA

30
Detlefsen NS, Hauberg S, Boomsma W. Learning meaningful representations of protein sequences. Nat Commun 2022; 13:1914. [PMID: 35395843 PMCID: PMC8993921 DOI: 10.1038/s41467-022-29443-w] [Citation(s) in RCA: 61] [Impact Index Per Article: 20.3] [Received: 11/25/2020] [Accepted: 03/15/2022] [Indexed: 01/27/2023] Open
Abstract
How we choose to represent our data has a fundamental impact on our ability to subsequently extract information from them. Machine learning promises to automatically determine efficient representations from large unstructured datasets, such as those arising in biology. However, empirical evidence suggests that seemingly minor changes to these machine learning models yield drastically different data representations that result in different biological interpretations of data. This begs the question of what even constitutes the most meaningful representation. Here, we approach this question for representations of protein sequences, which have received considerable attention in the recent literature. We explore two key contexts in which representations naturally arise: transfer learning and interpretable learning. In the first context, we demonstrate that several contemporary practices yield suboptimal performance, and in the latter we demonstrate that taking representation geometry into account significantly improves interpretability and lets the models reveal biological information that is otherwise obscured.
Affiliation(s)
- Søren Hauberg
- Section for Cognitive Systems, Technical University of Denmark, Kgs. Lyngby, Denmark
- Wouter Boomsma
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark

31
Enhancing computational enzyme design by a maximum entropy strategy. Proc Natl Acad Sci U S A 2022; 119:2122355119. [PMID: 35135886 PMCID: PMC8851541 DOI: 10.1073/pnas.2122355119] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Accepted: 01/03/2022] [Indexed: 01/16/2023] Open
Abstract
Although computational enzyme design is of great importance, the advances utilizing physics-based approaches have been slow, and further progress is urgently needed. One promising direction is using machine learning, but such strategies have not been established as effective tools for predicting the catalytic power of enzymes. Here, we show that the statistical energy inferred from homologous sequences with the maximum entropy (MaxEnt) principle significantly correlates with enzyme catalysis and stability at the active site region and the more distant region, respectively. This finding decodes enzyme architecture and offers a connection between enzyme evolution and the physical chemistry of enzyme catalysis, and it deepens our understanding of the stability-activity trade-off hypothesis for enzymes. Overall, the strong correlations found here provide a powerful way of guiding enzyme design.

32
Giessel A, Dousis A, Ravichandran K, Smith K, Sur S, McFadyen I, Zheng W, Licht S. Therapeutic enzyme engineering using a generative neural network. Sci Rep 2022; 12:1536. [PMID: 35087131 PMCID: PMC8795449 DOI: 10.1038/s41598-022-05195-x] [Citation(s) in RCA: 19] [Impact Index Per Article: 6.3] [Received: 08/10/2021] [Accepted: 12/15/2021] [Indexed: 12/31/2022] Open
Abstract
Enhancing the potency of mRNA therapeutics is an important objective for treating rare diseases, since it may enable lower and less-frequent dosing. Enzyme engineering can increase potency of mRNA therapeutics by improving the expression, half-life, and catalytic efficiency of the mRNA-encoded enzymes. However, sequence space is incomprehensibly vast, and methods to map sequence to function (computationally or experimentally) are inaccurate or time-/labor-intensive. Here, we present a novel, broadly applicable engineering method that combines deep latent variable modelling of sequence co-evolution with automated protein library design and construction to rapidly identify metabolic enzyme variants that are both more thermally stable and more catalytically active. We apply this approach to improve the potency of ornithine transcarbamylase (OTC), a urea cycle enzyme for which loss of catalytic activity causes a rare but serious metabolic disease.
Affiliation(s)
- Andrew Giessel
- Moderna Therapeutics, 200 Technology Square, Cambridge, MA, 02139, USA
- Athanasios Dousis
- Moderna Therapeutics, 200 Technology Square, Cambridge, MA, 02139, USA
- Kevin Smith
- Moderna Therapeutics, 200 Technology Square, Cambridge, MA, 02139, USA
- Sreyoshi Sur
- Moderna Therapeutics, 200 Technology Square, Cambridge, MA, 02139, USA
- Iain McFadyen
- Moderna Therapeutics, 200 Technology Square, Cambridge, MA, 02139, USA
- Wei Zheng
- Moderna Therapeutics, 200 Technology Square, Cambridge, MA, 02139, USA
- Stuart Licht
- Moderna Therapeutics, 200 Technology Square, Cambridge, MA, 02139, USA

33
AIM in Genomic Basis of Medicine: Applications. Artif Intell Med 2022. [DOI: 10.1007/978-3-030-64573-1_264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]

34
Cadet XF, Gelly JC, van Noord A, Cadet F, Acevedo-Rocha CG. Learning Strategies in Protein Directed Evolution. Methods Mol Biol 2022; 2461:225-275. [PMID: 35727454 DOI: 10.1007/978-1-0716-2152-3_15] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 12/13/2022]
Abstract
Synthetic biology is a fast-evolving research field that combines biology and engineering principles to develop new biological systems for medical, pharmacological, and industrial applications. Synthetic biologists use iterative "design, build, test, and learn" cycles to efficiently engineer genetic systems that are reliable, reproducible, and predictable. Protein engineering by directed evolution can benefit from such a systematic engineering approach for various reasons. Learning can be carried out before starting, throughout or after finalizing a directed evolution project. Computational tools, bioinformatics, and scanning mutagenesis methods can be excellent starting points, while molecular dynamics simulations and other strategies can guide engineering efforts. Similarly, studying protein intermediates along evolutionary pathways offers fascinating insights into the molecular mechanisms shaped by evolution. The learning step of the cycle is not only crucial for proteins or enzymes that are not suitable for high-throughput screening or selection systems, but it is also valuable for any platform that generates large amounts of data whose analysis can be aided by machine learning algorithms. The main challenge in protein engineering is to predict the effect of a single mutation on one functional parameter, to say nothing of several mutations on multiple parameters. This is largely due to nonadditive mutational interactions, known as epistatic effects: mutations that are beneficial in one genetic background may not be beneficial in another. In this work, we provide an overview of experimental and computational strategies that can guide the user to learn protein function at different stages in a directed evolution project. We also discuss how epistatic effects can influence the success of directed evolution projects. Since machine learning is gaining momentum in protein engineering and the field is becoming more interdisciplinary thanks to collaboration between mathematicians, computational scientists, engineers, molecular biologists, and chemists, we provide a general workflow that familiarizes nonexperts with the basic concepts, dataset requirements, learning approaches, model capabilities and performance metrics of this intriguing area. Finally, we also provide some practical recommendations on how machine learning can harness epistatic effects for engineering proteins in an "outside-the-box" way.
Affiliation(s)
- Xavier F Cadet
- PEACCEL, Artificial Intelligence Department, Paris, France
- Jean Christophe Gelly
- Laboratoire d'Excellence GR-Ex, Paris, France
- BIGR, DSIMB, UMR_S1134, INSERM, University of Paris & University of Reunion, Paris, France
- Frédéric Cadet
- Laboratoire d'Excellence GR-Ex, Paris, France
- BIGR, DSIMB, UMR_S1134, INSERM, University of Paris & University of Reunion, Paris, France

35
Deep generative modeling for protein design. Curr Opin Struct Biol 2021; 72:226-236. [PMID: 34963082 DOI: 10.1016/j.sbi.2021.11.008] [Citation(s) in RCA: 45] [Impact Index Per Article: 11.3] [Received: 08/31/2021] [Revised: 11/01/2021] [Accepted: 11/22/2021] [Indexed: 11/21/2022]
Abstract
Deep learning approaches have produced substantial breakthroughs in fields such as image classification and natural language processing and are making rapid inroads in the area of protein design. Many generative models of proteins have been developed that encompass all known protein sequences, model specific protein families, or extrapolate the dynamics of individual proteins. Those generative models can learn protein representations that are often more informative of protein structure and function than hand-engineered features. Furthermore, they can be used to quickly propose millions of novel proteins that resemble the native counterparts in terms of expression level, stability, or other attributes. The protein design process can further be guided by discriminative oracles to select candidates with the highest probability of having the desired properties. In this review, we discuss five classes of generative models that have been most successful at modeling proteins and provide a framework for model guided protein design.

36
McGee F, Hauri S, Novinger Q, Vucetic S, Levy RM, Carnevale V, Haldane A. The generative capacity of probabilistic protein sequence models. Nat Commun 2021; 12:6302. [PMID: 34728624 PMCID: PMC8563988 DOI: 10.1038/s41467-021-26529-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 01/11/2021] [Accepted: 09/23/2021] [Indexed: 01/10/2023] Open
Abstract
Potts models and variational autoencoders (VAEs) have recently gained popularity as generative protein sequence models (GPSMs) to explore fitness landscapes and predict mutation effects. Despite encouraging results, current model evaluation metrics leave unclear whether GPSMs faithfully reproduce the complex multi-residue mutational patterns observed in natural sequences due to epistasis. Here, we develop a set of sequence statistics to assess the "generative capacity" of three current GPSMs: the pairwise Potts Hamiltonian, the VAE, and the site-independent model. We show that the Potts model's generative capacity is largest, as the higher-order mutational statistics generated by the model agree with those observed for natural sequences, while the VAE's lies between the Potts and site-independent models. Importantly, our work provides a new framework for evaluating and interpreting GPSM accuracy which emphasizes the role of higher-order covariation and epistasis, with broader implications for probabilistic sequence models in general.
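For orientation, the pairwise Potts Hamiltonian named in the abstract above is conventionally written as follows. This is the standard textbook form rather than a formula quoted from the paper, and the symbols (sequence length $L$, fields $h_i$, couplings $J_{ij}$, partition function $Z$) are notation assumed here.

```latex
\[
  H(\sigma) \;=\; -\sum_{i=1}^{L} h_i(\sigma_i)
            \;-\; \sum_{1 \le i < j \le L} J_{ij}(\sigma_i, \sigma_j),
  \qquad
  P(\sigma) \;=\; \frac{e^{-H(\sigma)}}{Z},
  \qquad
  Z \;=\; \sum_{\sigma'} e^{-H(\sigma')} .
\]
```

Setting every coupling $J_{ij}$ to zero leaves only the single-site fields, which recovers a site-independent model of the kind the abstract uses as its baseline.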
Affiliation(s)
- Francisco McGee
- Center for Biophysics and Computational Biology, Temple University, Philadelphia, 19122, USA
- Institute for Computational Molecular Science, Temple University, Philadelphia, 19122, USA
- Department of Biology, Temple University, Philadelphia, 19122, USA
- Sandro Hauri
- Center for Hybrid Intelligence, Temple University, Philadelphia, 19122, USA
- Department of Computer & Information Sciences, Temple University, Philadelphia, 19122, USA
- Quentin Novinger
- Institute for Computational Molecular Science, Temple University, Philadelphia, 19122, USA
- Department of Computer & Information Sciences, Temple University, Philadelphia, 19122, USA
- Slobodan Vucetic
- Center for Hybrid Intelligence, Temple University, Philadelphia, 19122, USA
- Department of Computer & Information Sciences, Temple University, Philadelphia, 19122, USA
- Ronald M Levy
- Center for Biophysics and Computational Biology, Temple University, Philadelphia, 19122, USA
- Department of Biology, Temple University, Philadelphia, 19122, USA
- Department of Physics, Temple University, Philadelphia, 19122, USA
- Department of Chemistry, Temple University, Philadelphia, 19122, USA
- Vincenzo Carnevale
- Institute for Computational Molecular Science, Temple University, Philadelphia, 19122, USA
- Department of Biology, Temple University, Philadelphia, 19122, USA
- Allan Haldane
- Center for Biophysics and Computational Biology, Temple University, Philadelphia, 19122, USA
- Department of Chemistry, Temple University, Philadelphia, 19122, USA

37
Al-Saggaf UM, Usman M, Naseem I, Moinuddin M, Jiman AA, Alsaggaf MU, Alshoubaki HK, Khan S. ECM-LSE: Prediction of Extracellular Matrix Proteins Using Deep Latent Space Encoding of k-Spaced Amino Acid Pairs. Front Bioeng Biotechnol 2021; 9:752658. [PMID: 34722479 PMCID: PMC8552119 DOI: 10.3389/fbioe.2021.752658] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/03/2021] [Accepted: 09/13/2021] [Indexed: 12/26/2022] Open
Abstract
Extracellular matrix (ECM) proteins create complex networks of macromolecules that fill the extracellular spaces of living tissues. They provide structural support and play an important role in maintaining cellular functions. Identification of ECM proteins can play a vital role in studying various types of diseases. Conventional wet lab-based methods are reliable; however, they are expensive and time-consuming and are, therefore, not scalable. In this research, we propose a novel sequence-based machine learning approach for the prediction of ECM proteins. In the proposed method, composition of k-spaced amino acid pair (CKSAAP) features are encoded into a classifiable latent space (LS) with the help of deep latent space encoding (LSE). A comprehensive ablation analysis is conducted for performance evaluation of the proposed method. Results are compared with other state-of-the-art methods on the benchmark dataset, and the proposed ECM-LSE approach is shown to comprehensively outperform the contemporary methods.
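The CKSAAP features mentioned in the abstract above can be sketched as follows: for each gap size k, count how often each ordered pair of amino acids occurs k positions apart and normalize the counts. This is a minimal sketch of the general descriptor, not the authors' pipeline; the function name `cksaap`, the default `k_max=2`, skipping pairs that contain non-standard residues, and normalizing by the number of counted pairs are all assumptions for illustration.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered residue pairs, in a fixed order that defines the feature layout.
PAIRS = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]


def cksaap(sequence: str, k_max: int = 2) -> list[float]:
    """Composition of k-spaced amino acid pairs for k = 0..k_max.

    Returns 400 * (k_max + 1) frequencies. Pairs containing a character
    outside the 20 standard residues are skipped (an assumption).
    """
    features: list[float] = []
    for k in range(k_max + 1):
        counts = dict.fromkeys(PAIRS, 0)
        n_pairs = 0
        # Residues i and i + k + 1 are separated by exactly k other residues.
        for i in range(len(sequence) - k - 1):
            pair = sequence[i] + sequence[i + k + 1]
            if pair in counts:
                counts[pair] += 1
                n_pairs += 1
        features.extend(counts[p] / n_pairs if n_pairs else 0.0 for p in PAIRS)
    return features


# Toy sequence, used only to show the layout of the output vector.
vec = cksaap("ACAC", k_max=1)
print(len(vec), vec[PAIRS.index("AC")], vec[400 + PAIRS.index("AA")])
```

A vector of this kind would then be the input to whatever encoder maps features into a latent space; that encoder is outside the scope of this sketch.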
Affiliation(s)
- Ubaid M. Al-Saggaf
- Center of Excellence in Intelligent Engineering Systems, King Abdulaziz University, Jeddah, Saudi Arabia
- Electrical and Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Muhammad Usman
- Department of Computer Engineering, Chosun University, Gwangju, South Korea
- Imran Naseem
- Research and Development, Love For Data, Karachi, Pakistan
- School of Electrical, Electronic and Computer Engineering, The University of Western Australia, Perth, WA, Australia
- College of Engineering, Karachi Institute of Economics and Technology, Korangi Creek, Karachi, Pakistan
- Muhammad Moinuddin
- Center of Excellence in Intelligent Engineering Systems, King Abdulaziz University, Jeddah, Saudi Arabia
- Electrical and Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Ahmad A. Jiman
- Electrical and Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Mohammed U. Alsaggaf
- Center of Excellence in Intelligent Engineering Systems, King Abdulaziz University, Jeddah, Saudi Arabia
- Department of Radiology, Faculty of Medicine, King Abdulaziz University, Jeddah, Saudi Arabia
- Hitham K. Alshoubaki
- Center of Excellence in Intelligent Engineering Systems, King Abdulaziz University, Jeddah, Saudi Arabia
- Electrical and Computer Engineering Department, King Abdulaziz University, Jeddah, Saudi Arabia
- Shujaat Khan
- Department of Bio and Brain Engineering, Daejeon, South Korea

38
Bitard-Feildel T. Navigating the amino acid sequence space between functional proteins using a deep learning framework. PeerJ Comput Sci 2021; 7:e684. [PMID: 34616884 PMCID: PMC8459775 DOI: 10.7717/peerj-cs.684] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2021] [Accepted: 07/30/2021] [Indexed: 06/13/2023]
Abstract
MOTIVATION: Shedding light on the relationships between protein sequences and functions is a challenging task with many implications for protein evolution, disease understanding, and protein design. The mapping of protein sequence space to specific functions is, however, hard to comprehend due to its complexity. Generative models help to decipher complex systems thanks to their ability to learn and recreate data specificity. Applied to proteins, they can capture the sequence patterns associated with functions and point out important relationships between sequence positions. By learning these dependencies between sequences and functions, they can ultimately be used to generate new sequences and navigate through uncharted areas of molecular evolution.

RESULTS: This study presents an Adversarial Auto-Encoder (AAE) approach, an unsupervised generative model, to generate new protein sequences. AAEs are tested on three protein families known for their multiple functions: the sulfatase, HUP, and TPP families. Clustering results on the encoded sequences from the latent space computed by AAEs display a high level of homogeneity regarding the protein sequence functions. The study also reports and analyzes for the first time two sampling strategies, based on latent space interpolation and latent space arithmetic, to generate intermediate protein sequences sharing sequential properties of original sequences linked to known functional properties issued from different families and functions. Sequences generated by interpolation between latent space data points demonstrate the ability of the AAE to generalize and produce meaningful biological sequences from an evolutionarily uncharted area of the biological sequence space. Finally, 3D structure models computed by comparative modelling using generated sequences and templates of different sub-families point to the ability of latent space arithmetic to successfully transfer protein sequence properties linked to function between different sub-families. All in all, this study confirms the ability of deep learning frameworks to model biological complexity and bring new tools to explore amino acid sequence and functional spaces.
Affiliation(s)
- Tristan Bitard-Feildel
- IBPS, CNRS, Laboratoire de Biologie Computationnelle et Quantitative, Sorbonne Université, Paris, France
- Institut des Sciences du Calcul et des Données (ISCD), Sorbonne Université, Paris, France
| |
Collapse
|
39
|
Michael E, Simonson T. How much can physics do for protein design? Curr Opin Struct Biol 2021; 72:46-54. [PMID: 34461593 DOI: 10.1016/j.sbi.2021.07.011] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/12/2021] [Revised: 07/22/2021] [Accepted: 07/25/2021] [Indexed: 01/03/2023]
Abstract
Physics and physical chemistry are an important thread in computational protein design, complementary to knowledge-based tools. They provide molecular mechanics scoring functions that need little or no ad hoc parameter readjustment, methods to thoroughly sample equilibrium ensembles, and different levels of approximation for conformational flexibility. They led recently to the successful redesign of a small protein using a physics-based folded state energy. Adaptive Monte Carlo or molecular dynamics schemes were discovered where protein variants are populated as per their ligand-binding free energy or catalytic efficiency. Molecular dynamics have been used for backbone flexibility. Implicit solvent models have been refined, polarizable force fields applied, and many physical insights obtained.
Affiliation(s)
- Eleni Michael
- Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, 91128, Palaiseau, France
- Thomas Simonson
- Laboratoire de Biologie Structurale de la Cellule (CNRS UMR7654), Ecole Polytechnique, 91128, Palaiseau, France
|
40
|
Siedhoff NE, Illig AM, Schwaneberg U, Davari MD. PyPEF-An Integrated Framework for Data-Driven Protein Engineering. J Chem Inf Model 2021; 61:3463-3476. [PMID: 34260225 DOI: 10.1021/acs.jcim.1c00099] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 12/12/2022]
Abstract
Data-driven strategies are gaining increased attention in protein engineering due to recent advances in access to large experimental databanks of proteins, next-generation sequencing (NGS), high-throughput screening (HTS) methods, and the development of artificial intelligence algorithms. However, the reliable prediction of beneficial amino acid substitutions, their combination, and the effect on functional properties remain the most significant challenges in protein engineering, which is applied to develop proteins and enzymes for biocatalysis, biomedicine, and life sciences. Here, we present a general-purpose framework (PyPEF: pythonic protein engineering framework) for performing data-driven protein engineering using machine learning methods combined with techniques from signal processing and statistical physics. PyPEF guides the identification and selection of beneficial proteins of a defined sequence space by systematically or randomly exploring the fitness of variants and by sampling random evolution pathways. The performance of PyPEF was evaluated concerning its predictive accuracy and throughput on four public protein and enzyme data sets using common regression models. It was proved that the program could efficiently predict the fitness of protein sequences for different target properties (predictive models with coefficient of determination values ranging from 0.58 to 0.92). By combining machine learning and protein evolution, PyPEF enabled the screening of proteins with various functions, reaching a screening capacity of more than 500,000 protein sequence variants in the timeframe of only a few minutes on a personal computer. 
PyPEF displayed significant accuracies on four public data sets (different proteins and properties) and underlined the potential of integrating data-driven technologies for covering different philosophies by either predicting the fitness of the variants to the highest accuracy accounting for epistatic effects or capturing the general trend of introduced mutations on the fitness in directed protein evolution campaigns. In essence, PyPEF can provide a powerful solution to current sequence exploration and combinatorial problems faced in protein engineering through exhaustive in silico screening of the sequence space.
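The abstract reports predictive performance as coefficient of determination values (0.58 to 0.92). As a reminder of what that number measures, here is a minimal sketch of the standard definition, R² = 1 − SS_res / SS_tot. Whether PyPEF computes it in exactly this way is not stated in the abstract; the function name `r_squared` and the toy fitness values are illustrative assumptions.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


# Hypothetical measured and predicted fitness values for four variants.
measured = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
score = r_squared(measured, predicted)  # SS_res = 0.10, SS_tot = 5.0, so R^2 = 0.98
```

A value of 1.0 means the predictions reproduce the measurements exactly, while 0.0 means they do no better than always predicting the mean measured fitness.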
Affiliation(s)
- Niklas E Siedhoff
- Institute of Biotechnology, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
- Ulrich Schwaneberg
- Institute of Biotechnology, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany; DWI-Leibniz Institute for Interactive Materials, Forckenbeckstraße 50, 52074 Aachen, Germany
- Mehdi D Davari
- Institute of Biotechnology, RWTH Aachen University, Worringer Weg 3, 52074 Aachen, Germany
|
41
|
Bepler T, Berger B. Learning the protein language: Evolution, structure, and function. Cell Syst 2021; 12:654-669.e3. [PMID: 34139171 PMCID: PMC8238390 DOI: 10.1016/j.cels.2021.05.017] [Citation(s) in RCA: 215] [Impact Index Per Article: 53.8] [Received: 01/16/2021] [Revised: 05/20/2021] [Accepted: 05/20/2021] [Indexed: 02/06/2023]
Abstract
Language models have recently emerged as a powerful machine-learning approach for distilling information from massive protein sequence databases. From readily available sequence data alone, these models discover evolutionary, structural, and functional organization across protein space. Using language models, we can encode amino-acid sequences into distributed vector representations that capture their structural and functional properties, as well as evaluate the evolutionary fitness of sequence variants. We discuss recent advances in protein language modeling and their applications to downstream protein property prediction problems. We then consider how these models can be enriched with prior biological knowledge and introduce an approach for encoding protein structural knowledge into the learned representations. The knowledge distilled by these models allows us to improve downstream function prediction through transfer learning. Deep protein language models are revolutionizing protein biology. They suggest new ways to approach protein and therapeutic design. However, further developments are needed to encode strong biological priors into protein language models and to increase their accessibility to the broader community.
Affiliation(s)
- Tristan Bepler
- Simons Machine Learning Center, New York Structural Biology Center, New York, NY, USA; Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA; Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, MA, USA
- Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
|
42
|
Osadchy M, Kolodny R. How Deep Learning Tools Can Help Protein Engineers Find Good Sequences. J Phys Chem B 2021; 125:6440-6450. [PMID: 34105961 DOI: 10.1021/acs.jpcb.1c02449] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 12/16/2022]
Abstract
The deep learning revolution introduced a new and efficacious way to address computational challenges in a wide range of fields, relying on large data sets and powerful computational resources. In protein engineering, we consider the challenge of computationally predicting properties of a protein and designing sequences with these properties. Indeed, accurate and fast deep network oracles for different properties of proteins have been developed. These learn to predict a property from an amino acid sequence by training on large sets of proteins that have this property. In particular, deep networks can learn from the set of all known protein sequences to identify ones that are protein-like. A fundamental challenge when engineering sequences that are both protein-like and satisfy a desired property is that these are rare instances within the vast space of all possible ones. When searching for these very rare instances, one would like to use good sampling procedures. Sampling approaches that are decoupled from the prediction of the property or in which the predictor uses only post-sampling to identify good instances are less efficient. The alternative is to use sampling methods that are geared to generate sequences satisfying and/or optimizing the predictor's desired properties. Deep learning has a class of architectures, denoted as generative models, which offer the capability of sampling from the learned distribution of a predicted property. Here, we review the use of deep learning tools to find good sequences for protein engineering, including developing oracles/predictors of a property of the proteins and methods that sample from a distribution of protein-like sequences to optimize the desired property.
Affiliation(s)
- Margarita Osadchy
- Department of Computer Science, Jacobs Building, University of Haifa, 199 Aba Houshi Road, Mount Carmel, Haifa, Israel 3498838
- Rachel Kolodny
- Department of Computer Science, Jacobs Building, University of Haifa, 199 Aba Houshi Road, Mount Carmel, Haifa, Israel 3498838
|
43
|
Hayes RL, Brooks CL. A strategy for proline and glycine mutations to proteins with alchemical free energy calculations. J Comput Chem 2021; 42:1088-1094. [PMID: 33844328 DOI: 10.1002/jcc.26525] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 09/27/2020] [Revised: 03/03/2021] [Accepted: 03/05/2021] [Indexed: 11/07/2022]
Abstract
Computation of the thermodynamic consequences of protein mutations holds great promise in protein biophysics and design. Alchemical free energy methods can give improved estimates of mutational free energies, and are already widely used in calculations of relative and absolute binding free energies in small molecule design problems. In principle, alchemical methods can address any amino acid mutation with an appropriate alchemical pathway, but identifying a strategy that produces such a path for proline and glycine mutations is an ongoing challenge. Most current strategies perturb only side chain atoms, while proline and glycine mutations also alter the backbone parameters and backbone ring topology. Some strategies also perturb backbone parameters and enable glycine mutations. This work presents a strategy that enables both proline and glycine mutations and comprises two key elements: a dual backbone with restraints and scaling of bonded terms, facilitating backbone parameter changes, and a soft bond in the proline ring, enabling ring topology changes in proline mutations. These elements also have utility for core hopping and macrocycle studies in computer-aided drug design. This new strategy shows slight improvements over an alternative side chain perturbation strategy for a set of T4 lysozyme mutations lacking proline and glycine, and yields good agreement with experiment for a set of T4 lysozyme proline and glycine mutations not previously studied. To our knowledge, this is the first report comparing alchemical predictions of proline mutations with experiment. With this strategy in hand, alchemical methods now have access to the full palette of amino acid mutations.
Affiliation(s)
- Ryan L Hayes
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan, USA
- Charles L Brooks
- Department of Chemistry, University of Michigan, Ann Arbor, Michigan, USA; Biophysics Program, University of Michigan, Ann Arbor, Michigan, USA
|
44
|
Smerlak M. Quasi-species evolution maximizes genotypic reproductive value (not fitness or flatness). J Theor Biol 2021; 522:110699. [PMID: 33794289 DOI: 10.1016/j.jtbi.2021.110699] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/16/2020] [Revised: 02/23/2021] [Accepted: 03/23/2021] [Indexed: 10/21/2022]
Abstract
Growing efforts to measure fitness landscapes in molecular and microbial systems are motivated by a longstanding goal to predict future evolutionary trajectories. Sometimes under-appreciated, however, is that the fitness landscape and its topography do not by themselves determine the direction of evolution: under sufficiently high mutation rates, populations can climb the closest fitness peak (survival of the fittest), settle in lower regions with higher mutational robustness (survival of the flattest), or even fail to adapt altogether (error catastrophes). I show that another measure of reproductive success, Fisher's reproductive value, resolves the trade-off between fitness and robustness in the quasi-species regime of evolution: to forecast the motion of a population in genotype space, one should look for peaks in the (mutation-rate dependent) landscape of genotypic reproductive values, whether or not these peaks correspond to local fitness maxima or flat fitness plateaus. This new landscape picture turns quasi-species dynamics into an instance of non-equilibrium dynamics, in the physical sense of Markovian processes, potential landscapes, entropy production, etc.
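For readers unfamiliar with the quantity, reproductive value in a linear growth model is conventionally the dominant left eigenvector of the growth matrix, while the dominant right eigenvector gives the long-run genotype distribution. The sketch below illustrates that textbook relationship on a toy three-genotype mutation-selection matrix. The matrix construction, the fitness values, the mutation rate, and the helper name `dominant_eigvec` are assumptions chosen for illustration and are not taken from the paper.

```python
import numpy as np

# Toy setup (assumed values): three genotypes on a line, the middle one fittest.
fitness = np.array([1.0, 1.5, 1.0])
mu = 0.1  # per-generation probability of mutating to a neighbouring genotype

# Column-stochastic mutation matrix: Q[i, j] = P(offspring is i | parent is j).
Q = np.array([
    [1 - mu, mu / 2, 0.0],
    [mu,     1 - mu, mu],
    [0.0,    mu / 2, 1 - mu],
])

# Mutation-selection growth matrix: reproduce with rate fitness[j], then mutate.
A = Q @ np.diag(fitness)


def dominant_eigvec(matrix):
    """Return the largest real eigenvalue and its eigenvector, scaled to sum to 1."""
    eigvals, eigvecs = np.linalg.eig(matrix)
    k = np.argmax(eigvals.real)
    vec = eigvecs[:, k].real
    return eigvals[k].real, vec / vec.sum()


# Right eigenvector of A: stationary genotype frequencies.
growth_rate, stationary = dominant_eigvec(A)

# Left eigenvector of A (right eigenvector of A.T): genotypic reproductive values.
_, reproductive_value = dominant_eigvec(A.T)
```

Because `A` is non-negative and irreducible, the Perron-Frobenius theorem guarantees that both vectors have strictly positive entries, so dividing by the sum fixes the arbitrary sign returned by the eigensolver.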
Affiliation(s)
- Matteo Smerlak
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany
|
45
|
Ferguson AL, Ranganathan R. 100th Anniversary of Macromolecular Science Viewpoint: Data-Driven Protein Design. ACS Macro Lett 2021; 10:327-340. [PMID: 35549066 DOI: 10.1021/acsmacrolett.0c00885] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 12/13/2022]
Abstract
The design of synthetic proteins with the desired function is a long-standing goal in biomolecular science, with broad applications in biochemical engineering, agriculture, medicine, and public health. Rational de novo design and experimental directed evolution have achieved remarkable successes but are challenged by the requirement to find functional "needles" in the vast "haystack" of protein sequence space. Data-driven models for fitness landscapes provide a predictive map between protein sequence and function and can prospectively identify functional candidates for experimental testing to greatly improve the efficiency of this search. This Viewpoint reviews the applications of machine learning and, in particular, deep learning as part of data-driven protein engineering platforms. We highlight recent successes, review promising computational methodologies, and provide an outlook on future challenges and opportunities. The article is written for a broad audience comprising both polymer and protein scientists and computer and data scientists interested in an up-to-date review of recent innovations and opportunities in this rapidly evolving field.
Affiliation(s)
- Andrew L. Ferguson
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
- Rama Ranganathan
- Pritzker School of Molecular Engineering, University of Chicago, Chicago, Illinois 60637, United States
- Center for Physics of Evolving Systems, University of Chicago, Chicago, Illinois 60637, United States
- Biochemistry and Molecular Biology, University of Chicago, Chicago, Illinois 60637, United States
|
46
|
Kopf A, Claassen M. Latent representation learning in biology and translational medicine. Patterns (N Y) 2021; 2:100198. [PMID: 33748792 PMCID: PMC7961186 DOI: 10.1016/j.patter.2021.100198] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Indexed: 02/08/2023]
Abstract
Current data generation capabilities in the life sciences place scientists in an apparently contradictory situation. While it is possible to simultaneously measure an ever-increasing number of system parameters, the resulting data are becoming increasingly difficult to interpret. Latent variable modeling allows for such interpretation by learning non-measurable hidden variables from observations. This review gives an overview of the different formal approaches to latent variable modeling, as well as applications at different scales of biological systems, from molecular structures through intra- and intercellular regulatory networks up to physiological networks. The focus is on demonstrating how these approaches have enabled interpretable representations and ultimately insights in each of these domains. We anticipate that a wider dissemination of latent variable modeling in the life sciences will enable a more effective and productive interpretation of studies based on heterogeneous and high-dimensional data modalities.
Affiliation(s)
- Andreas Kopf
- Institute of Molecular Systems Biology, ETH Zürich, 8093 Zürich, Switzerland
- Manfred Claassen
- Division of Clinical Bioinformatics, Department of Internal Medicine I, University Hospital Tübingen, 72076 Tübingen, Germany
- Computer Science Department, Eberhard Karls University of Tübingen, 72076 Tübingen, Germany
- Cluster of Excellence Machine Learning (EXC 2064), Eberhard Karls University of Tübingen, 72076 Tübingen, Germany
|
47
|
Wittmann BJ, Johnston KE, Wu Z, Arnold FH. Advances in machine learning for directed evolution. Curr Opin Struct Biol 2021; 69:11-18. [PMID: 33647531 DOI: 10.1016/j.sbi.2021.01.008] [Citation(s) in RCA: 78] [Impact Index Per Article: 19.5] [Received: 11/30/2020] [Revised: 01/09/2021] [Accepted: 01/26/2021] [Indexed: 01/11/2023]
Abstract
Machine learning (ML) can expedite directed evolution by allowing researchers to move expensive experimental screens in silico. Gathering sequence-function data for training ML models, however, can still be costly. In contrast, raw protein sequence data is widely available. Recent advances in ML approaches use protein sequences to augment limited sequence-function data for directed evolution. We highlight contributions in a growing effort to use sequences to reduce or eliminate the amount of sequence-function data needed for effective in silico screening. We also highlight approaches that use ML models trained on sequences to generate new functional sequence diversity, focusing on strategies that use these generative models to efficiently explore vast regions of protein space.
Affiliation(s)
- Bruce J Wittmann
- Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, CA 91125, USA
- Kadina E Johnston
- Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, CA 91125, USA
- Zachary Wu
- Division of Chemistry and Chemical Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, CA 91125, USA; Present address: Google DeepMind, 6 Pancras Square, Kings Cross, London, N1C 4AG, UK
- Frances H Arnold
- Division of Biology and Biological Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, CA 91125, USA; Division of Chemistry and Chemical Engineering, California Institute of Technology, MC 210-41, 1200 E. California Boulevard, Pasadena, CA 91125, USA
|
48
|
Kamada M, Okuno Y. AIM in Genomic Basis of Medicine: Applications. Artif Intell Med 2021. [DOI: 10.1007/978-3-030-58080-3_264-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
|
49
|
Gao W, Mahajan SP, Sulam J, Gray JJ. Deep Learning in Protein Structural Modeling and Design. Patterns (N Y) 2020; 1:100142. [PMID: 33336200 PMCID: PMC7733882 DOI: 10.1016/j.patter.2020.100142] [Citation(s) in RCA: 100] [Impact Index Per Article: 20.0] [Indexed: 12/13/2022]
Abstract
Deep learning is catalyzing a scientific revolution fueled by big data, accessible toolkits, and powerful computational resources, impacting many fields, including protein structural modeling. Protein structural modeling, such as predicting structure from amino acid sequence and evolutionary information, designing proteins toward desirable functionality, or predicting properties or behavior of a protein, is critical to understand and engineer biological systems at the molecular level. In this review, we summarize the recent advances in applying deep learning techniques to tackle problems in protein structural modeling and design. We dissect the emerging approaches using deep learning techniques for protein structural modeling and discuss advances and challenges that must be addressed. We argue for the central importance of structure, following the "sequence → structure → function" paradigm. This review is directed to help both computational biologists to gain familiarity with the deep learning methods applied in protein modeling, and computer scientists to gain perspective on the biologically meaningful problems that may benefit from deep learning techniques.
Affiliation(s)
- Wenhao Gao
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Sai Pooja Mahajan
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Jeremias Sulam
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
- Jeffrey J. Gray
- Department of Chemical and Biomolecular Engineering, Johns Hopkins University, Baltimore, MD 21218, USA
|
50
|
Strokach A, Becerra D, Corbi-Verge C, Perez-Riba A, Kim PM. Fast and Flexible Protein Design Using Deep Graph Neural Networks. Cell Syst 2020; 11:402-411.e4. [DOI: 10.1016/j.cels.2020.08.016] [Citation(s) in RCA: 56] [Impact Index Per Article: 11.2] [Received: 04/02/2020] [Revised: 06/27/2020] [Accepted: 08/26/2020] [Indexed: 11/15/2022]
|