1
Ji D, Frkic RL, Delyami J, Larsen JS, Spence MA, Jackson CJ. A Thermostable Bacterial Metallohydrolase that Degrades Organophosphate Plasticizers. Chembiochem 2025:e2500055. [PMID: 40364453 DOI: 10.1002/cbic.202500055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2025] [Revised: 05/08/2025] [Accepted: 05/12/2025] [Indexed: 05/15/2025]
Abstract
A cyclase-phosphotriesterase (C-PTE) from Ruegeria pomeroyi DSS-3 has recently been identified for its capacity to detoxify several organophosphate compounds. However, several aspects of this enzyme remain unexplored, such as its activity with industrial organophosphates, its molecular structure, and its thermostability. In this work, the crystal structure of C-PTE is reported, which is solved to 2.3 Å resolution, providing insight into the enzyme's mechanism of action, revealing a binuclear Zn²⁺ active site and distant similarity to other phosphotriesterases from the amidohydrolase superfamily. It is shown that C-PTE catalyzes the hydrolysis of the OP plasticizers triphenyl phosphate (TPhP) and tris(2-chloropropyl) phosphate (TCPP), albeit with low efficiency, but not the sterically bulkier tri-o-tolyl phosphate (ToTP). Finally, it is demonstrated that, even though Ruegeria pomeroyi DSS-3 is not a thermophile, C-PTE exhibits remarkable thermostability and retains structure up to 90 °C. Overall, these findings advance the understanding of C-PTE, suggesting that it is a good candidate for engineering owing to its thermostability and that it could contribute to bioremediation strategies to reduce the impact of pollution by industrial organophosphates.
Affiliation(s)
- Dawei Ji
  - Research School of Chemistry, Australian National University, Canberra, ACT, 2601, Australia
- Rebecca L Frkic
  - Research School of Chemistry, Australian National University, Canberra, ACT, 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Australian National University, Canberra, ACT, 2601, Australia
- Javad Delyami
  - Research School of Chemistry, Australian National University, Canberra, ACT, 2601, Australia
- Joachim S Larsen
  - Research School of Chemistry, Australian National University, Canberra, ACT, 2601, Australia
  - ARC Centre of Excellence in Synthetic Biology, Australian National University, Canberra, ACT, 2601, Australia
- Matthew A Spence
  - Research School of Chemistry, Australian National University, Canberra, ACT, 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Australian National University, Canberra, ACT, 2601, Australia
- Colin J Jackson
  - Research School of Chemistry, Australian National University, Canberra, ACT, 2601, Australia
  - ARC Centre of Excellence for Innovations in Peptide & Protein Science, Australian National University, Canberra, ACT, 2601, Australia
  - ARC Centre of Excellence in Synthetic Biology, Australian National University, Canberra, ACT, 2601, Australia
2
Pandey A, Chen W, Keten S. COLOR: A Compositional Linear Operation-Based Representation of Protein Sequences for Identification of Monomer Contributions to Properties. J Chem Inf Model 2025; 65:4320-4333. [PMID: 40272990 DOI: 10.1021/acs.jcim.5c00205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/26/2025]
Abstract
The properties of biological materials like proteins and nucleic acids are largely determined by their primary sequence. Certain segments in the sequence strongly influence specific functions, but identifying these segments, or so-called motifs, is challenging due to the complexity of sequential data. While deep learning (DL) models can accurately capture sequence-property relationships, the degree of nonlinearity in these models limits the assessment of monomer contributions to a property, which is a critical step in identifying key motifs. Recent advances in explainable AI (XAI) offer attention and gradient-based methods for estimating monomeric contributions. However, these methods are primarily applied to classification tasks, such as binding site identification, where they achieve limited accuracy (40-45%) and rely on qualitative evaluations. To address these limitations, we introduce a DL model with interpretable steps, enabling direct tracing of monomeric contributions. Inspired by the masking technique commonly used in vision and natural language processing domains, we propose a new metric (I) for quantitative analysis on datasets mainly containing distinct properties of anticancer peptides (ACP), antimicrobial peptides (AMP), and collagen. Our model exhibits 22% higher explainability than the gradient and attention-based state-of-the-art models, recognizes critical motifs (RRR, RRI, and RSS) that significantly destabilize ACPs, and identifies motifs in AMPs that are 50% more effective in converting non-AMPs to AMPs. These findings highlight the potential of our model in guiding mutation strategies for designing protein-based biomaterials.
Affiliation(s)
- Akash Pandey
  - Department of Mechanical Engineering, Northwestern University, Evanston, Illinois 60208, United States
- Wei Chen
  - Department of Mechanical Engineering, Northwestern University, Evanston, Illinois 60208, United States
- Sinan Keten
  - Department of Mechanical Engineering, Northwestern University, Evanston, Illinois 60208, United States
  - Department of Civil and Environmental Engineering, Northwestern University, Evanston, Illinois 60208, United States
3
Thompson M, Martín M, Olmo TS, Rajesh C, Koo PK, Bolognesi B, Lehner B. Massive experimental quantification allows interpretable deep learning of protein aggregation. Sci Adv 2025; 11:eadt5111. [PMID: 40305601 PMCID: PMC12042874 DOI: 10.1126/sciadv.adt5111] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/29/2024] [Accepted: 03/26/2025] [Indexed: 05/02/2025]
Abstract
Protein aggregation is a pathological hallmark of more than 50 human diseases and a major problem for biotechnology. Methods have been proposed to predict aggregation from sequence, but these have been trained and evaluated on small and biased experimental datasets. Here we directly address this data shortage by experimentally quantifying the aggregation of >100,000 protein sequences. This unprecedented dataset reveals the limited performance of existing computational methods and allows us to train CANYA, a convolution-attention hybrid neural network that accurately predicts aggregation from sequence. We adapt genomic neural network interpretability analyses to reveal CANYA's decision-making process and learned grammar. Our results illustrate the power of massive experimental analysis of random sequence-spaces and provide an interpretable and robust neural network model to predict aggregation.
Affiliation(s)
- Mike Thompson
  - Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain
- Mariano Martín
  - Institute for Bioengineering of Catalonia (IBEC), Barcelona Institute of Science and Technology, Barcelona 08028, Spain
- Trinidad Sanmartín Olmo
  - Institute for Bioengineering of Catalonia (IBEC), Barcelona Institute of Science and Technology, Barcelona 08028, Spain
- Chandana Rajesh
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Peter K. Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY 11724, USA
- Benedetta Bolognesi
  - Institute for Bioengineering of Catalonia (IBEC), Barcelona Institute of Science and Technology, Barcelona 08028, Spain
- Ben Lehner
  - Centre for Genomic Regulation (CRG), Barcelona Institute of Science and Technology, Dr. Aiguader 88, Barcelona 08003, Spain
  - Universitat Pompeu Fabra (UPF), Barcelona 08002, Spain
  - ICREA, Pg. Lluis Companys 23, Barcelona 08010, Spain
  - Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton CB10 1RQ, UK
4
Merdler-Rabinowicz R, Omar M, Ganesh J, Morava E, Nadkarni GN, Klang E. The role of large language models in medical genetics. Mol Genet Metab 2025; 145:109098. [PMID: 40154187 DOI: 10.1016/j.ymgme.2025.109098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/01/2025]
Affiliation(s)
- Mahmud Omar
  - Tel-Aviv University, Faculty of Medicine, Tel-Aviv, Israel
- Jaya Ganesh
  - Department of Genomics and Genetic Sciences, Icahn School of Medicine, New York, NY, USA
- Eva Morava
  - Department of Genomics and Genetic Sciences, Icahn School of Medicine, New York, NY, USA
- Girish N Nadkarni
  - Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Eyal Klang
  - Division of Data-Driven and Digital Medicine (D3M), Icahn School of Medicine at Mount Sinai, New York, NY, USA
5
Magateshvaren Saras MA, Mitra MK, Tyagi S. Navigating the Multiverse: a Hitchhiker's guide to selecting harmonization methods for multimodal biomedical data. Biol Methods Protoc 2025; 10:bpaf028. [PMID: 40308831 PMCID: PMC12043205 DOI: 10.1093/biomethods/bpaf028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2025] [Revised: 03/20/2025] [Accepted: 04/15/2025] [Indexed: 05/02/2025] Open
Abstract
The application of machine learning (ML) techniques in predictive modelling has greatly advanced our comprehension of biological systems. There is a notable shift in the trend towards integration methods that specifically target the simultaneous analysis of multiple modes or types of data, showcasing superior results compared to individual analyses. Despite the availability of diverse ML architectures for researchers interested in embracing a multimodal approach, the current literature lacks a comprehensive taxonomy that includes the pros and cons of these methods to guide the entire process. Closing this gap is imperative, necessitating the creation of a robust framework. This framework should not only categorize the diverse ML architectures suitable for multimodal analysis but also offer insights into their respective advantages and limitations. Additionally, such a framework can serve as a valuable guide for selecting an appropriate workflow for multimodal analysis. This comprehensive taxonomy would provide clear guidance and support informed decision-making within the progressively intricate landscape of biomedical and clinical data analysis. This is an essential step towards advancing personalized medicine. The aims of the work are to comprehensively study and describe the harmonization processes that are performed and reported in the literature and present a working guide that would enable planning and selecting an appropriate integrative model. We present harmonization as a dual process of representation and integration, each with multiple methods and categories. The various representation and integration methods are classified into six broad categories and detailed with their advantages, disadvantages and examples. A guide flowchart describing the step-by-step processes that are needed to adopt a multimodal approach is also presented along with examples and references.
This review provides a thorough taxonomy of methods for harmonizing multimodal data and introduces a foundational 10-step guide for newcomers to implement a multimodal workflow.
Affiliation(s)
- Murali Aadhitya Magateshvaren Saras
  - IITB-Monash Research Academy, Mumbai, Maharashtra 400076, India
  - Department of Physics, Indian Institute of Technology Bombay, Mumbai, Maharashtra 400076, India
  - School of Translational Medicine, Monash University, Melbourne, Victoria 3181, Australia
- Mithun K Mitra
  - Department of Physics, Indian Institute of Technology Bombay, Mumbai, Maharashtra 400076, India
- Sonika Tyagi
  - School of Translational Medicine, Monash University, Melbourne, Victoria 3181, Australia
  - School of Computing Technologies, RMIT University, Melbourne, Victoria 3001, Australia
6
Bjerregaard A, Groth PM, Hauberg S, Krogh A, Boomsma W. Foundation models of protein sequences: A brief overview. Curr Opin Struct Biol 2025; 91:103004. [PMID: 39983412 DOI: 10.1016/j.sbi.2025.103004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2024] [Revised: 01/24/2025] [Accepted: 01/26/2025] [Indexed: 02/23/2025]
Abstract
Protein sequence models have evolved from simple statistics of aligned families to versatile foundation models of evolutionary scale. Enabled by self-supervised learning and an abundance of protein sequence data, such foundation models now play a central role in protein science. They facilitate rich representations, powerful generative design, and fine-tuning across diverse domains. In this review, we trace modeling developments and categorize them into methodological trends over the modalities they describe and the contexts they condition upon. Following a brief historical overview, we focus our attention on the most recent trends and outline future perspectives.
Affiliation(s)
- Andreas Bjerregaard
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
  - Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Peter Mørch Groth
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
  - Novonesis, Kgs. Lyngby, Denmark
- Søren Hauberg
  - Section for Cognitive Systems, Technical University of Denmark, Kgs. Lyngby, Denmark
- Anders Krogh
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
  - Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Wouter Boomsma
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
7
Refahi M, Sokhansanj BA, Mell JC, Brown JR, Yoo H, Hearne G, Rosen GL. Enhancing nucleotide sequence representations in genomic analysis with contrastive optimization. Commun Biol 2025; 8:517. [PMID: 40155693 PMCID: PMC11953366 DOI: 10.1038/s42003-025-07902-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/30/2024] [Accepted: 03/07/2025] [Indexed: 04/01/2025] Open
Abstract
Analysis of genomic and metagenomic sequences is inherently more challenging than that of amino acid sequences due to the higher divergence among evolutionarily related nucleotide sequences, variable k-mer and codon usage within and among genomes of diverse species, and poorly understood selective constraints. We introduce Scorpio (Sequence Contrastive Optimization for Representation and Predictive Inference on DNA), a versatile framework designed for nucleotide sequences that employs contrastive learning to improve embeddings. By leveraging pre-trained genomic language models and k-mer frequency embeddings, Scorpio demonstrates competitive performance in diverse applications, including taxonomic and gene classification, antimicrobial resistance (AMR) gene identification, and promoter detection. A key strength of Scorpio is its ability to generalize to novel DNA sequences and taxa, addressing a significant limitation of alignment-based methods. Scorpio has been tested on multiple datasets with DNA sequences of varying lengths (long and short) and shows robust inference capabilities. Additionally, we provide an analysis of the biological information underlying this representation, including correlations between codon adaptation index as a gene expression factor, sequence similarity, and taxonomy, as well as the functional and structural information of genes.
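The abstract mentions k-mer frequency embeddings as one of the inputs. As a rough illustration of that idea only (not Scorpio's implementation, and with no contrastive training step), the sketch below turns a DNA sequence into a fixed-length vector of normalized k-mer counts. The function name `kmer_frequencies` and the choice to skip k-mers containing non-ACGT symbols are assumptions made for this example.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(seq, k=3):
    """Return normalized counts of all 4**k DNA k-mers, in lexicographic order.

    Hypothetical helper for illustration; k-mers containing symbols other
    than A, C, G, T (for example N) are ignored.
    """
    seq = seq.upper()
    # Fixed vocabulary so every sequence maps to a vector of the same length.
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[kmer] for kmer in vocabulary)
    if total == 0:
        return [0.0] * len(vocabulary)
    return [counts[kmer] / total for kmer in vocabulary]
```

Because the vocabulary is fixed, sequences of any length map to vectors of the same size (16 for k=2, 64 for k=3), which is what makes such vectors usable as input to a downstream embedding model.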
Affiliation(s)
- Bahrad A Sokhansanj
  - Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Joshua C Mell
  - College of Medicine, Drexel University, Philadelphia, PA, USA
- James R Brown
  - Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Hyunwoo Yoo
  - Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Gavin Hearne
  - Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Gail L Rosen
  - Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
8
Chang DH, Richardson JD, Lee MR, Lynn DM, Palecek SP, Van Lehn RC. Machine learning-driven discovery of highly selective antifungal peptides containing non-canonical β-amino acids. Chem Sci 2025; 16:5579-5594. [PMID: 40028619 PMCID: PMC11867109 DOI: 10.1039/d4sc06689h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2024] [Accepted: 02/19/2025] [Indexed: 03/05/2025] Open
Abstract
Antimicrobial peptides (AMPs) are promising compounds for the treatment and prevention of multidrug-resistant infections because of their ability to directly disrupt microbial membranes, a mechanism that is less likely to lead to resistance compared to antibiotics. Unfortunately, natural AMPs are prone to proteolytic cleavage in vivo and have relatively low selectivity for microbial versus human cells, motivating the development of synthetic peptidomimetics of AMPs with improved peptide stability, activity, and selectivity. However, a lack of understanding of structure-activity relationships for peptidomimetics constrains development to rational design or experimental predictors, both of which are cost and time prohibitive, especially when the design space of possible sequences scales exponentially with the number of amino acids. To address these challenges, we developed an iterative Gaussian process regression (GPR) approach to explore a large design space of 336 000 synthetic α/β-peptide analogues of a natural AMP, aurein 1.2, based on an initial training set of 147 sequences and their biological activities against microbial pathogens and selectivity for microbes vs. mammalian cells. We show that the quantification of prediction uncertainty provided by GPR can guide the exploration of this design space via iterative experimental measurements to efficiently discover novel sequences with up to a 52-fold increase in antifungal selectivity compared to aurein 1.2. The highest selectivity peptide discovered using this approach features an unconventional substitution of cationic amino acids in the hydrophobic face and would be unlikely to be explored by conventional rational design. 
Overall, this work demonstrates a generalizable approach that integrates computation and experiment to accurately predict the selectivity of AMPs containing synthetic amino acids, which we employed to discover new α/β-peptides that hold promise as selective antifungal agents to combat the antimicrobial resistance crisis.
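The abstract describes using the prediction uncertainty of Gaussian process regression to decide which sequences to measure next. The sketch below is a minimal, dependency-free illustration of that loop step under stated assumptions: candidates are represented by hypothetical numeric feature vectors, the kernel is a squared-exponential with arbitrary hyperparameters, and the next candidate is chosen by an upper-confidence-bound score (mean plus `kappa` times standard deviation). The abstract does not state the authors' features, kernel, or selection rule, so none of this should be read as their method.

```python
import math


def rbf(a, b, length=1.0):
    """Squared-exponential kernel between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * length ** 2))


def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive-definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L


def solve_lower(L, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(L[i][k] * y[k] for k in range(i))) / L[i][i])
    return y


def solve_upper_t(L, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x


def gp_predict(X, y, candidates, noise=1e-4, length=1.0):
    """Return (mean, std) of a zero-mean GP posterior at each candidate."""
    K = [[rbf(a, b, length) + (noise if i == j else 0.0)
          for j, b in enumerate(X)] for i, a in enumerate(X)]
    L = cholesky(K)
    alpha = solve_upper_t(L, solve_lower(L, y))  # alpha = K^{-1} y
    predictions = []
    for c in candidates:
        k_star = [rbf(c, a, length) for a in X]
        mean = sum(k * a for k, a in zip(k_star, alpha))
        v = solve_lower(L, k_star)
        # Clamp at zero to guard against tiny negative values from round-off.
        var = max(rbf(c, c, length) - sum(vi * vi for vi in v), 0.0)
        predictions.append((mean, math.sqrt(var)))
    return predictions


def pick_next(X, y, candidates, kappa=2.0):
    """Index of the candidate with the highest mean + kappa * std (assumed rule)."""
    scores = [m + kappa * s for m, s in gp_predict(X, y, candidates)]
    return max(range(len(candidates)), key=scores.__getitem__)
```

In an iterative campaign, the selected candidate would be measured, appended to `X` and `y`, and the model refit before the next selection. With a large `kappa` the rule favors unexplored regions of the design space; with a small one it favors candidates already predicted to perform well.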
Affiliation(s)
- Douglas H Chang
  - Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
- Joshua D Richardson
  - Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
- Myung-Ryul Lee
  - Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
- David M Lynn
  - Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA
- Sean P Palecek
  - Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
- Reid C Van Lehn
  - Department of Chemical and Biological Engineering, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Chemistry, University of Wisconsin-Madison, Madison, WI, USA
9
Shukla D, Martin J, Morcos F, Potoyan DA. Thermal Adaptation of Cytosolic Malate Dehydrogenase Revealed by Deep Learning and Coevolutionary Analysis. J Chem Theory Comput 2025; 21:3277-3287. [PMID: 40079215 PMCID: PMC11948321 DOI: 10.1021/acs.jctc.4c01774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2024] [Revised: 03/06/2025] [Accepted: 03/07/2025] [Indexed: 03/14/2025]
Abstract
Protein evolution has shaped enzymes that maintain stability and function across diverse thermal environments. While sequence variation, thermal stability and conformational dynamics are known to influence an enzyme's thermal adaptation, how these factors collectively govern stability and function across diverse temperatures remains unresolved. Cytosolic malate dehydrogenase (cMDH), a citric acid cycle enzyme, is an ideal model for studying these mechanisms due to its temperature-sensitive flexibility and broad presence in species from diverse thermal environments. In this study, we employ techniques inspired by deep learning and statistical mechanics to uncover how sequence variation and conformational dynamics shape patterns of cMDH's thermal adaptation. By integrating coevolutionary models with variational autoencoders (VAE), we generate a latent generative landscape (LGL) of the cMDH sequence space, enabling us to explore mutational pathways and predict fitness using direct coupling analysis (DCA). Structure predictions via AlphaFold and molecular dynamics simulations further illuminate how variations in hydrophobic interactions and conformational flexibility contribute to the thermal stability of warm- and cold-adapted cMDH orthologs. Notably, we identify the ratio of hydrophobic contacts between two regions as a predictive order parameter for thermal stability features, providing a quantitative metric for understanding cMDH dynamics across temperatures. The integrative computational framework employed in this study provides mechanistic insights into protein adaptation at both sequence and structural levels, offering unique perspectives on the evolution of thermal stability and creating avenues for the rational design of proteins with optimized thermal properties.
Affiliation(s)
- Divyanshu Shukla
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, Iowa 50011, United States
- Jonathan Martin
  - Department of Biological Sciences, UT Dallas, Richardson, TX 75080, United States
- Faruck Morcos
  - Department of Biological Sciences, UT Dallas, Richardson, TX 75080, United States
  - Departments of Bioengineering and Physics, UT Dallas, Richardson, TX 75080, United States
  - Center for Systems Biology, UT Dallas, Richardson, TX 75080, United States
- Davit A. Potoyan
  - Department of Chemistry, Iowa State University, Ames, Iowa 50011, United States
  - Department of Biochemistry, Biophysics and Molecular Biology, Iowa State University, Ames, Iowa 50011, United States
  - Bioinformatics and Computational Biology Program, Iowa State University, Ames, Iowa 50011, United States
10
NaderiAlizadeh N, Singh R. Aggregating residue-level protein language model embeddings with optimal transport. Bioinform Adv 2025; 5:vbaf060. [PMID: 40170888 PMCID: PMC11961220 DOI: 10.1093/bioadv/vbaf060] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2024] [Revised: 02/13/2025] [Accepted: 03/17/2025] [Indexed: 04/03/2025]
Abstract
Motivation: Protein language models (PLMs) have emerged as powerful approaches for mapping protein sequences into embeddings suitable for various applications. As protein representation schemes, PLMs generate per-token (i.e. per-residue) representations, resulting in variable-sized outputs based on protein length. This variability poses a challenge for protein-level prediction tasks that require uniform-sized embeddings for consistent analysis across different proteins. Previous work has typically used average pooling to summarize token-level PLM outputs, but it is unclear whether this method effectively prioritizes the relevant information across token-level representations. Results: We introduce a novel method utilizing optimal transport to convert variable-length PLM outputs into fixed-length representations. We conceptualize per-token PLM outputs as samples from a probabilistic distribution and employ sliced-Wasserstein distances to map these samples against a reference set, creating a Euclidean embedding in the output space. The resulting embedding is agnostic to the length of the input and represents the entire protein. We demonstrate the superiority of our method over average pooling for several downstream prediction tasks, particularly with constrained PLM sizes, enabling smaller-scale PLMs to match or exceed the performance of average-pooled larger-scale PLMs. Our aggregation scheme is especially effective for longer protein sequences by capturing essential information that might be lost through average pooling. Availability and implementation: Our implementation code can be found at https://github.com/navid-naderi/PLM_SWE.
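To make the contrast between average pooling and a sliced-Wasserstein style aggregation concrete, here is a simplified sketch that is not taken from the linked repository. It assumes a fixed set of reference points and fixed projection directions (in practice these could be learned), projects the per-token vectors onto each direction, resamples the sorted projections at the reference set's quantile levels, and records the differences from the sorted reference projections. All function names are hypothetical.

```python
def average_pool(tokens):
    """Baseline: mean of the per-token vectors, one value per feature."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def project(points, direction):
    """Sorted 1-D projections of the points onto one direction."""
    return sorted(sum(p * d for p, d in zip(point, direction)) for point in points)


def quantile_resample(sorted_values, m):
    """Linearly interpolate a sorted sample at m evenly spaced quantile levels."""
    n = len(sorted_values)
    resampled = []
    for j in range(m):
        # Map the j-th of m quantile midpoints onto the index axis of the sample.
        pos = (j + 0.5) / m * n - 0.5
        pos = min(max(pos, 0.0), n - 1.0)
        lo = int(pos)
        hi = min(lo + 1, n - 1)
        frac = pos - lo
        resampled.append(sorted_values[lo] * (1.0 - frac) + sorted_values[hi] * frac)
    return resampled


def sliced_wasserstein_embed(tokens, reference, directions):
    """Fixed-length embedding of size len(reference) * len(directions).

    For each direction, the entry-wise difference between the input's
    resampled quantiles and the reference's sorted projections is the
    1-D optimal transport displacement along that slice.
    """
    m = len(reference)
    embedding = []
    for direction in directions:
        token_quantiles = quantile_resample(project(tokens, direction), m)
        reference_sorted = project(reference, direction)
        embedding.extend(t - r for t, r in zip(token_quantiles, reference_sorted))
    return embedding
```

Both functions return a vector whose length does not depend on the number of tokens. The difference is that average pooling keeps only the mean of each feature, while the sliced embedding keeps `len(reference)` quantile offsets per direction, so it retains information about the spread and shape of the token distribution.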
Affiliation(s)
- Navid NaderiAlizadeh
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27705, United States
- Rohit Singh
  - Department of Biostatistics and Bioinformatics, Duke University, Durham, NC 27705, United States
  - Department of Cell Biology, Duke University, Durham, NC 27705, United States
11
Kohout P, Vasina M, Majerova M, Novakova V, Damborsky J, Bednar D, Marek M, Prokop Z, Mazurenko S. Engineering Dehalogenase Enzymes Using Variational Autoencoder-Generated Latent Spaces and Microfluidics. JACS Au 2025; 5:838-850. [PMID: 40017771 PMCID: PMC11862945 DOI: 10.1021/jacsau.4c01101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2024] [Revised: 01/23/2025] [Accepted: 01/30/2025] [Indexed: 03/01/2025]
Abstract
Enzymes play a crucial role in sustainable industrial applications, with their optimization posing a formidable challenge due to the intricate interplay among residues. Computational methodologies predominantly rely on evolutionary insights of homologous sequences. However, deciphering the evolutionary variability and complex dependencies among residues presents substantial hurdles. Here, we present a new machine-learning method based on variational autoencoders and evolutionary sampling strategy to address those limitations. We customized our method to generate novel sequences of model enzymes, haloalkane dehalogenases. Three design-build-test cycles improved the solubility of variants from 11% to 75%. Thorough experimental validation including the microfluidic device MicroPEX resulted in 20 multiple-point variants. Nine of them, sharing as little as 67% sequence similarity with the template, showed a melting temperature increase of up to 9 °C and an average improvement of 3 °C. The most stable variant demonstrated a 3.5-fold increase in activity compared to the template. High-quality experimental data collected with 20 variants represent a valuable data set for the critical validation of novel protein design approaches. Python scripts, Jupyter notebooks, and data sets are available on GitHub (https://github.com/loschmidt/vae-dehalogenases), and interactive calculations will be possible via https://loschmidt.chemi.muni.cz/fireprotasr/.
Affiliation(s)
- Pavel Kohout
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Michal Vasina
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Marika Majerova
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Veronika Novakova
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Jiri Damborsky
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- David Bednar
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Martin Marek
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Zbynek Prokop
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
- Stanislav Mazurenko
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Brno 611 37, Czech Republic
  - International Clinical Research Centre, St. Anne’s Hospital, Brno 656 91, Czech Republic
12
Adams E, Bai L, Lee M, Yu Y, AlQuraishi M. From Mechanistic Interpretability to Mechanistic Biology: Training, Evaluating, and Interpreting Sparse Autoencoders on Protein Language Models. bioRxiv 2025:2025.02.06.636901. [PMID: 39975216 PMCID: PMC11839115 DOI: 10.1101/2025.02.06.636901] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/21/2025]
Abstract
Protein language models (pLMs) are powerful predictors of protein structure and function, learning through unsupervised training on millions of protein sequences. pLMs are thought to capture common motifs in protein sequences, but the specifics of pLM features are not well understood. Identifying these features would not only shed light on how pLMs work, but potentially uncover novel protein biology: studying the model to study the biology. Motivated by this, we train sparse autoencoders (SAEs) on the residual stream of a pLM, ESM-2. By characterizing SAE features, we determine that pLMs use a combination of generic features and family-specific features to represent a protein. In addition, we demonstrate how known sequence determinants of properties such as thermostability and subcellular localization can be identified by linear probing of SAE features. For predictive features without known functional associations, we hypothesize their role in unknown mechanisms and provide visualization tools to aid their interpretation. Our study gives a better understanding of the limitations of pLMs, and demonstrates how SAE features can be used to help generate hypotheses for biological mechanisms. We release our code, model weights and feature visualizer.
Affiliation(s)
- Etowah Adams
- Department of Systems Biology, Columbia University
- Minji Lee
- Department of Systems Biology, Columbia University
- Yiyang Yu
- Department of Systems Biology, Columbia University
|
13
|
Bowyer S, Allen DJ, Furnham N. Unveiling the ghost: machine learning's impact on the landscape of virology. J Gen Virol 2025; 106. [PMID: 39804261 DOI: 10.1099/jgv.0.002067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/02/2025] Open
Abstract
The complexity and speed of evolution in viruses with RNA genomes make predictive identification of variants with epidemic or pandemic potential challenging. In recent years, machine learning has become an increasingly capable technology for addressing this challenge, as advances in methods and computational power have dramatically improved the performance of models and led to their widespread adoption across industries and disciplines. Nascent applications of machine learning technology to virus research have now expanded, providing new tools for handling large-scale datasets and leading to a reshaping of existing workflows for phenotype prediction, phylogenetic analysis, drug discovery and more. This review explores how machine learning has been applied to and has impacted the study of viruses, before addressing the strengths and limitations of its techniques and finally highlighting the next steps that are needed for the technology to reach its full potential in this challenging and ever-relevant research area.
Affiliation(s)
- Sebastian Bowyer
- Department of Infection Biology, Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, UK
- David J Allen
- Department of Comparative Biomedical Sciences, Section Infection and Immunity, School of Veterinary Medicine, Faculty of Health and Medical Sciences, University of Surrey, Guildford, UK
- Nicholas Furnham
- Department of Infection Biology, Faculty of Infectious and Tropical Diseases, London School of Hygiene and Tropical Medicine, London, UK
|
14
|
Ghazikhani H, Butler G. Ion channel classification through machine learning and protein language model embeddings. J Integr Bioinform 2024; 21:jib-2023-0047. [PMID: 39572876 PMCID: PMC11698620 DOI: 10.1515/jib-2023-0047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Accepted: 09/04/2024] [Indexed: 01/06/2025] Open
Abstract
Ion channels are critical membrane proteins that regulate ion flux across cellular membranes, influencing numerous biological functions. The resource-intensive nature of traditional wet lab experiments for ion channel identification has led to an increasing emphasis on computational techniques. This study extends our previous work on protein language models for ion channel prediction, significantly advancing the methodology and performance. We employ a comprehensive array of machine learning algorithms, including k-Nearest Neighbors, Random Forest, Support Vector Machines, and Feed-Forward Neural Networks, alongside a novel Convolutional Neural Network (CNN) approach. These methods leverage fine-tuned embeddings from ProtBERT, ProtBERT-BFD, and MembraneBERT to differentiate ion channels from non-ion channels. Our empirical findings demonstrate that TooT-BERT-CNN-C, which combines features from ProtBERT-BFD and a CNN, substantially surpasses existing benchmarks. On our original dataset, it achieves a Matthews Correlation Coefficient (MCC) of 0.8584 and an accuracy of 98.35 %. More impressively, on a newly curated, larger dataset (DS-Cv2), it attains an MCC of 0.9492 and an ROC AUC of 0.9968 on the independent test set. These results not only highlight the power of integrating protein language models with deep learning for ion channel classification but also underscore the importance of using up-to-date, comprehensive datasets in bioinformatics tasks. Our approach represents a significant advancement in computational methods for ion channel identification, with potential implications for accelerating research in ion channel biology and aiding drug discovery efforts.
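The abstract above reports both accuracy and the Matthews Correlation Coefficient (MCC). As a minimal sketch of why the two can differ on imbalanced data, the following computes both from binary confusion-matrix counts. The counts are hypothetical and chosen only for illustration; they are not taken from the paper, and the function names are assumptions.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from binary confusion-matrix counts.

    Returns 0.0 when the denominator is zero (a common convention).
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical imbalanced test set: 50 positives, 950 negatives.
counts = dict(tp=40, tn=940, fp=10, fn=10)
print(f"accuracy = {accuracy(**counts):.4f}")
print(f"MCC      = {matthews_corrcoef(**counts):.4f}")
```

With these illustrative counts the accuracy is high while the MCC is noticeably lower, because MCC penalizes errors on the minority class that accuracy largely hides.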
Affiliation(s)
- Hamed Ghazikhani
- Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
- Gregory Butler
- Department of Computer Science and Software Engineering, Concordia University, Montreal, Canada
|
15
|
Dong B, Liu Z, Xu D, Hou C, Dong G, Zhang T, Wang G. SERT-StructNet: Protein secondary structure prediction method based on multi-factor hybrid deep model. Comput Struct Biotechnol J 2024; 23:1364-1375. [PMID: 38596312 PMCID: PMC11001767 DOI: 10.1016/j.csbj.2024.03.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2023] [Revised: 03/20/2024] [Accepted: 03/21/2024] [Indexed: 04/11/2024] Open
Abstract
Protein secondary structure prediction (PSSP) is a pivotal research endeavour that plays a crucial role in the comprehensive elucidation of protein functions and properties. Current prediction methodologies center on deep-learning techniques, particularly those using multi-factor features. Diverging from existing approaches, in this study, we placed special emphasis on the effects of amino acid properties and protein secondary structure propensity scores (SSPs) on secondary structure during the meticulous selection of multi-factor features. This differential feature-selection strategy results in a distinctive and effective amalgamation of the sequence and property features. To harness these multi-factor features optimally, we introduced a hybrid deep feature extraction model. The model initially employs mechanisms such as dilated convolution (D-Conv) and a channel attention network (SENet) for local feature extraction and targeted channel enhancement. Subsequently, a combination of recurrent neural network variants (BiGRU and BiLSTM), along with a transformer module, was employed to achieve global bidirectional information consideration and feature enhancement. This approach to multi-factor feature input and multi-level feature processing enabled a comprehensive exploration of intricate associations among amino acid residues in protein sequences, yielding a Q3 accuracy of 84.9% and an SOV score of 85.1%. The overall performance surpasses that of the comparable methods. This study introduces a novel and efficient method for determining the PSSP domain, which is poised to deepen our understanding of the practical applications of protein molecular structures.
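For readers unfamiliar with the Q3 metric quoted above, here is a minimal sketch under the usual definition: the percentage of residues whose predicted three-state label (helix H, strand E, coil C) matches the reference. The function name and the example strings are hypothetical and are not taken from the paper.

```python
def q3_accuracy(true_ss: str, pred_ss: str) -> float:
    """Percentage of positions where the predicted state equals the true state.

    Both strings use the three-state alphabet H (helix), E (strand), C (coil).
    """
    if len(true_ss) != len(pred_ss) or not true_ss:
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(t == p for t, p in zip(true_ss, pred_ss))
    return 100.0 * matches / len(true_ss)


# Hypothetical 10-residue example with one mispredicted position.
reference = "HHHHEEEECC"
predicted = "HHHCEEEECC"
print(q3_accuracy(reference, predicted))
```

The segment overlap (SOV) score mentioned alongside Q3 evaluates agreement at the level of whole secondary-structure segments and is not covered by this sketch.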
Affiliation(s)
- Benzhi Dong
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Zheng Liu
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Dali Xu
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Chang Hou
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Guanghui Dong
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Tianjiao Zhang
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
- Guohua Wang
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China
|
16
|
Harding-Larsen D, Funk J, Madsen NG, Gharabli H, Acevedo-Rocha CG, Mazurenko S, Welner DH. Protein representations: Encoding biological information for machine learning in biocatalysis. Biotechnol Adv 2024; 77:108459. [PMID: 39366493 DOI: 10.1016/j.biotechadv.2024.108459] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Revised: 09/19/2024] [Accepted: 09/29/2024] [Indexed: 10/06/2024]
Abstract
Enzymes offer a more environmentally friendly and low-impact solution to conventional chemistry, but they often require additional engineering for their application in industrial settings, an endeavour that is challenging and laborious. To address this issue, the power of machine learning can be harnessed to produce predictive models that enable the in silico study and engineering of improved enzymatic properties. Such machine learning models, however, require the conversion of the complex biological information to a numerical input, also called protein representations. These inputs demand special attention to ensure the training of accurate and precise models, and, in this review, we therefore examine the critical step of encoding protein information to numeric representations for use in machine learning. We selected the most important approaches for encoding the three distinct biological protein representations - primary sequence, 3D structure, and dynamics - to explore their requirements for employment and inductive biases. Combined representations of proteins and substrates are also introduced as emergent tools in biocatalysis. We propose the division of fixed representations, a collection of rule-based encoding strategies, and learned representations extracted from the latent spaces of large neural networks. To select the most suitable protein representation, we propose two main factors to consider. The first one is the model setup, which is influenced by the size of the training dataset and the choice of architecture. The second factor is the model objectives such as consideration about the assayed property, the difference between wild-type models and mutant predictors, and requirements for explainability. This review is aimed at serving as a source of information and guidance for properly representing enzymes in future machine learning models for biocatalysis.
Affiliation(s)
- David Harding-Larsen
- The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Jonathan Funk
- The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Niklas Gesmar Madsen
- The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Hani Gharabli
- The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Carlos G Acevedo-Rocha
- The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Stanislav Mazurenko
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic; International Clinical Research Center, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Ditte Hededam Welner
- The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
|
17
|
Kantroo P, Wagner GP, Machta BB. High fitness paths can connect proteins with low sequence overlap. bioRxiv 2024:2024.11.13.623265. [PMID: 39605533 PMCID: PMC11601429 DOI: 10.1101/2024.11.13.623265] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
The structure and function of a protein are determined by its amino acid sequence. While random mutations change a protein's sequence, evolutionary forces shape its structural fold and biological activity. Studies have shown that neutral networks can connect a local region of sequence space by single residue mutations that preserve viability. However, the larger-scale connectedness of protein morphospace remains poorly understood. Recent advances in artificial intelligence have enabled us to computationally predict a protein's structure and quantify its functional plausibility. Here we build on these tools to develop an algorithm that generates viable paths between distantly related extant protein pairs. The intermediate sequences in these paths differ by single residue changes over subsequent steps - substitutions, insertions and deletions are admissible moves. Their fitness is evaluated using the protein language model ESM2, and maintained as high as possible subject to the constraints of the traversal. We document the qualitative variation across paths generated between progressively divergent protein pairs, some of which do not even acquire the same structural fold. The ease of interpolating between two sequences could be used as a proxy for the likelihood of homology between them.
Affiliation(s)
- Pranav Kantroo
- Computational Biology and Bioinformatics Program, Yale University, New Haven, CT-06520, USA
- Quantitative Biology Institute, Yale University, New Haven, CT-06520, USA
- Günter P. Wagner
- Emeritus, Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT-06520, USA
- Department of Evolutionary Biology, University of Vienna, Djerassi Platz 1, A-1030 Vienna, Austria
- Hagler Institute for Advanced Studies, Texas A&M, College Station, TX-77843, USA
- Benjamin B. Machta
- Department of Physics, Yale University, New Haven, CT-06520, USA
- Quantitative Biology Institute, Yale University, New Haven, CT-06520, USA
|
18
|
Chen Z, Li H, Zhang C, Zhang H, Zhao Y, Cao J, He T, Xu L, Xiao H, Li Y, Shao H, Yang X, He X, Fang G. Crystal Structure Prediction Using Generative Adversarial Network with Data-Driven Latent Space Fusion Strategy. J Chem Theory Comput 2024; 20:9627-9641. [PMID: 39454048 DOI: 10.1021/acs.jctc.4c01096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/27/2024]
Abstract
Crystal structure prediction (CSP) is an important field of material design. Herein, we propose a novel generative adversarial network model, guided by a data-driven approach and incorporating the real physical structure of crystals, to address the complexity of high-dimensional data and improve prediction accuracy in materials science. The model, termed GAN-DDLSF, introduces a novel sampling method called data-driven latent space fusion (DDLSF), which aims to optimize the latent space of generative adversarial networks (GANs) by combining the statistical properties of real data with a standard Gaussian distribution, effectively mitigating the "mode collapse" problem prevalent in GANs. Our approach introduces a more refined generation mechanism specifically for binary crystal structures such as gallium nitride (GaN). By optimizing for the specific crystallographic features of GaN while maintaining structural rationality, we achieve higher precision and efficiency in predicting and designing structures for this particular material system. The model generates 9321 GaN binary crystal structures, with 16.59% reaching a stable state and 24.21% found to be metastable. These results can significantly enhance the accuracy of crystal structure predictions and provide valuable insights into the potential of the GAN-DDLSF approach for the discovery and design of binary, ternary, and multinary materials, offering new perspectives and methods for materials science research and applications.
Affiliation(s)
- Zian Chen
- Key Laboratory of Carbon Materials of Zhejiang Province, College of Chemistry and Materials Engineering, Wenzhou University, Wenzhou 325035, China
- Haichao Li
- Key Laboratory of Carbon Materials of Zhejiang Province, College of Chemistry and Materials Engineering, Wenzhou University, Wenzhou 325035, China
- Chen Zhang
- Key Laboratory of Carbon Materials of Zhejiang Province, College of Chemistry and Materials Engineering, Wenzhou University, Wenzhou 325035, China
- Hongbin Zhang
- Key Laboratory of Carbon Materials of Zhejiang Province, College of Chemistry and Materials Engineering, Wenzhou University, Wenzhou 325035, China
- Yongxiao Zhao
- Key Laboratory of Carbon Materials of Zhejiang Province, College of Chemistry and Materials Engineering, Wenzhou University, Wenzhou 325035, China
- Jian Cao
- Key Laboratory of Carbon Materials of Zhejiang Province, College of Chemistry and Materials Engineering, Wenzhou University, Wenzhou 325035, China
- Tao He
- Key Laboratory of Carbon Materials of Zhejiang Province, College of Chemistry and Materials Engineering, Wenzhou University, Wenzhou 325035, China
- Lina Xu
- Key Laboratory of Carbon Materials of Zhejiang Province, College of Chemistry and Materials Engineering, Wenzhou University, Wenzhou 325035, China
- Hongping Xiao
- Key Laboratory of Carbon Materials of Zhejiang Province, College of Chemistry and Materials Engineering, Wenzhou University, Wenzhou 325035, China
- Yi Li
- College of Computer Science and Artificial Intelligence, Wenzhou University, Wenzhou 325035, China
- Hezhu Shao
- College of Electrical and Electronic Engineering, Wenzhou University, Wenzhou 325035, China
- Xiaoyu Yang
- Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
- Xiao He
- Shanghai Engineering Research Center of Molecular Therapeutics and New Drug Development, Shanghai Frontiers Science Center of Molecule Intelligent Syntheses, School of Chemistry and Molecular Engineering, East China Normal University, Shanghai 200062, China
- Chongqing Key Laboratory of Precision Optics, Chongqing Institute of East China Normal University, Chongqing 401120, China
- New York University-East China Normal University Center for Computational Chemistry, New York University Shanghai, Shanghai 200062, China
- Guoyong Fang
- Key Laboratory of Carbon Materials of Zhejiang Province, College of Chemistry and Materials Engineering, Wenzhou University, Wenzhou 325035, China
|
19
|
Alazmi M. Enzyme catalytic efficiency prediction: employing convolutional neural networks and XGBoost. Front Artif Intell 2024; 7:1446063. [PMID: 39498388 PMCID: PMC11532030 DOI: 10.3389/frai.2024.1446063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2024] [Accepted: 10/07/2024] [Indexed: 11/07/2024] Open
Abstract
Introduction: In the intricate realm of enzymology, the precise quantification of enzyme efficiency, epitomized by the turnover number (kcat), is a paramount yet elusive objective. Existing methodologies, though sophisticated, often grapple with the inherent stochasticity and multifaceted nature of enzymatic reactions. Thus, there arises a necessity to explore avant-garde computational paradigms.
Methods: In this context, we introduce "enzyme catalytic efficiency prediction (ECEP)," leveraging advanced deep learning techniques to enhance the previous implementation, TurNuP, for predicting the enzyme catalase kcat. Our approach significantly outperforms prior methodologies, incorporating new features derived from enzyme sequences and chemical reaction dynamics. Through ECEP, we unravel the intricate enzyme-substrate interactions, capturing the nuanced interplay of molecular determinants.
Results: Preliminary assessments, compared against established models like TurNuP and DLKcat, underscore the superior predictive capabilities of ECEP, marking a pivotal shift in in silico enzymatic turnover number estimation. This study enriches the computational toolkit available to enzymologists and lays the groundwork for future explorations in the burgeoning field of bioinformatics. This paper proposes a multi-feature ensemble deep learning approach that predicts enzyme kinetic parameters using an ensemble convolutional neural network and XGBoost, taking a weighted average of each feature-based model's output to outperform traditional machine learning methods. The proposed "ECEP" model significantly outperformed existing methodologies, reducing the mean squared error (MSE) by 0.35, from 0.81 to 0.46, and raising the R-squared score from 0.44 to 0.54, thereby demonstrating its superior accuracy and effectiveness in enzyme catalytic efficiency prediction.
Discussion: This improvement underscores the model's potential to enhance the field of bioinformatics, setting a new benchmark for performance.
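The abstract describes combining per-feature model outputs by a weighted average and evaluating with mean squared error. The following is a minimal sketch of that general idea only; the two "models", their predictions, the weights, and the target values are all hypothetical placeholders and do not reproduce ECEP or its reported numbers.

```python
def weighted_average(predictions, weights):
    """Combine several models' predictions for one sample using normalized weights."""
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(predictions, weights)) / total


def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical outputs of two feature-based models on three samples.
model_a = [1.0, 2.0, 3.0]      # e.g. a sequence-feature model (assumed)
model_b = [2.0, 2.0, 5.0]      # e.g. a reaction-feature model (assumed)
weights = [0.25, 0.75]         # assumed weights, e.g. tuned on validation data
y_true = [2.0, 2.0, 4.0]       # hypothetical targets

ensemble = [weighted_average(pair, weights) for pair in zip(model_a, model_b)]
print(ensemble)
print(mean_squared_error(y_true, ensemble))
```

In this toy example the ensemble error is lower than that of either individual model, which is the motivation for weighting complementary predictors; how the weights are chosen in the paper is not stated in the abstract.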
Affiliation(s)
- Meshari Alazmi
- College of Computer Science and Engineering, University of Ha’il, Ha’il, Saudi Arabia
|
20
|
Susanty M, Mursalim MKN, Hertadi R, Purwarianti A, LE Rajab T. Leveraging protein language model embeddings and logistic regression for efficient and accurate in-silico acidophilic proteins classification. Comput Biol Chem 2024; 112:108163. [PMID: 39098138 DOI: 10.1016/j.compbiolchem.2024.108163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2024] [Revised: 07/02/2024] [Accepted: 07/24/2024] [Indexed: 08/06/2024]
Abstract
The increasing demand for eco-friendly technologies in biotechnology necessitates effective and sustainable catalysts. Acidophilic proteins, functioning optimally in highly acidic environments, hold immense promise for various applications, including food production, biofuels, and bioremediation. However, limited knowledge about these proteins hinders their exploration. This study addresses this gap by employing in silico methods utilizing computational tools and machine learning. We propose a novel approach to predict acidophilic proteins using protein language models (PLMs), accelerating discovery without extensive lab work. Our investigation highlights the potential of PLMs in understanding and harnessing acidophilic proteins for scientific and industrial advancements. We introduce the ACE model, which combines a simple Logistic Regression model with embeddings derived from protein sequences processed by the ProtT5 PLM. This model achieves high performance on an independent test set, with accuracy (0.91), F1-score (0.93), and Matthews correlation coefficient (0.76). To our knowledge, this is the first application of pre-trained PLM embeddings for acidophilic protein classification. The ACE model serves as a powerful tool for exploring protein acidophilicity, paving the way for future advancements in protein design and engineering.
Affiliation(s)
- Meredita Susanty
- Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Universitas Pertamina, School of Computer Science, Jl Teuku Nyak Arief Jakarta Selatan DKI, Jakarta, Indonesia
- Muhammad Khaerul Naim Mursalim
- Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Universitas Universal, Kompleks Maha Vihara Duta Maitreya Bukit Beruntung, Sei Panas Batam, Kepulauan, Riau 29456, Indonesia
- Rukman Hertadi
- Institut Teknologi Bandung Faculty of Math and Natural Sciences, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia
- Ayu Purwarianti
- Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia; Center for Artificial Intelligence (U-CoE AI-VLB), Institut Teknologi Bandung, Bandung, Indonesia
- Tati LE Rajab
- Institut Teknologi Bandung School of Electrical Engineering and Informatics, Jl. Ganesa 10, Bandung, Jawa Barat, Indonesia
|
21
|
Thompson M, Martín M, Olmo TS, Rajesh C, Koo PK, Bolognesi B, Lehner B. Massive experimental quantification of amyloid nucleation allows interpretable deep learning of protein aggregation. bioRxiv 2024:2024.07.13.603366. [PMID: 39071305 PMCID: PMC11275847 DOI: 10.1101/2024.07.13.603366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/30/2024]
Abstract
Protein aggregation is a pathological hallmark of more than fifty human diseases and a major problem for biotechnology. Methods have been proposed to predict aggregation from sequence, but these have been trained and evaluated on small and biased experimental datasets. Here we directly address this data shortage by experimentally quantifying the amyloid nucleation of >100,000 protein sequences. This unprecedented dataset reveals the limited performance of existing computational methods and allows us to train CANYA, a convolution-attention hybrid neural network that accurately predicts amyloid nucleation from sequence. We adapt genomic neural network interpretability analyses to reveal CANYA's decision-making process and learned grammar. Our results illustrate the power of massive experimental analysis of random sequence-spaces and provide an interpretable and robust neural network model to predict amyloid nucleation.
Affiliation(s)
- Mike Thompson
- Systems and Synthetic Biology, Centre for Genomic Regulation, The Barcelona Institute for Science and Technology (BIST), Barcelona, Spain
- Mariano Martín
- Institute for Bioengineering of Catalonia (IBEC), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Trinidad Sanmartín Olmo
- Institute for Bioengineering of Catalonia (IBEC), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Chandana Rajesh
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Peter K. Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Benedetta Bolognesi
- Institute for Bioengineering of Catalonia (IBEC), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Ben Lehner
- Systems and Synthetic Biology, Centre for Genomic Regulation, The Barcelona Institute for Science and Technology (BIST), Barcelona, Spain
- University Pompeu Fabra (UPF), Barcelona, Spain
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, UK
|
22
|
Dong B, Liu Z, Xu D, Hou C, Niu N, Wang G. Impact of Multi-Factor Features on Protein Secondary Structure Prediction. Biomolecules 2024; 14:1155. [PMID: 39334921 PMCID: PMC11430196 DOI: 10.3390/biom14091155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2024] [Revised: 09/05/2024] [Accepted: 09/10/2024] [Indexed: 09/30/2024] Open
Abstract
Protein secondary structure prediction (PSSP) plays a crucial role in resolving protein functions and properties. Significant progress has been made in this field in recent years, and improving prediction accuracy with a variety of protein-related features, including amino acid sequences, position-specific score matrices (PSSM), amino acid properties, and secondary structure trend factors, is an important technical route. However, a comprehensive evaluation of the impact of these factor features on secondary structure prediction is lacking in current work. This study quantitatively analyzes the impact of several major factors on secondary structure prediction models using a more explanatory four-class machine learning approach. The applicability of each factor in the different types of methods, the extent to which the different methods work on each factor, and the evaluation of the effect of multi-factor combinations are explored in detail. Through experiments and analyses, it was found that PSSM performs best in methods with strong high-dimensional features and complex feature extraction capabilities, while amino acid sequences, although performing poorly overall, perform relatively well in methods with strong linear processing capabilities. Also, the combination of amino acid properties and trend factors significantly improved the prediction performance. This study provides empirical evidence for future researchers to optimize multi-factor feature combinations and apply them to protein secondary structure prediction models, which is beneficial in further optimizing the use of these factors to enhance the performance of protein secondary structure prediction models.
Affiliation(s)
- Na Niu
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China; (B.D.); (Z.L.); (D.X.); (C.H.)
- Guohua Wang
- College of Computer and Control Engineering, Northeast Forestry University, Harbin 150040, China; (B.D.); (Z.L.); (D.X.); (C.H.)
|
23
|
Struski L, Sadowski M, Danel T, Tabor J, Podolak IT. Feature-Based Interpolation and Geodesics in the Latent Spaces of Generative Models. IEEE Trans Neural Netw Learn Syst 2024; 35:12068-12082. [PMID: 37028296 DOI: 10.1109/tnnls.2023.3251848] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Interpolating between points is a problem connected simultaneously with finding geodesics and the study of generative models. In the case of geodesics, we search for the curves with the shortest length, while in the case of generative models, we typically apply linear interpolation in the latent space. However, this interpolation implicitly uses the fact that the Gaussian is unimodal. Thus, interpolating when the latent density is non-Gaussian is an open problem. In this article, we present a general and unified approach to interpolation, which simultaneously allows us to search for geodesics and interpolating curves in latent space in the case of arbitrary density. Our results have a strong theoretical background based on the introduced quality measure of an interpolating curve. In particular, we show that maximizing the quality measure of the curve can be equivalently understood as a search for a geodesic under a certain redefinition of the Riemannian metric on the space. We provide examples in three important cases. First, we show that our approach can be easily applied to finding geodesics on manifolds. Next, we focus our attention on finding interpolations in pretrained generative models. We show that our model works effectively in the case of arbitrary density. Moreover, we can interpolate in the subset of the space consisting of data possessing a given feature. The last case is focused on finding interpolation in the space of chemical compounds.
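The abstract contrasts its approach with the standard baseline of linear interpolation in latent space. As a minimal sketch of that baseline only (not the authors' density-aware method), the following generates evenly spaced points on the straight line between two latent vectors. The function names and the two-dimensional example vectors are hypothetical.

```python
def lerp(z0, z1, t):
    """Point at fraction t (0 to 1) along the straight line from z0 to z1."""
    return [(1 - t) * a + t * b for a, b in zip(z0, z1)]


def interpolation_path(z0, z1, steps):
    """Evenly spaced latent points from z0 to z1, endpoints included."""
    if steps < 2:
        raise ValueError("need at least two steps to include both endpoints")
    return [lerp(z0, z1, i / (steps - 1)) for i in range(steps)]


# Hypothetical 2-D latent codes; in practice each point would be passed to a decoder.
path = interpolation_path([0.0, 0.0], [1.0, 2.0], steps=5)
for z in path:
    print(z)
```

A straight line ignores where the latent density is high, so intermediate points may fall in low-density regions when the latent distribution is not a unimodal Gaussian; that limitation is what the article sets out to address.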
|
24
|
Concha-Eloko R, Stock M, De Baets B, Briers Y, Sanjuán R, Domingo-Calap P, Boeckaerts D. DepoScope: Accurate phage depolymerase annotation and domain delineation using large language models. PLoS Comput Biol 2024; 20:e1011831. [PMID: 39102416 PMCID: PMC11326577 DOI: 10.1371/journal.pcbi.1011831] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2024] [Revised: 08/15/2024] [Accepted: 07/20/2024] [Indexed: 08/07/2024] Open
Abstract
Bacteriophages (phages) are viruses that infect bacteria. Many of them produce specific enzymes called depolymerases to break down external polysaccharide structures. Accurate annotation and domain identification of these depolymerases are challenging due to their inherent sequence diversity. Hence, we present DepoScope, a machine learning tool that combines a fine-tuned ESM-2 model with a convolutional neural network to identify depolymerase sequences and their enzymatic domains precisely. To accomplish this, we curated a dataset from the INPHARED phage genome database, created a polysaccharide-degrading domain database, and applied sequential filters to construct a high-quality dataset, which is subsequently used to train DepoScope. Our work is the first approach that combines sequence-level predictions with amino-acid-level predictions for accurate depolymerase detection and functional domain identification. In that way, we believe that DepoScope can greatly enhance our understanding of phage-host interactions at the level of depolymerases.
Affiliation(s)
- Robby Concha-Eloko
  - Institute for Integrative Systems Biology (I2SysBio), Universitat de Valencia-CSIC, Paterna, Spain
- Michiel Stock
  - KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Bernard De Baets
  - KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
- Yves Briers
  - Laboratory of Applied Biotechnology, Department of Biotechnology, Ghent University, Ghent, Belgium
- Rafael Sanjuán
  - Institute for Integrative Systems Biology (I2SysBio), Universitat de Valencia-CSIC, Paterna, Spain
- Pilar Domingo-Calap
  - Institute for Integrative Systems Biology (I2SysBio), Universitat de Valencia-CSIC, Paterna, Spain
- Dimitri Boeckaerts
  - KERMIT, Department of Data Analysis and Mathematical Modelling, Ghent University, Ghent, Belgium
  - Laboratory of Applied Biotechnology, Department of Biotechnology, Ghent University, Ghent, Belgium
25
Madan S, Lentzen M, Brandt J, Rueckert D, Hofmann-Apitius M, Fröhlich H. Transformer models in biomedicine. BMC Med Inform Decis Mak 2024; 24:214. [PMID: 39075407 PMCID: PMC11287876 DOI: 10.1186/s12911-024-02600-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2023] [Accepted: 07/08/2024] [Indexed: 07/31/2024] Open
Abstract
Deep neural networks (DNN) have fundamentally revolutionized the artificial intelligence (AI) field. The transformer model is a type of DNN that was originally used for the natural language processing tasks and has since gained more and more attention for processing various kinds of sequential data, including biological sequences and structured electronic health records. Along with this development, transformer-based models such as BioBERT, MedBERT, and MassGenie have been trained and deployed by researchers to answer various scientific questions originating in the biomedical domain. In this paper, we review the development and application of transformer models for analyzing various biomedical-related datasets such as biomedical textual data, protein sequences, medical structured-longitudinal data, and biomedical images as well as graphs. Also, we look at explainable AI strategies that help to comprehend the predictions of transformer-based models. Finally, we discuss the limitations and challenges of current models, and point out emerging novel research directions.
Affiliation(s)
- Sumit Madan
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
  - Institute of Computer Science, University of Bonn, Bonn, 53115, Germany
- Manuel Lentzen
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
  - Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
- Johannes Brandt
  - School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany
- Daniel Rueckert
  - School of Medicine, Klinikum Rechts der Isar, Technical University Munich, Munich, Germany
  - School of Computation, Information and Technology, Technical University Munich, Munich, Germany
  - Department of Computing, Imperial College London, London, UK
- Martin Hofmann-Apitius
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
  - Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
- Holger Fröhlich
  - Department of Bioinformatics, Fraunhofer Institute for Algorithms and Scientific Computing (SCAI), Schloss Birlinghoven, Sankt Augustin, 53757, Germany
  - Bonn-Aachen International Center for Information Technology (B-IT), University of Bonn, Bonn, 53115, Germany
26
Cuturello F, Celoria M, Ansuini A, Cazzaniga A. Enhancing predictions of protein stability changes induced by single mutations using MSA-based Language Models. Bioinformatics 2024; 40:btae447. [PMID: 39012369 PMCID: PMC11269464 DOI: 10.1093/bioinformatics/btae447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2024] [Revised: 06/19/2024] [Accepted: 07/10/2024] [Indexed: 07/17/2024] Open
Abstract
MOTIVATION Protein Language Models offer a new perspective for addressing challenges in structural biology, while relying solely on sequence information. Recent studies have investigated their effectiveness in forecasting shifts in thermodynamic stability caused by single amino acid mutations, a task known for its complexity due to the sparse availability of data, constrained by experimental limitations. To tackle this problem, we introduce two key novelties: leveraging a Protein Language Model that incorporates Multiple Sequence Alignments to capture evolutionary information, and using a recently released mega-scale dataset with rigorous data pre-processing to mitigate overfitting. RESULTS We ensure comprehensive comparisons by fine-tuning various pre-trained models, taking advantage of analyses such as ablation studies and baselines evaluation. Our methodology introduces a stringent policy to reduce the widespread issue of data leakage, rigorously removing sequences from the training set when they exhibit significant similarity with the test set. The MSA Transformer emerges as the most accurate among the models under investigation, given its capability to leverage co-evolution signals encoded in aligned homologous sequences. Moreover, the optimized MSA Transformer outperforms existing methods and exhibits enhanced generalization power, leading to a notable improvement in predicting changes in protein stability resulting from point mutations. AVAILABILITY AND IMPLEMENTATION Code and data at https://github.com/RitAreaSciencePark/PLM4Muts. SUPPLEMENTARY INFORMATION Supplementary Information is available at Bioinformatics online.
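The leakage policy described in this abstract (removing training sequences that are significantly similar to test sequences) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: `difflib.SequenceMatcher` stands in for a proper alignment-based identity measure, and the function names, toy sequences and 0.5 threshold are all hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Sequence


def similarity(a: str, b: str) -> float:
    """Crude similarity in [0, 1]; a stand-in for alignment-based sequence identity."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def remove_leaky_sequences(train: Sequence[str], test: Sequence[str], threshold: float) -> List[str]:
    """Keep only training sequences whose similarity to every test sequence is below threshold."""
    return [
        seq for seq in train
        if all(similarity(seq, held_out) < threshold for held_out in test)
    ]


# Toy sequences (assumed): the first training sequence differs from the test
# sequence by a single residue, the second shares no residues with it.
train_set = ["MKTAYIAKQR", "GGGGGGGGGG"]
test_set = ["MKTAYIAKQL"]
filtered = remove_leaky_sequences(train_set, test_set, threshold=0.5)
```

The filter is applied to the training set only, so the test set stays fixed and results remain comparable across models. For realistic dataset sizes a dedicated clustering or alignment tool would replace the quadratic loop.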
Affiliation(s)
- Francesca Cuturello
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
- Marco Celoria
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
  - HPC Department, CINECA National Supercomputing Center, Bologna 40033, Italy
- Alessio Ansuini
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
- Alberto Cazzaniga
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
27
Li MM, Huang Y, Sumathipala M, Liang MQ, Valdeolivas A, Ananthakrishnan AN, Liao K, Marbach D, Zitnik M. Contextual AI models for single-cell protein biology. bioRxiv 2024:2023.07.18.549602. [PMID: 37503080 PMCID: PMC10370131 DOI: 10.1101/2023.07.18.549602] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Indexed: 07/29/2023]
Abstract
Understanding protein function and developing molecular therapies require deciphering the cell types in which proteins act as well as the interactions between proteins. However, modeling protein interactions across biological contexts remains challenging for existing algorithms. Here, we introduce Pinnacle, a geometric deep learning approach that generates context-aware protein representations. Leveraging a multi-organ single-cell atlas, Pinnacle learns on contextualized protein interaction networks to produce 394,760 protein representations from 156 cell type contexts across 24 tissues. Pinnacle's embedding space reflects cellular and tissue organization, enabling zero-shot retrieval of the tissue hierarchy. Pretrained protein representations can be adapted for downstream tasks: enhancing 3D structure-based representations for resolving immuno-oncological protein interactions, and investigating drugs' effects across cell types. Pinnacle outperforms state-of-the-art models in nominating therapeutic targets for rheumatoid arthritis and inflammatory bowel diseases, and pinpoints cell type contexts with higher predictive capability than context-free models. Pinnacle's ability to adjust its outputs based on the context in which it operates paves way for large-scale context-specific predictions in biology.
Affiliation(s)
- Michelle M. Li
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Yepeng Huang
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Marissa Sumathipala
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Man Qing Liang
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Alberto Valdeolivas
  - Roche Pharma Research and Early Development, Pharmaceutical Sciences, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Basel, Switzerland
- Ashwin N. Ananthakrishnan
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Division of Gastroenterology, Massachusetts General Hospital, Boston, MA, USA
- Katherine Liao
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Division of Rheumatology, Inflammation, and Immunity, Brigham and Women’s Hospital, Boston, MA, USA
- Daniel Marbach
  - Roche Pharma Research and Early Development, Pharmaceutical Sciences, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Basel, Switzerland
- Marinka Zitnik
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
  - Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Allston, MA, USA
  - Broad Institute of MIT and Harvard, Cambridge, MA, USA
  - Harvard Data Science Initiative, Cambridge, MA, USA
28
Randall JR, Vieira LC, Wilke CO, Davies BW. Deep mutational scanning and machine learning for the analysis of antimicrobial-peptide features driving membrane selectivity. Nat Biomed Eng 2024; 8:842-853. [PMID: 39085646 PMCID: PMC12044605 DOI: 10.1038/s41551-024-01243-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2023] [Accepted: 05/12/2024] [Indexed: 08/02/2024]
Abstract
Many antimicrobial peptides directly disrupt bacterial membranes yet can also damage mammalian membranes. It is therefore central to their therapeutic use that rules governing the membrane selectivity of antimicrobial peptides be deciphered. However, this is difficult even for short peptides owing to the large combinatorial space of amino acid sequences. Here we describe a method for measuring the loss or maintenance of antimicrobial-peptide activity for thousands of peptide-sequence variants simultaneously, and its application to Protegrin-1, a potent yet toxic antimicrobial peptide, to determine the positional importance and flexibility of residues across its sequence while identifying variants with changes in membrane selectivity. More bacterially selective variants maintained a membrane-bound secondary structure while avoiding aromatic residues and cysteine pairs. A machine-learning model trained with our datasets accurately predicted membrane-specific activities for over 5.7 million Protegrin-1 variants, and identified one variant that showed substantially reduced toxicity and retention of activity in a mouse model of intraperitoneal infection. The high-throughput methodology may help elucidate sequence-structure-function relationships in antimicrobial peptides and inform the design of peptide-based synthetic drugs.
Affiliation(s)
- Justin R Randall
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
- Luiz C Vieira
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX, USA
- Claus O Wilke
  - Department of Integrative Biology, The University of Texas at Austin, Austin, TX, USA
- Bryan W Davies
  - Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
29
Norton-Baker B, Denton MCR, Murphy NP, Fram B, Lim S, Erickson E, Gauthier NP, Beckham GT. Enabling high-throughput enzyme discovery and engineering with a low-cost, robot-assisted pipeline. Sci Rep 2024; 14:14449. [PMID: 38914665 PMCID: PMC11196671 DOI: 10.1038/s41598-024-64938-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2024] [Accepted: 06/14/2024] [Indexed: 06/26/2024] Open
Abstract
As genomic databases expand and artificial intelligence tools advance, there is a growing demand for efficient characterization of large numbers of proteins. To this end, here we describe a generalizable pipeline for high-throughput protein purification using small-scale expression in E. coli and an affordable liquid-handling robot. This low-cost platform enables the purification of 96 proteins in parallel with minimal waste and is scalable for processing hundreds of proteins weekly per user. We demonstrate the performance of this method with the expression and purification of the leading poly(ethylene terephthalate) hydrolases reported in the literature. Replicate experiments demonstrated reproducibility and enzyme purity and yields (up to 400 µg) sufficient for comprehensive analyses of both thermostability and activity, generating a standardized benchmark dataset for comparing these plastic-degrading enzymes. The cost-effectiveness and ease of implementation of this platform render it broadly applicable to diverse protein characterization challenges in the biological sciences.
Grants
- DE-SC0022024 U.S. Department of Energy, Office of Science, Office of Biological and Environmental Research (BER), Genomic Science Program
- DE-AC36-08GO28308 Advanced Materials and Manufacturing Technologies Office (AMMTO)
- U.S. Department of Energy Office of Energy Efficiency and Renewable Energy Bioenergy Technologies Office (BETO)
- Bio-Optimized Technologies to keep Thermoplastics out of Landfills and the Environment (BOTTLE) Consortium
- Dana-Farber Cancer Institute
Affiliation(s)
- Brenna Norton-Baker
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
  - BOTTLE Consortium, Golden, CO, USA
  - Agile BioFoundry, Emeryville, CA, USA
- Mackenzie C R Denton
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
  - BOTTLE Consortium, Golden, CO, USA
- Natasha P Murphy
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
  - BOTTLE Consortium, Golden, CO, USA
- Benjamin Fram
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Samuel Lim
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Erika Erickson
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
  - BOTTLE Consortium, Golden, CO, USA
- Nicholas P Gauthier
  - Department of Systems Biology, Harvard Medical School, Boston, MA, USA
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Gregg T Beckham
  - Renewable Resources and Enabling Sciences Center, National Renewable Energy Laboratory, Golden, CO, USA
  - BOTTLE Consortium, Golden, CO, USA
  - Agile BioFoundry, Emeryville, CA, USA
30
Martínez Gascueña A, Wu H, Wang R, Owen CD, Hernando PJ, Monaco S, Penner M, Xing K, Le Gall G, Gardner R, Ndeh D, Urbanowicz PA, Spencer DIR, Walsh M, Angulo J, Juge N. Exploring the sequence-function space of microbial fucosidases. Commun Chem 2024; 7:137. [PMID: 38890439 PMCID: PMC11189522 DOI: 10.1038/s42004-024-01212-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Accepted: 05/28/2024] [Indexed: 06/20/2024] Open
Abstract
Microbial α-L-fucosidases catalyse the hydrolysis of terminal α-L-fucosidic linkages and can perform transglycosylation reactions. Based on sequence identity, α-L-fucosidases are classified in glycoside hydrolases (GHs) families of the carbohydrate-active enzyme database. Here we explored the sequence-function space of GH29 fucosidases. Based on sequence similarity network (SSN) analyses, 15 GH29 α-L-fucosidases were selected for functional characterisation. HPAEC-PAD and LC-FD-MS/MS analyses revealed substrate and linkage specificities for α1,2, α1,3, α1,4 and α1,6 linked fucosylated oligosaccharides and glycoconjugates, consistent with their SSN clustering. The structural basis for the substrate specificity of GH29 fucosidase from Bifidobacterium asteroides towards α1,6 linkages and FA2G2 N-glycan was determined by X-ray crystallography and STD NMR. The capacity of GH29 fucosidases to carry out transfucosylation reactions with GlcNAc and 3FN as acceptors was evaluated by TLC combined with ESI-MS and NMR. These experimental data supported the use of SSN to further explore the GH29 sequence-function space through machine-learning models. Our lightweight protein language models could accurately allocate test sequences in their respective SSN clusters and assign 34,258 non-redundant GH29 sequences into SSN clusters. It is expected that the combination of these computational approaches will be used in the future for the identification of novel GHs with desired specificities.
Affiliation(s)
- Ana Martínez Gascueña
  - The Gut Microbes and Health Institute Strategic Programme, Quadram Institute Bioscience, Norwich Research Park, Norwich, NR4 7UQ, UK
- Haiyang Wu
  - The Gut Microbes and Health Institute Strategic Programme, Quadram Institute Bioscience, Norwich Research Park, Norwich, NR4 7UQ, UK
  - GuangDong Engineering Technology Research Center of Enzyme and Biocatalysis, Institute of Biological and Medical Engineering, Guangdong Academy of Sciences, Guangzhou, China
- Rui Wang
  - Beijing Key Lab of Traffic Data Analysis and Mining, Beijing Jiaotong University, Beijing, China
  - Collaborative Innovation Center of Railway Traffic Safety, Beijing Jiaotong University, Beijing, China
  - School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
- C David Owen
  - Diamond Light Source Ltd, Diamond House, Harwell Science and Innovation Campus, Didcot, OX11 0FA, UK
  - Research Complex at Harwell, Rutherford Appleton Laboratory, Harwell Oxford, Didcot, OX11 0FA, UK
- Pedro J Hernando
  - The Gut Microbes and Health Institute Strategic Programme, Quadram Institute Bioscience, Norwich Research Park, Norwich, NR4 7UQ, UK
  - Iceni Glycoscience Ltd., Norwich Research Park, Norwich, NR4 7JG, UK
- Serena Monaco
  - School of Pharmacy, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK
- Matthew Penner
  - Diamond Light Source Ltd, Diamond House, Harwell Science and Innovation Campus, Didcot, OX11 0FA, UK
  - Research Complex at Harwell, Rutherford Appleton Laboratory, Harwell Oxford, Didcot, OX11 0FA, UK
- Ke Xing
  - School of Computer and Information Technology, Beijing Jiaotong University, Beijing, China
- Gwenaelle Le Gall
  - Norwich Medical School, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK
- Didier Ndeh
  - The Gut Microbes and Health Institute Strategic Programme, Quadram Institute Bioscience, Norwich Research Park, Norwich, NR4 7UQ, UK
  - University of Dundee, School of Life Sciences, Dundee, DD1 5EH, Scotland, UK
- Martin Walsh
  - Diamond Light Source Ltd, Diamond House, Harwell Science and Innovation Campus, Didcot, OX11 0FA, UK
  - Research Complex at Harwell, Rutherford Appleton Laboratory, Harwell Oxford, Didcot, OX11 0FA, UK
- Jesus Angulo
  - School of Pharmacy, University of East Anglia, Norwich Research Park, Norwich, NR4 7TJ, UK
  - Departamento de Química Orgánica, Universidad de Sevilla, 41012, Sevilla, Spain
  - Instituto de Investigaciones Químicas (CSIC-US), 41092, Sevilla, Spain
- Nathalie Juge
  - The Gut Microbes and Health Institute Strategic Programme, Quadram Institute Bioscience, Norwich Research Park, Norwich, NR4 7UQ, UK
31
Fooladi H, Hirte S, Kirchmair J. Quantifying the Hardness of Bioactivity Prediction Tasks for Transfer Learning. J Chem Inf Model 2024; 64:4031-4046. [PMID: 38739465 PMCID: PMC11134514 DOI: 10.1021/acs.jcim.4c00160] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/29/2024] [Revised: 04/24/2024] [Accepted: 04/24/2024] [Indexed: 05/16/2024]
Abstract
Today, machine learning methods are widely employed in drug discovery. However, the chronic lack of data continues to hamper their further development, validation, and application. Several modern strategies aim to mitigate the challenges associated with data scarcity by learning from data on related tasks. These knowledge-sharing approaches encompass transfer learning, multitask learning, and meta-learning. A key question remaining to be answered for these approaches is about the extent to which their performance can benefit from the relatedness of available source (training) tasks; in other words, how difficult ("hard") a test task is to a model, given the available source tasks. This study introduces a new method for quantifying and predicting the hardness of a bioactivity prediction task based on its relation to the available training tasks. The approach involves the generation of protein and chemical representations and the calculation of distances between the bioactivity prediction task and the available training tasks. In the example of meta-learning on the FS-Mol data set, we demonstrate that the proposed task hardness metric is inversely correlated with performance (Pearson's correlation coefficient r = -0.72). The metric will be useful in estimating the task-specific gain in performance that can be achieved through meta-learning.
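The abstract reports an inverse relationship between task hardness and performance as a Pearson correlation coefficient. As a hedged sketch of how such a coefficient is computed, the snippet below implements Pearson's r from its definition. The function name and the toy numbers are assumptions for illustration and are not data from the study.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation: covariance of x and y divided by the product of their spreads."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (spread_x * spread_y)


# Made-up per-task values: a hardness score and a model performance score.
hardness = [0.1, 0.4, 0.5, 0.9]
performance = [0.80, 0.70, 0.72, 0.55]
r = pearson_r(hardness, performance)
```

A value of r close to -1 means that tasks scored as harder tend to show lower performance, which is the direction of the relationship described in the abstract.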
Affiliation(s)
- Hosein Fooladi
  - Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria
  - Christian Doppler Laboratory for Molecular Informatics in the Biosciences, Department for Pharmaceutical Sciences, University of Vienna, 1090 Vienna, Austria
  - Vienna Doctoral School of Pharmaceutical, Nutritional and Sport Sciences (PhaNuSpo), University of Vienna, 1090 Vienna, Austria
- Steffen Hirte
  - Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria
  - Vienna Doctoral School of Pharmaceutical, Nutritional and Sport Sciences (PhaNuSpo), University of Vienna, 1090 Vienna, Austria
- Johannes Kirchmair
  - Department of Pharmaceutical Sciences, Division of Pharmaceutical Chemistry, Faculty of Life Sciences, University of Vienna, Josef-Holaubek-Platz 2, 1090 Vienna, Austria
  - Christian Doppler Laboratory for Molecular Informatics in the Biosciences, Department for Pharmaceutical Sciences, University of Vienna, 1090 Vienna, Austria
32
Leary AY, Scott D, Gupta NT, Waite JC, Skokos D, Atwal GS, Hawkins PG. Designing meaningful continuous representations of T cell receptor sequences with deep generative models. Nat Commun 2024; 15:4271. [PMID: 38769289 PMCID: PMC11106309 DOI: 10.1038/s41467-024-48198-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2023] [Accepted: 04/24/2024] [Indexed: 05/22/2024] Open
Abstract
T Cell Receptor (TCR) antigen binding underlies a key mechanism of the adaptive immune response yet the vast diversity of TCRs and the complexity of protein interactions limits our ability to build useful low dimensional representations of TCRs. To address the current limitations in TCR analysis we develop a capacity-controlled disentangling variational autoencoder trained using a dataset of approximately 100 million TCR sequences, that we name TCR-VALID. We design TCR-VALID such that the model representations are low-dimensional, continuous, disentangled, and sufficiently informative to provide high-quality TCR sequence de novo generation. We thoroughly quantify these properties of the representations, providing a framework for future protein representation learning in low dimensions. The continuity of TCR-VALID representations allows fast and accurate TCR clustering and is benchmarked against other state-of-the-art TCR clustering tools and pre-trained language models.
Affiliation(s)
- Allen Y Leary
  - Regeneron Pharmaceuticals Inc., 777 Old Saw Mill River Road, Tarrytown, NY, 10591, USA
- Darius Scott
  - Regeneron Pharmaceuticals Inc., 777 Old Saw Mill River Road, Tarrytown, NY, 10591, USA
- Namita T Gupta
  - Regeneron Pharmaceuticals Inc., 777 Old Saw Mill River Road, Tarrytown, NY, 10591, USA
- Janelle C Waite
  - Regeneron Pharmaceuticals Inc., 777 Old Saw Mill River Road, Tarrytown, NY, 10591, USA
- Dimitris Skokos
  - Regeneron Pharmaceuticals Inc., 777 Old Saw Mill River Road, Tarrytown, NY, 10591, USA
- Gurinder S Atwal
  - Regeneron Pharmaceuticals Inc., 777 Old Saw Mill River Road, Tarrytown, NY, 10591, USA
- Peter G Hawkins
  - Regeneron Pharmaceuticals Inc., 777 Old Saw Mill River Road, Tarrytown, NY, 10591, USA
33
García Sánchez N, Ugarte Carro E, Prieto-Santamaría L, Rodríguez-González A. Protein sequence analysis in the context of drug repurposing. BMC Med Inform Decis Mak 2024; 24:122. [PMID: 38741115 DOI: 10.1186/s12911-024-02531-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2023] [Accepted: 05/08/2024] [Indexed: 05/16/2024] Open
Abstract
MOTIVATION Drug repurposing speeds up the development of new treatments, being less costly, risky, and time-consuming than de novo drug discovery. There are numerous biological elements that contribute to the development of diseases and, as a result, to the repurposing of drugs. METHODS In this article, we analysed the potential role of protein sequences in drug repurposing scenarios. For this purpose, we embedded the protein sequences using four state-of-the-art methods and validated their capacity to encapsulate essential biological information through visualization. Then, we compared the differences in sequence distance between protein-drug target pairs in drug repurposing and non-drug-repurposing data. Thus, we were able to uncover patterns that define protein sequences in repurposing cases. RESULTS We found statistically significant sequence distance differences between protein pairs in the repurposing data and the remaining protein pairs in the non-repurposing data. In this manner, we verified the potential of using numerical representations of sequences to generate repurposing hypotheses in the future.
Affiliation(s)
- Natalia García Sánchez
  - Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain
- Esther Ugarte Carro
  - Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain
- Lucía Prieto-Santamaría
  - Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain
  - ETS de Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, 28660, Spain
- Alejandro Rodríguez-González
  - Centro de Tecnología Biomédica, Universidad Politécnica de Madrid, Pozuelo de Alarcón, Madrid, 28223, Spain
  - ETS de Ingenieros Informáticos, Universidad Politécnica de Madrid, Boadilla del Monte, Madrid, 28660, Spain
34
Hu F, Zhang W, Huang H, Li W, Li Y, Yin P. A Transferability-Based Method for Evaluating the Protein Representation Learning. IEEE J Biomed Health Inform 2024; 28:3158-3166. [PMID: 38416611 DOI: 10.1109/jbhi.2024.3370680] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/01/2024]
Abstract
Self-supervised pre-trained language models have recently risen as a powerful approach in learning protein representations, showing exceptional effectiveness in various biological tasks, such as drug discovery. Amidst the evolving trend in protein language model development, there is an observable shift towards employing large-scale multimodal and multitask models. However, the predominant reliance on empirical assessments using specific benchmark datasets for evaluating these models raises concerns about the comprehensiveness and efficiency of current evaluation methods. Addressing this gap, our study introduces a novel quantitative approach for estimating the performance of transferring multi-task pre-trained protein representations to downstream tasks. This transferability-based method is designed to quantify the similarities in latent space distributions between pre-trained features and those fine-tuned for downstream tasks. It encompasses a broad spectrum, covering multiple domains and a variety of heterogeneous tasks. To validate this method, we constructed a diverse set of protein-specific pre-training tasks. The resulting protein representations were then evaluated across several downstream biological tasks. Our experimental results demonstrate a robust correlation between the transferability scores obtained using our method and the actual transfer performance observed. This significant correlation highlights the potential of our method as a more comprehensive and efficient tool for evaluating protein representation learning.
|
35
|
Michael R, Kæstel-Hansen J, Mørch Groth P, Bartels S, Salomon J, Tian P, Hatzakis NS, Boomsma W. A systematic analysis of regression models for protein engineering. PLoS Comput Biol 2024; 20:e1012061. [PMID: 38701099 PMCID: PMC11095727 DOI: 10.1371/journal.pcbi.1012061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Revised: 05/15/2024] [Accepted: 04/10/2024] [Indexed: 05/05/2024] Open
Abstract
To optimize proteins for particular traits holds great promise for industrial and pharmaceutical purposes. Machine Learning is increasingly applied in this field to predict properties of proteins, thereby guiding the experimental optimization process. A natural question is: How much progress are we making with such predictions, and how important is the choice of regressor and representation? In this paper, we demonstrate that different assessment criteria for regressor performance can lead to dramatically different conclusions, depending on the choice of metric, and how one defines generalization. We highlight the fundamental issues of sample bias in typical regression scenarios and how this can lead to misleading conclusions about regressor performance. Finally, we make the case for the importance of calibrated uncertainty in this domain.
Affiliation(s)
- Richard Michael
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
| | | | - Peter Mørch Groth
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
- Enzyme Research, Novozymes A/S, Kongens Lyngby, Denmark
| | - Simon Bartels
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
| | | | - Pengfei Tian
- Enzyme Research, Novozymes A/S, Kongens Lyngby, Denmark
| | - Nikos S. Hatzakis
- Department of Chemistry, University of Copenhagen, Copenhagen, Denmark
| | - Wouter Boomsma
- Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
|
36
|
Vitale R, Bugnon LA, Fenoy EL, Milone DH, Stegmayer G. Evaluating large language models for annotating proteins. Brief Bioinform 2024; 25:bbae177. [PMID: 38706315 PMCID: PMC11070647 DOI: 10.1093/bib/bbae177] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2023] [Revised: 03/15/2024] [Accepted: 03/27/2024] [Indexed: 05/07/2024] Open
Abstract
To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, however at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small and annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the actual prediction of protein domain annotations. Results are significantly better than state-of-the-art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repo. Full source code and data are available at https://github.com/sinc-lab/llm4pfam.
Affiliation(s)
- Rosario Vitale
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
| | - Leandro A Bugnon
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
| | - Emilio Luis Fenoy
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
| | - Diego H Milone
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
| | - Georgina Stegmayer
- Research Institute for Signals, Systems and Computational Intelligence sinc(i) (CONICET-UNL), Ciudad Universitaria, Santa Fe, Argentina
|
37
|
Chu HY, Fong JHC, Thean DGL, Zhou P, Fung FKC, Huang Y, Wong ASL. Accurate top protein variant discovery via low-N pick-and-validate machine learning. Cell Syst 2024; 15:193-203.e6. [PMID: 38340729 DOI: 10.1016/j.cels.2024.01.002] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 03/21/2023] [Revised: 10/11/2023] [Accepted: 01/18/2024] [Indexed: 02/12/2024]
Abstract
A strategy to obtain the greatest number of best-performing variants with the least amount of experimental effort over the vast combinatorial mutational landscape would have enormous utility in boosting resource producibility for protein engineering. Toward this goal, we present a simple and effective machine learning-based strategy that outperforms other state-of-the-art methods. Our strategy integrates zero-shot prediction and multi-round sampling to direct active learning via experimenting with only a few predicted top variants. We find that four rounds of low-N pick-and-validate sampling of 12 variants for machine learning yielded the best accuracy of up to 92.6% in selecting the true top 1% variants in combinatorial mutant libraries, whereas two rounds of 24 variants can also be used. We demonstrate our strategy in successfully discovering high-performance protein variants from diverse families including the CRISPR-based genome editors, supporting its generalizable application for solving protein engineering tasks. A record of this paper's transparent peer review process is included in the supplemental information.
Affiliation(s)
- Hoi Yee Chu
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - John H C Fong
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
| | - Dawn G L Thean
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
| | - Peng Zhou
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - Frederic K C Fung
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
| | - Yuanhua Huang
- School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Department of Statistics and Actuarial Science, The University of Hong Kong, Pokfulam, Hong Kong SAR, China
| | - Alan S L Wong
- Laboratory of Combinatorial Genetics and Synthetic Biology, School of Biomedical Sciences, The University of Hong Kong, Pokfulam, Hong Kong SAR, China; Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China.
|
38
|
Wang M, Patsenker J, Li H, Kluger Y, Kleinstein S. Language model-based B cell receptor sequence embeddings can effectively encode receptor specificity. Nucleic Acids Res 2024; 52:548-557. [PMID: 38109302 PMCID: PMC10810273 DOI: 10.1093/nar/gkad1128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2023] [Revised: 10/18/2023] [Accepted: 11/11/2023] [Indexed: 12/20/2023] Open
Abstract
High throughput sequencing of B cell receptors (BCRs) is increasingly applied to study the immense diversity of antibodies. Learning biologically meaningful embeddings of BCR sequences is beneficial for predictive modeling. Several embedding methods have been developed for BCRs, but no direct performance benchmarking exists. Moreover, the impact of the input sequence length and paired-chain information on the prediction remains to be explored. We evaluated the performance of multiple embedding models to predict BCR sequence properties and receptor specificity. Despite the differences in model architectures, most embeddings effectively capture BCR sequence properties and specificity. BCR-specific embeddings slightly outperform general protein language models in predicting specificity. In addition, incorporating full-length heavy chains and paired light chain sequences improves the prediction performance of all embeddings. This study provides insights into the properties of BCR embeddings to improve downstream prediction applications for antibody analysis and discovery.
Affiliation(s)
- Meng Wang
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
| | | | - Henry Li
- Program in Applied Mathematics, Yale University, New Haven, CT, USA
| | - Yuval Kluger
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Program in Applied Mathematics, Yale University, New Haven, CT, USA
- Department of Pathology, Yale School of Medicine, New Haven, CT, USA
| | - Steven H Kleinstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Department of Pathology, Yale School of Medicine, New Haven, CT, USA
- Department of Immunobiology, Yale School of Medicine, New Haven, CT, USA
|
39
|
Bravi B. Development and use of machine learning algorithms in vaccine target selection. NPJ Vaccines 2024; 9:15. [PMID: 38242890 PMCID: PMC10798987 DOI: 10.1038/s41541-023-00795-8] [Citation(s) in RCA: 22] [Impact Index Per Article: 22.0] [Received: 08/04/2023] [Accepted: 12/07/2023] [Indexed: 01/21/2024] Open
Abstract
Computer-aided discovery of vaccine targets has become a cornerstone of rational vaccine design. In this article, I discuss how Machine Learning (ML) can inform and guide key computational steps in rational vaccine design concerned with the identification of B and T cell epitopes and correlates of protection. I provide examples of ML models, as well as types of data and predictions for which they are built. I argue that interpretable ML has the potential to improve the identification of immunogens also as a tool for scientific discovery, by helping elucidate the molecular processes underlying vaccine-induced immune responses. I outline the limitations and challenges in terms of data availability and method development that need to be addressed to bridge the gap between advances in ML predictions and their translational application to vaccine design.
Affiliation(s)
- Barbara Bravi
- Department of Mathematics, Imperial College London, London, SW7 2AZ, UK.
|
40
|
James JK, Norland K, Johar AS, Kullo IJ. Deep generative models of LDLR protein structure to predict variant pathogenicity. J Lipid Res 2023; 64:100455. [PMID: 37821076 PMCID: PMC10696256 DOI: 10.1016/j.jlr.2023.100455] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2023] [Revised: 09/16/2023] [Accepted: 10/05/2023] [Indexed: 10/13/2023] Open
Abstract
The complex structure and function of low density lipoprotein receptor (LDLR) makes classification of protein-coding missense variants challenging. Deep generative models, including Evolutionary model of Variant Effect (EVE), Evolutionary Scale Modeling (ESM), and AlphaFold 2 (AF2), have enabled significant progress in the prediction of protein structure and function. ESM and EVE directly estimate the likelihood of a variant sequence but are purely data-driven and challenging to interpret. AF2 predicts LDLR structures, but variant effects are explicitly modeled by estimating changes in stability. We tested the effectiveness of these models for predicting variant pathogenicity compared to established methods. AF2 produced two distinct conformations based on a novel hinge mechanism. Within ESM's hidden space, benign and pathogenic variants had different distributions. In EVE, these distributions were similar. EVE and ESM were comparable to Polyphen-2, SIFT, REVEL, and Primate AI for predicting binary classifications in ClinVar. However, they were more strongly correlated with experimental measures of LDL uptake. AF2 poorly performed in these tasks. Using the UK Biobank to compare association with clinical phenotypes, ESM and EVE were more strongly associated with serum LDL-C than Polyphen-2. ESM was able to identify variants with more extreme LDL-C levels than EVE and had a significantly stronger association with atherosclerotic cardiovascular disease. In conclusion, AF2 predicted LDLR structures do not accurately model variant pathogenicity. ESM and EVE are competitive with prior scoring methods for prediction based on binary classifications in ClinVar but are superior based on correlations with experimental assays and clinical phenotypes.
Affiliation(s)
- Jose K James
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
| | - Kristjan Norland
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
| | - Angad S Johar
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
| | - Iftikhar J Kullo
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA; Gonda Vascular Center, Mayo Clinic, Rochester, MN, USA.
|
41
|
Xie WJ, Warshel A. Harnessing generative AI to decode enzyme catalysis and evolution for enhanced engineering. Natl Sci Rev 2023; 10:nwad331. [PMID: 38299119 PMCID: PMC10829072 DOI: 10.1093/nsr/nwad331] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/04/2023] [Revised: 09/27/2023] [Accepted: 10/13/2023] [Indexed: 02/02/2024] Open
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. Generative models could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, catalytic activity and stability, rationalizing the laboratory evolution of de novo enzymes, and decoding protein sequence semantics and their application in enzyme engineering. Notably, the prediction of catalytic activity and stability of enzymes using natural protein sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Affiliation(s)
- Wen Jun Xie
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development, Genetics Institute, University of Florida, Gainesville, FL 32610, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA
|
42
|
Kouba P, Kohout P, Haddadi F, Bushuiev A, Samusevich R, Sedlar J, Damborsky J, Pluskal T, Sivic J, Mazurenko S. Machine Learning-Guided Protein Engineering. ACS Catal 2023; 13:13863-13895. [PMID: 37942269 PMCID: PMC10629210 DOI: 10.1021/acscatal.3c02743] [Citation(s) in RCA: 45] [Impact Index Per Article: 22.5] [Received: 06/15/2023] [Revised: 09/20/2023] [Indexed: 11/10/2023]
Abstract
Recent progress in engineering highly promising biocatalysts has increasingly involved machine learning methods. These methods leverage existing experimental and simulation data to aid in the discovery and annotation of promising enzymes, as well as in suggesting beneficial mutations for improving known targets. The field of machine learning for protein engineering is gathering steam, driven by recent success stories and notable progress in other areas. It already encompasses ambitious tasks such as understanding and predicting protein structure and function, catalytic efficiency, enantioselectivity, protein dynamics, stability, solubility, aggregation, and more. Nonetheless, the field is still evolving, with many challenges to overcome and questions to address. In this Perspective, we provide an overview of ongoing trends in this domain, highlight recent case studies, and examine the current limitations of machine learning-based methods. We emphasize the crucial importance of thorough experimental validation of emerging models before their use for rational protein design. We present our opinions on the fundamental problems and outline the potential directions for future research.
Affiliation(s)
- Petr Kouba
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Faculty of Electrical Engineering, Czech Technical University in Prague, Technicka 2, 166 27 Prague 6, Czech Republic
| | - Pavel Kohout
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Faraneh Haddadi
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Anton Bushuiev
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Raman Samusevich
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
- Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
| | - Jiri Sedlar
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Jiri Damborsky
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
| | - Tomas Pluskal
- Institute of Organic Chemistry and Biochemistry of the Czech Academy of Sciences, Flemingovo nám. 2, 160 00 Prague 6, Czech Republic
| | - Josef Sivic
- Czech Institute of Informatics, Robotics and Cybernetics, Czech Technical University in Prague, Jugoslavskych partyzanu 1580/3, 160 00 Prague 6, Czech Republic
| | - Stanislav Mazurenko
- Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic
- International Clinical Research Center, St. Anne’s University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
|
43
|
Markus B, C GC, Andreas K, Arkadij K, Stefan L, Gustav O, Elina S, Radka S. Accelerating Biocatalysis Discovery with Machine Learning: A Paradigm Shift in Enzyme Engineering, Discovery, and Design. ACS Catal 2023; 13:14454-14469. [PMID: 37942268 PMCID: PMC10629211 DOI: 10.1021/acscatal.3c03417] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 07/25/2023] [Revised: 09/29/2023] [Accepted: 10/03/2023] [Indexed: 11/10/2023]
Abstract
Emerging computational tools promise to revolutionize protein engineering for biocatalytic applications and accelerate the development timelines previously needed to optimize an enzyme to its more efficient variant. For over a decade, the benefits of predictive algorithms have helped scientists and engineers navigate the complexity of functional protein sequence space. More recently, spurred by dramatic advances in underlying computational tools, the promise of faster, cheaper, and more accurate enzyme identification, characterization, and engineering has catapulted terms such as artificial intelligence and machine learning to the must-have vocabulary in the field. This Perspective aims to showcase the current status of applications in pharmaceutical industry and also to discuss and celebrate the innovative approaches in protein science by highlighting their potential in selected recent developments and offering thoughts on future opportunities for biocatalysis. It also critically assesses the technology's limitations, unanswered questions, and unmet challenges.
Affiliation(s)
- Braun Markus
- Department of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010 Graz, Austria
| | - Gruber Christian C
- Enzyme and Drug Discovery, Innophore. 1700 Montgomery Street, San Francisco, California 94111, United States
| | - Krassnigg Andreas
- Enzyme and Drug Discovery, Innophore. 1700 Montgomery Street, San Francisco, California 94111, United States
| | - Kummer Arkadij
- Moderna, Inc., 200 Technology Square, Cambridge, Massachusetts 02139, United States
| | - Lutz Stefan
- Codexis Inc., 200 Penobscot Drive, Redwood City, California 94063, United States
| | - Oberdorfer Gustav
- Department of Biochemistry, Graz University of Technology, Petersgasse 12/2, 8010 Graz, Austria
| | - Siirola Elina
- Novartis Institute for Biomedical Research, Global Discovery Chemistry, Basel CH-4108, Switzerland
| | - Snajdrova Radka
- Novartis Institute for Biomedical Research, Global Discovery Chemistry, Basel CH-4108, Switzerland
|
44
|
Xie WJ, Warshel A. Harnessing Generative AI to Decode Enzyme Catalysis and Evolution for Enhanced Engineering. bioRxiv 2023:2023.10.10.561808. [PMID: 37873334 PMCID: PMC10592750 DOI: 10.1101/2023.10.10.561808] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/25/2023]
Abstract
Enzymes, as paramount protein catalysts, occupy a central role in fostering remarkable progress across numerous fields. However, the intricacy of sequence-function relationships continues to obscure our grasp of enzyme behaviors and curtails our capabilities in rational enzyme engineering. Generative artificial intelligence (AI), known for its proficiency in handling intricate data distributions, holds the potential to offer novel perspectives in enzyme research. By applying generative models, we could discern elusive patterns within the vast sequence space and uncover new functional enzyme sequences. This review highlights the recent advancements in employing generative AI for enzyme sequence analysis. We delve into the impact of generative AI in predicting mutation effects on enzyme fitness, activity, and stability, rationalizing the laboratory evolution of de novo enzymes, decoding protein sequence semantics, and its applications in enzyme engineering. Notably, the prediction of enzyme activity and stability using natural enzyme sequences serves as a vital link, indicating how enzyme catalysis shapes enzyme evolution. Overall, we foresee that the integration of generative AI into enzyme studies will remarkably enhance our knowledge of enzymes and expedite the creation of superior biocatalysts.
Affiliation(s)
- Wen Jun Xie
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
- Department of Medicinal Chemistry, Center for Natural Products, Drug Discovery and Development (CNPD3), Genetics Institute, University of Florida, Gainesville, FL, USA
| | - Arieh Warshel
- Department of Chemistry, University of Southern California, Los Angeles, CA, USA
|
45
|
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. Brief Bioinform 2023; 24:bbad289. [PMID: 37580175 PMCID: PMC10516362 DOI: 10.1093/bib/bbad289] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 05/08/2023] [Revised: 07/14/2023] [Accepted: 07/26/2023] [Indexed: 08/16/2023] Open
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, 48824 MI, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824 MI, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824 MI, USA
|
46
|
Randall JR, Vieira LC, Wilke CO, Davies BW. Deep mutational scanning and machine learning uncover antimicrobial peptide features driving membrane selectivity. bioRxiv 2023:2023.07.28.551017. [PMID: 37547010 PMCID: PMC10402124 DOI: 10.1101/2023.07.28.551017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
Antimicrobial peptides commonly act by disrupting bacterial membranes, but also frequently damage mammalian membranes. Deciphering the rules governing membrane selectivity is critical to understanding their function and enabling their therapeutic use. Past attempts to decipher these rules have failed because they cannot interrogate adequate peptide sequence variation. To overcome this problem, we develop deep mutational surface localized antimicrobial display (dmSLAY), which reveals comprehensive positional residue importance and flexibility across an antimicrobial peptide sequence. We apply dmSLAY to Protegrin-1, a potent yet toxic antimicrobial peptide, and identify thousands of sequence variants that positively or negatively influence its antibacterial activity. Further analysis reveals that avoiding large aromatic residues and eliminating disulfide bound cysteine pairs while maintaining membrane bound secondary structure greatly improves Protegrin-1 bacterial specificity. Moreover, dmSLAY datasets enable machine learning to expand our analysis to include over 5.7 million sequence variants and reveal full Protegrin-1 mutational profiles driving either bacterial or mammalian membrane specificity. Our results describe an innovative, high-throughput approach for elucidating antimicrobial peptide sequence-structure-function relationships which can inform synthetic peptide-based drug design.
Affiliation(s)
- Justin R. Randall
- Department of Molecular Biosciences, University of Texas at Austin, Austin, Texas 78712
| | - Luiz C. Vieira
- Department of Integrative Biology, University of Texas at Austin; Austin, Texas, 78712
| | - Claus O. Wilke
- Department of Integrative Biology, University of Texas at Austin; Austin, Texas, 78712
| | - Bryan W. Davies
- Department of Molecular Biosciences, University of Texas at Austin, Austin, Texas 78712
|
47
|
Koludarov I, Senoner T, Jackson TNW, Dashevsky D, Heinzinger M, Aird SD, Rost B. Domain loss enabled evolution of novel functions in the snake three-finger toxin gene superfamily. Nat Commun 2023; 14:4861. [PMID: 37567881 PMCID: PMC10421932 DOI: 10.1038/s41467-023-40550-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 12/15/2022] [Accepted: 07/28/2023] [Indexed: 08/13/2023] Open
Abstract
Three-finger toxins (3FTXs) are a functionally diverse family of toxins, apparently unique to venoms of caenophidian snakes. Although the ancestral function of 3FTXs is antagonism of nicotinic acetylcholine receptors, redundancy conferred by the accumulation of duplicate genes has facilitated extensive neofunctionalization, such that derived members of the family interact with a range of targets. 3FTXs are members of the LY6/UPAR family, but their non-toxin ancestor remains unknown. Combining traditional phylogenetic approaches, manual synteny analysis, and machine learning techniques (including AlphaFold2 and ProtT5), we have reconstructed a detailed evolutionary history of 3FTXs. We identify their immediate ancestor as a non-secretory LY6, unique to squamate reptiles, and propose that changes in molecular ecology resulting from loss of a membrane-anchoring domain and changes in gene expression, paved the way for the evolution of one of the most important families of snake toxins.
Affiliation(s)
- Ivan Koludarov
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology-i12, Boltzmannstr. 3, 85748, Garching/Munich, Germany.
| | - Tobias Senoner
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology-i12, Boltzmannstr. 3, 85748, Garching/Munich, Germany
| | - Timothy N W Jackson
- Australian Venom Research Unit, Department of Biochemistry and Pharmacology, University of Melbourne, Melbourne, VIC, Australia
| | - Daniel Dashevsky
- Australian National Insect Collection, Commonwealth Scientific & Industrial Research Organisation, Canberra, ACT, Australia
| | - Michael Heinzinger
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology-i12, Boltzmannstr. 3, 85748, Garching/Munich, Germany
| | - Steven D Aird
- 7744-23 Hotaka Ariake, 399-8301, Azumino-shi, Nagano-ken, Japan
| | - Burkhard Rost
- TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology-i12, Boltzmannstr. 3, 85748, Garching/Munich, Germany
- Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, 85748, Garching/Munich, Germany
- TUM School of Life Sciences Weihenstephan (WZW), Alte Akademie 8, Freising, Germany
| |
|
48
|
Wang H, Fu T, Du Y, Gao W, Huang K, Liu Z, Chandak P, Liu S, Van Katwyk P, Deac A, Anandkumar A, Bergen K, Gomes CP, Ho S, Kohli P, Lasenby J, Leskovec J, Liu TY, Manrai A, Marks D, Ramsundar B, Song L, Sun J, Tang J, Veličković P, Welling M, Zhang L, Coley CW, Bengio Y, Zitnik M. Scientific discovery in the age of artificial intelligence. Nature 2023; 620:47-60. [PMID: 37532811 DOI: 10.1038/s41586-023-06221-2] [Citation(s) in RCA: 271] [Impact Index Per Article: 135.5] [Received: 03/30/2022] [Accepted: 05/16/2023] [Indexed: 08/04/2023]
Abstract
Artificial intelligence (AI) is being increasingly integrated into scientific discovery to augment and accelerate research, helping scientists to generate hypotheses, design experiments, collect and interpret large datasets, and gain insights that might not have been possible using traditional scientific methods alone. Here we examine breakthroughs over the past decade that include self-supervised learning, which allows models to be trained on vast amounts of unlabelled data, and geometric deep learning, which leverages knowledge about the structure of scientific data to enhance model accuracy and efficiency. Generative AI methods can create designs, such as small-molecule drugs and proteins, by analysing diverse data modalities, including images and sequences. We discuss how these methods can help scientists throughout the scientific process and the central issues that remain despite such advances. Both developers and users of AI tools need a better understanding of when such approaches need improvement, and challenges posed by poor data quality and stewardship remain. These issues cut across scientific disciplines and require developing foundational algorithmic approaches that can contribute to scientific understanding or acquire it autonomously, making them critical areas of focus for AI innovation.
Affiliation(s)
- Hanchen Wang
- Department of Engineering, University of Cambridge, Cambridge, UK
- Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
- Department of Research and Early Development, Genentech Inc, South San Francisco, CA, USA
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Tianfan Fu
- Department of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
| | - Yuanqi Du
- Department of Computer Science, Cornell University, Ithaca, NY, USA
| | - Wenhao Gao
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Kexin Huang
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Ziming Liu
- Department of Physics, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Payal Chandak
- Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA, USA
| | - Shengchao Liu
- Mila - Quebec AI Institute, Montreal, Quebec, Canada
- Université de Montréal, Montreal, Quebec, Canada
| | - Peter Van Katwyk
- Department of Earth, Environmental and Planetary Sciences, Brown University, Providence, RI, USA
- Data Science Institute, Brown University, Providence, RI, USA
| | - Andreea Deac
- Mila - Quebec AI Institute, Montreal, Quebec, Canada
- Université de Montréal, Montreal, Quebec, Canada
| | - Anima Anandkumar
- Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA
- NVIDIA, Santa Clara, CA, USA
| | - Karianne Bergen
- Department of Earth, Environmental and Planetary Sciences, Brown University, Providence, RI, USA
- Data Science Institute, Brown University, Providence, RI, USA
| | - Carla P Gomes
- Department of Computer Science, Cornell University, Ithaca, NY, USA
| | - Shirley Ho
- Center for Computational Astrophysics, Flatiron Institute, New York, NY, USA
- Department of Astrophysical Sciences, Princeton University, Princeton, NJ, USA
- Department of Physics, Carnegie Mellon University, Pittsburgh, PA, USA
- Department of Physics and Center for Data Science, New York University, New York, NY, USA
| | | | - Joan Lasenby
- Department of Engineering, University of Cambridge, Cambridge, UK
| | - Jure Leskovec
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | | | - Arjun Manrai
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Debora Marks
- Department of Systems Biology, Harvard Medical School, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | | | - Le Song
- BioMap, Beijing, China
- Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates
| | - Jimeng Sun
- University of Illinois at Urbana-Champaign, Champaign, IL, USA
| | - Jian Tang
- Mila - Quebec AI Institute, Montreal, Quebec, Canada
- HEC Montréal, Montreal, Quebec, Canada
- CIFAR AI Chair, Toronto, Ontario, Canada
| | - Petar Veličković
- Google DeepMind, London, UK
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK
| | - Max Welling
- University of Amsterdam, Amsterdam, Netherlands
- Microsoft Research Amsterdam, Amsterdam, Netherlands
| | - Linfeng Zhang
- DP Technology, Beijing, China
- AI for Science Institute, Beijing, China
| | - Connor W Coley
- Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Yoshua Bengio
- Mila - Quebec AI Institute, Montreal, Quebec, Canada
- Université de Montréal, Montreal, Quebec, Canada
| | - Marinka Zitnik
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Harvard Data Science Initiative, Cambridge, MA, USA.
- Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University, Cambridge, MA, USA.
| |
|
49
|
Qiu Y, Wei GW. Artificial intelligence-aided protein engineering: from topological data analysis to deep protein language models. ARXIV 2023:arXiv:2307.14587v1. [PMID: 37547662 PMCID: PMC10402185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/08/2023]
Abstract
Protein engineering is an emerging field in biotechnology that has the potential to revolutionize various areas, such as antibody design, drug discovery, food security, ecology, and more. However, the mutational space involved is too vast to be handled through experimental means alone. Leveraging accumulative protein databases, machine learning (ML) models, particularly those based on natural language processing (NLP), have considerably expedited protein engineering. Moreover, advances in topological data analysis (TDA) and artificial intelligence-based protein structure prediction, such as AlphaFold2, have made more powerful structure-based ML-assisted protein engineering strategies possible. This review aims to offer a comprehensive, systematic, and indispensable set of methodological components, including TDA and NLP, for protein engineering and to facilitate their future development.
Affiliation(s)
- Yuchi Qiu
- Department of Mathematics, Michigan State University, East Lansing, 48824, MI, USA
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, 48824, MI, USA
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, 48824, MI, USA
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, 48824, MI, USA
| |
|
50
|
Saar KL, Qian D, Good LL, Morgunov AS, Collepardo-Guevara R, Best RB, Knowles TPJ. Theoretical and Data-Driven Approaches for Biomolecular Condensates. Chem Rev 2023; 123:8988-9009. [PMID: 37171907 PMCID: PMC10375482 DOI: 10.1021/acs.chemrev.2c00586] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Received: 08/23/2022] [Indexed: 05/14/2023]
Abstract
Biomolecular condensation processes are increasingly recognized as a fundamental mechanism that living cells use to organize biomolecules in time and space. These processes can lead to the formation of membraneless organelles that enable cells to perform distinct biochemical processes in controlled local environments, thereby supplying them with an additional degree of spatial control relative to that achieved by membrane-bound organelles. This fundamental importance of biomolecular condensation has motivated a quest to discover and understand the molecular mechanisms and determinants that drive and control this process. Within this molecular viewpoint, computational methods can provide a unique angle to studying biomolecular condensation processes by contributing the resolution and scale that are challenging to reach with experimental techniques alone. In this Review, we focus on three types of dry-lab approaches: theoretical methods, physics-driven simulations and data-driven machine learning methods. We review recent progress in using these tools for probing biomolecular condensation across all three fields and outline the key advantages and limitations of each of the approaches. We further discuss some of the key outstanding challenges that we foresee the community addressing next in order to develop a more complete picture of the molecular driving forces behind biomolecular condensation processes and their biological roles in health and disease.
Affiliation(s)
- Kadi L. Saar
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
- Transition Bio Ltd., Cambridge, United Kingdom
| | - Daoyuan Qian
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
| | - Lydia L. Good
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, United States
| | - Alexey S. Morgunov
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
| | - Rosana Collepardo-Guevara
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
- Department of Genetics, University of Cambridge, Cambridge CB2 3EH, United Kingdom
| | - Robert B. Best
- Laboratory of Chemical Physics, National Institute of Diabetes and Digestive and Kidney Diseases, National Institutes of Health, Bethesda, Maryland 20892, United States
| | - Tuomas P. J. Knowles
- Yusuf Hamied Department of Chemistry, University of Cambridge, Cambridge CB2 1EW, United Kingdom
- Cavendish Laboratory, Department of Physics, University of Cambridge, Cambridge CB3 0HE, United Kingdom
| |
|