1
Li B, Luo S, Wang W, Xu J, Liu D, Shameem M, Mattila J, Franklin MC, Hawkins PG, Atwal GS. PROPERMAB: an integrative framework for in silico prediction of antibody developability using machine learning. MAbs 2025; 17:2474521. [PMID: 40042626 PMCID: PMC11901398 DOI: 10.1080/19420862.2025.2474521] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2024] [Revised: 02/25/2025] [Accepted: 02/26/2025] [Indexed: 03/14/2025] Open
Abstract
Selection of lead therapeutic molecules is often driven predominantly by pharmacological efficacy and safety. Candidate developability, such as biophysical properties that affect the formulation of the molecule into a product, is usually evaluated only toward the end of the drug development pipeline. The ability to evaluate developability properties early in the process of antibody therapeutic development could accelerate the timeline from discovery to clinic and save considerable resources. In silico predictive approaches, such as machine learning models, which map molecular features to predictions of developability properties could offer a cost-effective and high-throughput alternative to experiments for antibody developability assessment. We developed a computational framework, PROPERMAB (PROPERties of Monoclonal AntiBodies), for large-scale and efficient in silico prediction of developability properties for monoclonal antibodies, using custom molecular features and machine learning modeling. We demonstrate the power of PROPERMAB by using it to develop models to predict antibody hydrophobic interaction chromatography retention time and high-concentration viscosity. We further show that structure-derived features can be rapidly and accurately predicted directly from sequences by pre-training simple models for molecular features, thus providing the ability to scale these approaches to repertoire-scale sequence datasets.
Affiliation(s)
- Bian Li
  - Therapeutic Proteins, Regeneron Pharmaceuticals, Inc, Tarrytown, NY, USA
- Shukun Luo
  - Formulation Development, Regeneron Pharmaceuticals, Inc, Tarrytown, NY, USA
- Wenhua Wang
  - Formulation Development, Regeneron Pharmaceuticals, Inc, Tarrytown, NY, USA
- Jiahui Xu
  - Formulation Development, Regeneron Pharmaceuticals, Inc, Tarrytown, NY, USA
- Dingjiang Liu
  - Formulation Development, Regeneron Pharmaceuticals, Inc, Tarrytown, NY, USA
- Mohammed Shameem
  - Formulation Development, Regeneron Pharmaceuticals, Inc, Tarrytown, NY, USA
- John Mattila
  - Preclinical Manufacturing and Process Development, Regeneron Pharmaceuticals, Inc, Tarrytown, NY, USA
- Peter G. Hawkins
  - Molecular Profiling and Data Science, Regeneron Pharmaceuticals, Inc, Tarrytown, NY, USA
- Gurinder S. Atwal
  - Molecular Profiling and Data Science, Regeneron Pharmaceuticals, Inc, Tarrytown, NY, USA
2
Bjerregaard A, Groth PM, Hauberg S, Krogh A, Boomsma W. Foundation models of protein sequences: A brief overview. Curr Opin Struct Biol 2025; 91:103004. [PMID: 39983412 DOI: 10.1016/j.sbi.2025.103004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/16/2024] [Revised: 01/24/2025] [Accepted: 01/26/2025] [Indexed: 02/23/2025]
Abstract
Protein sequence models have evolved from simple statistics of aligned families to versatile foundation models of evolutionary scale. Enabled by self-supervised learning and an abundance of protein sequence data, such foundation models now play a central role in protein science. They facilitate rich representations, powerful generative design, and fine-tuning across diverse domains. In this review, we trace modeling developments and categorize them into methodological trends over the modalities they describe and the contexts they condition upon. Following a brief historical overview, we focus our attention on the most recent trends and outline future perspectives.
Affiliation(s)
- Andreas Bjerregaard
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Peter Mørch Groth
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Novonesis, Kgs. Lyngby, Denmark
- Søren Hauberg
  - Section for Cognitive Systems, Technical University of Denmark, Kgs. Lyngby, Denmark
- Anders Krogh
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Wouter Boomsma
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark
3
Allman BE, Vieira L, Diaz DJ, Wilke CO. A systematic evaluation of the language-of-viral-escape model using multiple machine learning frameworks. J R Soc Interface 2025; 22:20240598. [PMID: 40300635 PMCID: PMC12040448 DOI: 10.1098/rsif.2024.0598] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2024] [Revised: 01/02/2025] [Accepted: 02/18/2025] [Indexed: 05/01/2025] Open
Abstract
Predicting the evolutionary patterns of emerging and endemic viruses is key for mitigating their spread. In particular, it is critical to rapidly identify mutations with the potential for immune escape or increased disease burden. Knowing which circulating mutations pose a concern can inform treatment or mitigation strategies such as alternative vaccines or targeted social distancing. In 2021, Hie et al. (Hie B, Zhong ED, Berger B, Bryson B. 2021 Learning the language of viral evolution and escape. Science 371, 284-288; doi:10.1126/science.abd7331) proposed that variants of concern can be identified using two quantities extracted from protein language models, grammaticality and semantic change. These quantities are defined by analogy to concepts from natural language processing. Grammaticality is intended to be a measure of whether a variant viral protein is viable, and semantic change is intended to be a measure of potential for immune escape. Here, we systematically test this hypothesis, taking advantage of several high-throughput datasets that have become available, and also comparing this model with several more recently published machine learning models. We find that grammaticality can be a measure of protein viability, though methods that are trained explicitly to predict mutational effects appear to be more effective. By contrast, we do not find compelling evidence that semantic change is a useful tool for identifying immune escape mutations.
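As a rough illustration of the two quantities named in this abstract, the Python sketch below assumes that semantic change is measured as the L1 distance between language-model embeddings of the wild-type and mutant sequences, and that grammaticality is the probability the model assigns to the mutant residue at the mutated position. The function names, the three-dimensional embeddings, and the probability table are hypothetical toy values, not output from any model evaluated in the paper.

```python
def semantic_change(z_wildtype, z_mutant):
    """L1 distance between two embedding vectors (assumed definition)."""
    if len(z_wildtype) != len(z_mutant):
        raise ValueError("embeddings must have the same dimension")
    return sum(abs(a - b) for a, b in zip(z_wildtype, z_mutant))


def grammaticality(position_probs, mutant_residue):
    """Model probability of the mutant residue at the mutated site (assumed definition)."""
    return position_probs[mutant_residue]


# Hypothetical toy embeddings; a real protein language model would supply these.
z_wildtype = [0.2, -0.1, 0.5]
z_mutant = [0.0, 0.3, 0.5]

# Hypothetical per-residue probabilities at the mutated position.
position_probs = {"A": 0.60, "G": 0.25, "V": 0.15}

change = semantic_change(z_wildtype, z_mutant)
gram = grammaticality(position_probs, "G")
print(f"semantic change = {change:.2f}, grammaticality = {gram:.2f}")
```

Under the original hypothesis, a candidate escape mutation would score high on both quantities; the abstract reports that only the grammaticality half of that picture held up in testing.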
Affiliation(s)
- Brent E. Allman
  - Integrative Biology, The University of Texas at Austin, Austin, Texas, USA
- Luiz Vieira
  - Integrative Biology, The University of Texas at Austin, Austin, Texas, USA
- Daniel J. Diaz
  - Institute for Foundations of Machine Learning, The University of Texas at Austin, Austin, Texas, USA
- Claus O. Wilke
  - Integrative Biology, The University of Texas at Austin, Austin, Texas, USA
4
Powers AC, Renfrew PD, Hosseinzadeh P, Mulligan VK. CyclicCAE: A conformational autoencoder for efficient heterochiral macrocyclic backbone sampling. bioRxiv 2025:2025.02.21.639569. [PMID: 40060652 PMCID: PMC11888347 DOI: 10.1101/2025.02.21.639569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/20/2025]
Abstract
Macrocycles are a promising therapeutic class. The incorporation of heterochiral and non-natural chemical building-blocks presents challenges for rational design, however. With no existing machine learning methods tailored for heterochiral macrocycle design, we developed a novel convolutional autoencoder model to rapidly generate energetically favorable macrocycle backbones for heterochiral design and structure prediction. Our approach surpasses the current state-of-the-art method, Generalized Kinematic loop closure (GenKIC) in the Rosetta software suite. Given the absence of large, available macrocycle datasets, we created a custom dataset in-house and in silico. Our model, CyclicCAE, produces energetically stable backbones and designable structures more rapidly than GenKIC. It enables users to perform energy minimization, generate structurally similar or diverse inputs via MCMC, and conduct inpainting with fixed anchors or motifs. We propose that this novel method will accelerate the development of stable macrocycles, speeding up macrocycle drug design pipelines.
Affiliation(s)
- Andrew C. Powers
  - Department of Bioengineering, University of Oregon, Eugene, Oregon
- P. Douglas Renfrew
  - Center for Computational Biology, Flatiron Institute, New York, New York
5
Ertelt M, Moretti R, Meiler J, Schoeder CT. Self-supervised machine learning methods for protein design improve sampling but not the identification of high-fitness variants. Sci Adv 2025; 11:eadr7338. [PMID: 39937901 PMCID: PMC11817935 DOI: 10.1126/sciadv.adr7338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2024] [Accepted: 01/10/2025] [Indexed: 02/14/2025]
Abstract
Machine learning (ML) is changing the world of computational protein design, with data-driven methods surpassing biophysical-based methods in experimental success. However, they are most often reported as case studies, lack integration and standardization, and are therefore hard to objectively compare. In this study, we established a streamlined and diverse toolbox for methods that predict amino acid probabilities inside the Rosetta software framework that allows for the side-by-side comparison of these models. Subsequently, existing protein fitness landscapes were used to benchmark novel ML methods in realistic protein design settings. We focused on the traditional problems of protein design: sampling and scoring. A major finding of our study is that ML approaches are better at purging the sampling space from deleterious mutations. Nevertheless, scoring resulting mutations without model fine-tuning showed no clear improvement over scoring with Rosetta. We conclude that ML now complements, rather than replaces, biophysical methods in protein design.
Affiliation(s)
- Moritz Ertelt
  - Institute for Drug Discovery, Leipzig University Faculty of Medicine, Leipzig, Germany
  - Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Dresden, Germany
- Rocco Moretti
  - Department of Chemistry, Vanderbilt University, Nashville, TN, USA
  - Center for Structural Biology, Vanderbilt University, Nashville, TN, USA
- Jens Meiler
  - Institute for Drug Discovery, Leipzig University Faculty of Medicine, Leipzig, Germany
  - Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Dresden, Germany
  - Department of Chemistry, Vanderbilt University, Nashville, TN, USA
  - Center for Structural Biology, Vanderbilt University, Nashville, TN, USA
- Clara T. Schoeder
  - Institute for Drug Discovery, Leipzig University Faculty of Medicine, Leipzig, Germany
  - Center for Scalable Data Analytics and Artificial Intelligence ScaDS.AI, Dresden/Leipzig, Dresden, Germany
6
Harding-Larsen D, Funk J, Madsen NG, Gharabli H, Acevedo-Rocha CG, Mazurenko S, Welner DH. Protein representations: Encoding biological information for machine learning in biocatalysis. Biotechnol Adv 2024; 77:108459. [PMID: 39366493 DOI: 10.1016/j.biotechadv.2024.108459] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Revised: 09/19/2024] [Accepted: 09/29/2024] [Indexed: 10/06/2024]
Abstract
Enzymes offer a more environmentally friendly and low-impact solution to conventional chemistry, but they often require additional engineering for their application in industrial settings, an endeavour that is challenging and laborious. To address this issue, the power of machine learning can be harnessed to produce predictive models that enable the in silico study and engineering of improved enzymatic properties. Such machine learning models, however, require the conversion of the complex biological information to a numerical input, also called protein representations. These inputs demand special attention to ensure the training of accurate and precise models, and, in this review, we therefore examine the critical step of encoding protein information to numeric representations for use in machine learning. We selected the most important approaches for encoding the three distinct biological protein representations - primary sequence, 3D structure, and dynamics - to explore their requirements for employment and inductive biases. Combined representations of proteins and substrates are also introduced as emergent tools in biocatalysis. We propose the division of fixed representations, a collection of rule-based encoding strategies, and learned representations extracted from the latent spaces of large neural networks. To select the most suitable protein representation, we propose two main factors to consider. The first one is the model setup, which is influenced by the size of the training dataset and the choice of architecture. The second factor is the model objectives such as consideration about the assayed property, the difference between wild-type models and mutant predictors, and requirements for explainability. This review is aimed at serving as a source of information and guidance for properly representing enzymes in future machine learning models for biocatalysis.
Affiliation(s)
- David Harding-Larsen
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Jonathan Funk
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Niklas Gesmar Madsen
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Hani Gharabli
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Carlos G Acevedo-Rocha
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
- Stanislav Mazurenko
  - Loschmidt Laboratories, Department of Experimental Biology and RECETOX, Faculty of Science, Masaryk University, Kamenice 5, 625 00 Brno, Czech Republic; International Clinical Research Center, St. Anne's University Hospital Brno, Pekarska 53, 656 91 Brno, Czech Republic
- Ditte Hededam Welner
  - The Novo Nordisk Center for Biosustainability, Technical University of Denmark, Søltofts Plads, Bygning 220, 2800 Kgs. Lyngby, Denmark
7
Bakkers MJG, Ritschel T, Tiemessen M, Dijkman J, Zuffianò AA, Yu X, van Overveld D, Le L, Voorzaat R, van Haaren MM, de Man M, Tamara S, van der Fits L, Zahn R, Juraszek J, Langedijk JPM. Efficacious human metapneumovirus vaccine based on AI-guided engineering of a closed prefusion trimer. Nat Commun 2024; 15:6270. [PMID: 39054318 PMCID: PMC11272930 DOI: 10.1038/s41467-024-50659-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Accepted: 07/12/2024] [Indexed: 07/27/2024] Open
Abstract
The prefusion conformation of human metapneumovirus fusion protein (hMPV Pre-F) is critical for eliciting the most potent neutralizing antibodies and is the preferred immunogen for an efficacious vaccine against hMPV respiratory infections. Here we show that an additional cleavage event in the F protein allows closure and correct folding of the trimer. We therefore engineered the F protein to undergo double cleavage, which enabled screening for Pre-F stabilizing substitutions at the natively folded protomer interfaces. To identify these substitutions, we developed an AI convolutional classifier that successfully predicts complex polar interactions often overlooked by physics-based methods and visual inspection. The combination of additional processing, stabilization of interface regions and stabilization of the membrane-proximal stem, resulted in a Pre-F protein vaccine candidate without the need for a heterologous trimerization domain that exhibited high expression yields and thermostability. Cryo-EM analysis shows the complete ectodomain structure, including the stem, and a specific interaction of the newly identified cleaved C-terminus with the adjacent protomer. Importantly, the protein induces high and cross-neutralizing antibody responses resulting in near complete protection against hMPV challenge in cotton rats, making the highly stable, double-cleaved hMPV Pre-F trimer an attractive vaccine candidate.
Affiliation(s)
- Mark J G Bakkers
  - Janssen Vaccines & Prevention BV, Leiden, The Netherlands
  - ForgeBio B.V., Amsterdam, The Netherlands
- Tina Ritschel
  - Janssen Vaccines & Prevention BV, Leiden, The Netherlands
  - J&J Innovative Medicine Technology, R&D, New Brunswick, NJ, USA
- Jacobus Dijkman
  - Janssen Vaccines & Prevention BV, Leiden, The Netherlands
  - Van 't Hoff Institute for Molecular Sciences, University of Amsterdam, Amsterdam, The Netherlands
  - Amsterdam Machine Learning Lab, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
- Angelo A Zuffianò
  - Janssen Vaccines & Prevention BV, Leiden, The Netherlands
  - Promaton BV, Amsterdam, The Netherlands
- Xiaodi Yu
  - Structural & Protein Science, Janssen Research and Development, Spring House, PA, 19044, USA
- Lam Le
  - Janssen Vaccines & Prevention BV, Leiden, The Netherlands
- Martijn de Man
  - Janssen Vaccines & Prevention BV, Leiden, The Netherlands
- Sem Tamara
  - Janssen Vaccines & Prevention BV, Leiden, The Netherlands
- Roland Zahn
  - Janssen Vaccines & Prevention BV, Leiden, The Netherlands
- Jarek Juraszek
  - Janssen Vaccines & Prevention BV, Leiden, The Netherlands
- Johannes P M Langedijk
  - Janssen Vaccines & Prevention BV, Leiden, The Netherlands
  - ForgeBio B.V., Amsterdam, The Netherlands
8
Branda F, Scarpa F. Implications of Artificial Intelligence in Addressing Antimicrobial Resistance: Innovations, Global Challenges, and Healthcare's Future. Antibiotics (Basel) 2024; 13:502. [PMID: 38927169 PMCID: PMC11200959 DOI: 10.3390/antibiotics13060502] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2024] [Revised: 05/25/2024] [Accepted: 05/27/2024] [Indexed: 06/28/2024] Open
Abstract
Antibiotic resistance poses a significant threat to global public health due to complex interactions between bacterial genetic factors and external influences such as antibiotic misuse. Artificial intelligence (AI) offers innovative strategies to address this crisis. For example, AI can analyze genomic data to detect resistance markers early on, enabling early interventions. In addition, AI-powered decision support systems can optimize antibiotic use by recommending the most effective treatments based on patient data and local resistance patterns. AI can accelerate drug discovery by predicting the efficacy of new compounds and identifying potential antibacterial agents. Although progress has been made, challenges persist, including data quality, model interpretability, and real-world implementation. A multidisciplinary approach that integrates AI with other emerging technologies, such as synthetic biology and nanomedicine, could pave the way for effective prevention and mitigation of antimicrobial resistance, preserving the efficacy of antibiotics for future generations.
Affiliation(s)
- Francesco Branda
  - Unit of Medical Statistics and Molecular Epidemiology, Università Campus Bio-Medico di Roma, 00128 Rome, Italy
- Fabio Scarpa
  - Department of Biomedical Sciences, University of Sassari, 07100 Sassari, Italy
9
Goshisht MK. Machine Learning and Deep Learning in Synthetic Biology: Key Architectures, Applications, and Challenges. ACS Omega 2024; 9:9921-9945. [PMID: 38463314 PMCID: PMC10918679 DOI: 10.1021/acsomega.3c05913] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/11/2023] [Revised: 01/19/2024] [Accepted: 01/30/2024] [Indexed: 03/12/2024]
Abstract
Machine learning (ML), particularly deep learning (DL), has made rapid and substantial progress in synthetic biology in recent years. Biotechnological applications of biosystems, including pathways, enzymes, and whole cells, are being probed frequently with time. The intricacy and interconnectedness of biosystems make it challenging to design them with the desired properties. ML and DL have a synergy with synthetic biology. Synthetic biology can be employed to produce large data sets for training models (for instance, by utilizing DNA synthesis), and ML/DL models can be employed to inform design (for example, by generating new parts or advising unrivaled experiments to perform). This potential has recently been brought to light by research at the intersection of engineering biology and ML/DL through achievements like the design of novel biological components, best experimental design, automated analysis of microscopy data, protein structure prediction, and biomolecular implementations of ANNs (Artificial Neural Networks). I have divided this review into three sections. In the first section, I describe predictive potential and basics of ML along with myriad applications in synthetic biology, especially in engineering cells, activity of proteins, and metabolic pathways. In the second section, I describe fundamental DL architectures and their applications in synthetic biology. Finally, I describe different challenges causing hurdles in the progress of ML/DL and synthetic biology along with their solutions.
Affiliation(s)
- Manoj Kumar Goshisht
  - Department of Chemistry, Natural and Applied Sciences, University of Wisconsin-Green Bay, Green Bay, Wisconsin 54311-7001, United States
10
Parrilla-Gutiérrez JM, Granda JM, Ayme JF, Bajczyk MD, Wilbraham L, Cronin L. Electron density-based GPT for optimization and suggestion of host-guest binders. Nat Comput Sci 2024; 4:200-209. [PMID: 38459272 DOI: 10.1038/s43588-024-00602-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Received: 06/18/2023] [Accepted: 01/23/2024] [Indexed: 03/10/2024]
Abstract
Here we present a machine learning model trained on electron density for the production of host-guest binders. These are read out as simplified molecular-input line-entry system (SMILES) format with >98% accuracy, enabling a complete characterization of the molecules in two dimensions. Our model generates three-dimensional representations of the electron density and electrostatic potentials of host-guest systems using a variational autoencoder, and then utilizes these representations to optimize the generation of guests via gradient descent. Finally, the guests are converted to SMILES using a transformer. The successful practical application of our model to established molecular host systems, cucurbit[n]uril and metal-organic cages, resulted in the discovery of 9 previously validated guests for CB[6] and 7 unreported guests (with association constant Ka ranging from 13.5 M⁻¹ to 5,470 M⁻¹) and the discovery of 4 unreported guests for [Pd214]4+ (with Ka ranging from 44 M⁻¹ to 529 M⁻¹).
Affiliation(s)
- Juan M Parrilla-Gutiérrez
  - School of Chemistry, University of Glasgow, Glasgow, UK
  - School of Computing, Engineering and Built Environment, Glasgow Caledonian University, Glasgow, UK
- Jarosław M Granda
  - School of Chemistry, University of Glasgow, Glasgow, UK
  - Institute of Organic Chemistry, Polish Academy of Sciences, Warsaw, Poland
- Leroy Cronin
  - School of Chemistry, University of Glasgow, Glasgow, UK
11
Pun MN, Ivanov A, Bellamy Q, Montague Z, LaMont C, Bradley P, Otwinowski J, Nourmohammad A. Learning the shape of protein microenvironments with a holographic convolutional neural network. Proc Natl Acad Sci U S A 2024; 121:e2300838121. [PMID: 38300863 PMCID: PMC10861886 DOI: 10.1073/pnas.2300838121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2023] [Accepted: 11/29/2023] [Indexed: 02/03/2024] Open
Abstract
Proteins play a central role in biology from immune recognition to brain activity. While major advances in machine learning have improved our ability to predict protein structure from sequence, determining protein function from its sequence or structure remains a major challenge. Here, we introduce holographic convolutional neural network (H-CNN) for proteins, which is a physically motivated machine learning approach to model amino acid preferences in protein structures. H-CNN reflects physical interactions in a protein structure and recapitulates the functional information stored in evolutionary data. H-CNN accurately predicts the impact of mutations on protein stability and binding of protein complexes. Our interpretable computational model for protein structure-function maps could guide design of novel proteins with desired function.
Affiliation(s)
- Michael N. Pun
  - Department of Physics, University of Washington, Seattle, WA 98195
  - The Department for Statistical Physics of Evolving Systems, Max Planck Institute for Dynamics and Self-Organization, Göttingen 37077, Germany
- Andrew Ivanov
  - Department of Physics, University of Washington, Seattle, WA 98195
- Quinn Bellamy
  - Department of Physics, University of Washington, Seattle, WA 98195
- Zachary Montague
  - Department of Physics, University of Washington, Seattle, WA 98195
  - The Department for Statistical Physics of Evolving Systems, Max Planck Institute for Dynamics and Self-Organization, Göttingen 37077, Germany
- Colin LaMont
  - The Department for Statistical Physics of Evolving Systems, Max Planck Institute for Dynamics and Self-Organization, Göttingen 37077, Germany
- Philip Bradley
  - Fred Hutchinson Cancer Center, Seattle, WA 98102
  - Department of Biochemistry, University of Washington, Seattle, WA 98195
  - Institute for Protein Design, University of Washington, Seattle, WA 98195
- Jakub Otwinowski
  - The Department for Statistical Physics of Evolving Systems, Max Planck Institute for Dynamics and Self-Organization, Göttingen 37077, Germany
  - Dyno Therapeutics, Watertown, MA 02472
- Armita Nourmohammad
  - Department of Physics, University of Washington, Seattle, WA 98195
  - The Department for Statistical Physics of Evolving Systems, Max Planck Institute for Dynamics and Self-Organization, Göttingen 37077, Germany
  - Fred Hutchinson Cancer Center, Seattle, WA 98102
  - Department of Applied Mathematics, University of Washington, Seattle, WA 98105
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA 98195
12
Liu Y, Liu H. Protein sequence design on given backbones with deep learning. Protein Eng Des Sel 2024; 37:gzad024. [PMID: 38157313 DOI: 10.1093/protein/gzad024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/16/2023] [Revised: 12/08/2023] [Accepted: 12/18/2023] [Indexed: 01/03/2024] Open
Abstract
Deep learning methods for protein sequence design focus on modeling and sampling the many-dimensional distribution of amino acid sequences conditioned on the backbone structure. To produce physically foldable sequences, inter-residue couplings need to be considered properly. These couplings are treated explicitly in iterative methods or autoregressive methods. Non-autoregressive models treating these couplings implicitly are computationally more efficient, but still await tests by wet experiment. Currently, sequence design methods are evaluated mainly using native sequence recovery rate and native sequence perplexity. These metrics can be complemented by sequence-structure compatibility metrics obtained from energy calculation or structure prediction. However, existing computational metrics have important limitations that may render the generalization of computational test results to performance in real applications unwarranted. Validation of design methods by wet experiments should be encouraged.
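The two evaluation metrics named in this abstract can be sketched in a few lines of Python. This is a minimal illustration under common definitions, assumed here rather than taken from the paper: recovery is the fraction of positions where the designed residue equals the native one, and perplexity is the exponential of the mean negative log-probability a model assigns to the native residues. The function names, the toy sequences, and the uniform probability table are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def sequence_recovery(native, designed):
    """Fraction of positions where the designed residue matches the native residue."""
    if not native or len(native) != len(designed):
        raise ValueError("sequences must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


def native_perplexity(native, per_position_probs):
    """exp(mean negative log-probability) of the native residues under a model."""
    if not native or len(native) != len(per_position_probs):
        raise ValueError("need one probability table per native position")
    nll = -sum(math.log(probs[aa]) for aa, probs in zip(native, per_position_probs))
    return math.exp(nll / len(native))


# Hypothetical four-residue example.
native = "ACDE"
designed = "ACDF"

# A model that knows nothing assigns 1/20 to every amino acid at every position.
uniform = [{aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS} for _ in native]

recovery = sequence_recovery(native, designed)
ppl = native_perplexity(native, uniform)
print(f"recovery = {recovery:.2f}, perplexity = {ppl:.1f}")
```

The uniform model gives a useful reference point: its perplexity equals the alphabet size of 20, so values below that indicate the model has learned something about the native sequence.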
Affiliation(s)
- Yufeng Liu
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui 230027, China
- Haiyan Liu
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui 230027, China
  - Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui 230027, China
  - School of Biomedical Engineering, Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, Jiangsu 215004, China
13
Chowdhury J, Fricke C, Bamidele O, Bello M, Yang W, Heyden A, Terejanu G. Invariant Molecular Representations for Heterogeneous Catalysis. J Chem Inf Model 2024; 64:327-339. [PMID: 38197612 PMCID: PMC10806804 DOI: 10.1021/acs.jcim.3c00594] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2023] [Revised: 12/25/2023] [Accepted: 12/28/2023] [Indexed: 01/11/2024]
Abstract
Catalyst screening is a critical step in the discovery and development of heterogeneous catalysts, which are vital for a wide range of chemical processes. In recent years, computational catalyst screening, primarily through density functional theory (DFT), has gained significant attention as a method for identifying promising catalysts. However, the computation of adsorption energies for all likely chemical intermediates present in complex surface chemistries is computationally intensive and costly due to the expensive nature of these calculations and the intrinsic idiosyncrasies of the methods or data sets used. This study introduces a novel machine learning (ML) method to learn adsorption energies from multiple DFT functionals by using invariant molecular representations (IMRs). To do this, we first extract molecular fingerprints for the reaction intermediates and later use a Siamese-neural-network-based training strategy to learn invariant molecular representations or the IMR across all available functionals. Our Siamese network-based representations demonstrate superior performance in predicting adsorption energies compared with other molecular representations. Notably, when considering mean absolute values of adsorption energies as 0.43 eV (PBE-D3), 0.46 eV (BEEF-vdW), 0.81 eV (RPBE), and 0.37 eV (scan+rVV10), our IMR method has achieved the lowest mean absolute errors (MAEs) of 0.18, 0.10, 0.16, and 0.18 eV, respectively. These results emphasize the superior predictive capacity of our Siamese network-based representations. The empirical findings in this study illuminate the efficacy, robustness, and dependability of our proposed ML paradigm in predicting adsorption energies, specifically for propane dehydrogenation on a platinum catalyst surface.
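To put the error figures in this abstract on a common scale, the Python sketch below divides each reported MAE by the corresponding mean absolute adsorption energy. It assumes the four MAEs read 0.18, 0.10, 0.16, and 0.18 eV in the order the functionals are listed; the helper function and the dictionary layout are illustrative and are not part of the authors' method.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over paired values."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# (mean absolute adsorption energy in eV, reported MAE in eV) per DFT functional,
# transcribed from the abstract under the ordering assumption stated above.
reported = {
    "PBE-D3": (0.43, 0.18),
    "BEEF-vdW": (0.46, 0.10),
    "RPBE": (0.81, 0.16),
    "scan+rVV10": (0.37, 0.18),
}

# MAE expressed as a fraction of the typical energy magnitude for that functional.
relative_error = {name: mae / scale for name, (scale, mae) in reported.items()}

for name, ratio in relative_error.items():
    print(f"{name}: MAE is {ratio:.0%} of the mean absolute adsorption energy")
```

On these figures the error is smallest relative to the energy scale for RPBE (about 20%) and largest for scan+rVV10 (about 49%), which the raw MAE values alone do not show.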
Affiliation(s)
- Jawad Chowdhury
  - Department of Computer Science, University of North Carolina at Charlotte, Charlotte, North Carolina 28223, United States
- Charles Fricke
  - Department of Chemical Engineering, University of South Carolina, Columbia, South Carolina 29208, United States
- Olajide Bamidele
  - Department of Chemical Engineering, University of South Carolina, Columbia, South Carolina 29208, United States
- Mubarak Bello
  - Department of Chemical Engineering, University of South Carolina, Columbia, South Carolina 29208, United States
- Wenqiang Yang
  - Department of Chemical Engineering, University of South Carolina, Columbia, South Carolina 29208, United States
- Andreas Heyden
  - Department of Chemical Engineering, University of South Carolina, Columbia, South Carolina 29208, United States
- Gabriel Terejanu
  - Department of Computer Science, University of North Carolina at Charlotte, Charlotte, North Carolina 28223, United States
|
14
|
Ugurlu SY, McDonald D, Lei H, Jones AM, Li S, Tong HY, Butler MS, He S. Cobdock: an accurate and practical machine learning-based consensus blind docking method. J Cheminform 2024; 16:5. [PMID: 38212855 PMCID: PMC10785400 DOI: 10.1186/s13321-023-00793-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2023] [Accepted: 12/10/2023] [Indexed: 01/13/2024] Open
Abstract
Probing the surface of proteins to predict the binding site and binding affinity for a given small molecule is a critical but challenging task in drug discovery. Blind docking addresses this issue by performing docking on binding regions randomly sampled from the entire protein surface. However, compared with local docking, blind docking is less accurate and reliable because the docking space is too large to sample adequately. Cavity detection-guided blind docking methods improved the accuracy by using cavity detection (also known as binding site detection) tools to guide the docking procedure. However, it is worth noting that the performance of these methods heavily relies on the quality of the cavity detection tool. This constraint, namely the dependence on a single cavity detection tool, significantly impacts the overall performance of cavity detection-guided methods. To overcome this limitation, we proposed Consensus Blind Dock (CoBDock), a novel blind, parallel docking method that uses machine learning algorithms to integrate docking and cavity detection results to improve not only binding site identification but also pose prediction accuracy. Our experiments on several datasets, including PDBBind 2020, ADS, MTi, DUD-E, and CASF-2016, showed that CoBDock has better binding site and binding mode performance than other state-of-the-art cavity detector tools and blind docking methods.
Affiliation(s)
- Sadettin Y Ugurlu
  - School of Computer Science, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
- Huangshu Lei
  - YaoPharma Co. Ltd., 100 Xingguang Avenue, Renhe Town, Yubei District, Chongqing, 401121, People's Republic of China
- Alan M Jones
  - School of Pharmacy, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK
- Shu Li
  - Centre for Artificial Intelligence Driven Drug Discovery, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao, 5HV2+CP8, China
- Henry Y Tong
  - Centre for Artificial Intelligence Driven Drug Discovery, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao, 5HV2+CP8, China
- Shan He
  - School of Computer Science, University of Birmingham, Edgbaston, Birmingham, B15 2TT, UK.
  - AIA Insights Ltd, Birmingham, UK.
|
15
|
Rahman H, Khan AR, Sadiq T, Farooqi AH, Khan IU, Lim WH. A Systematic Literature Review of 3D Deep Learning Techniques in Computed Tomography Reconstruction. Tomography 2023; 9:2158-2189. [PMID: 38133073 PMCID: PMC10748093 DOI: 10.3390/tomography9060169] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2023] [Revised: 11/27/2023] [Accepted: 12/01/2023] [Indexed: 12/23/2023] Open
Abstract
Computed tomography (CT) is used in a wide range of medical imaging diagnoses. However, the reconstruction of CT images from raw projection data is inherently complex and is subject to artifacts and noise, which compromises image quality and accuracy. In order to address these challenges, deep learning developments have the potential to improve the reconstruction of computed tomography images. In this regard, our research aim is to determine the techniques that are used for 3D deep learning in CT reconstruction and to identify the training and validation datasets that are accessible. This research was performed on five databases. After a careful assessment of each record based on the objective and scope of the study, we selected 60 research articles for this review. This systematic literature review revealed that convolutional neural networks (CNNs), 3D convolutional neural networks (3D CNNs), and deep learning reconstruction (DLR) were the most suitable deep learning algorithms for CT reconstruction. Additionally, two major datasets appropriate for training and developing deep learning systems were identified: 2016 NIH-AAPM-Mayo and MSCT. These datasets are important resources for the creation and assessment of CT reconstruction models. According to the results, 3D deep learning may increase the effectiveness of CT image reconstruction, boost image quality, and lower radiation exposure. By using these deep learning approaches, CT image reconstruction may be made more precise and effective, improving patient outcomes, diagnostic accuracy, and healthcare system productivity.
Affiliation(s)
- Hameedur Rahman
  - Department of Computer Games Development, Faculty of Computing & AI, Air University, E9, Islamabad 44000, Pakistan
- Abdur Rehman Khan
  - Department of Creative Technologies, Faculty of Computing & AI, Air University, E9, Islamabad 44000, Pakistan
- Touseef Sadiq
  - Centre for Artificial Intelligence Research, Department of Information and Communication Technology, University of Agder, Jon Lilletuns vei 9, 4879 Grimstad, Norway
- Ashfaq Hussain Farooqi
  - Department of Computer Science, Faculty of Computing & AI, Air University, Islamabad 44000, Pakistan
- Inam Ullah Khan
  - Department of Electronic Engineering, School of Engineering & Applied Sciences (SEAS), Isra University, Islamabad Campus, Islamabad 44000, Pakistan
- Wei Hong Lim
  - Faculty of Engineering, Technology and Built Environment, UCSI University, Kuala Lumpur 56000, Malaysia
|
16
|
Yue T, Wang Y, Zhang L, Gu C, Xue H, Wang W, Lyu Q, Dun Y. Deep Learning for Genomics: From Early Neural Nets to Modern Large Language Models. Int J Mol Sci 2023; 24:15858. [PMID: 37958843 PMCID: PMC10649223 DOI: 10.3390/ijms242115858] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2023] [Revised: 10/24/2023] [Accepted: 10/30/2023] [Indexed: 11/15/2023] Open
Abstract
The data explosion driven by advancements in genomic research, such as high-throughput sequencing techniques, is constantly challenging conventional methods used in genomics. In parallel with the urgent demand for robust algorithms, deep learning has succeeded in various fields such as vision, speech, and text processing. Yet genomics poses unique challenges to deep learning, since we expect from deep learning a superhuman intelligence that explores beyond our knowledge to interpret the genome. A powerful deep learning model should rely on the insightful utilization of task-specific knowledge. In this paper, we briefly discuss the strengths of different deep learning models from a genomic perspective so as to fit each particular task with a proper deep learning-based architecture, and we remark on practical considerations of developing deep learning architectures for genomics. We also provide a concise review of deep learning applications in various aspects of genomic research and point out current challenges and potential research directions for future genomics applications. We believe the collaborative use of ever-growing diverse data and the fast iteration of deep learning models will continue to contribute to the future of genomics.
Affiliation(s)
- Tianwei Yue
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Yuanxin Wang
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Longxiang Zhang
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Chunming Gu
  - Department of Biomedical Engineering, School of Medicine, Johns Hopkins University, Baltimore, MD 21218, USA
- Haoru Xue
  - The Robotics Institute, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Wenping Wang
  - School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Qi Lyu
  - Department of Computational Mathematics, Science, and Engineering, Michigan State University, East Lansing, MI 48824, USA
- Yujie Dun
  - School of Information and Communications Engineering, Xi’an Jiaotong University, Xi’an 710049, China
|
17
|
Xiao B, Zhang C, Zhou J, Wang S, Meng H, Wu M, Zheng Y, Yu R. Design of SC PEP with enhanced stability against pepsin digestion and increased activity by machine learning and structural parameters modeling. Int J Biol Macromol 2023; 250:125933. [PMID: 37482154 DOI: 10.1016/j.ijbiomac.2023.125933] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2023] [Revised: 06/20/2023] [Accepted: 07/20/2023] [Indexed: 07/25/2023]
Abstract
Prolyl endopeptidase from Sphingomonas capsulata (SC PEP) has attracted much attention as a promising oral therapy candidate for celiac sprue; however, its low stability in the gastric environment leads to unsatisfactory clinical results. Therefore, improving its stability against pepsin digestion at low pH is crucial for clinical applications, but challenging. In this study, machine learning and a physical parameter model were combined to design SC PEP mutants. After iterations, 20 mutants had higher hydrolysis activity in the stomach environment, up to 14.1-fold that of wild-type SC PEP. Mutant M24, which involves stable and active mutations, and pegylated M24 (M24-PEG) had higher activity in hydrolyzing immunogen in bread than wild-type SC PEP in vitro and in vivo, and residual immunogens in the simulated gastric environment were only 1/8 and 1/10 of those in the wild-type SC PEP group. The total residual immunogens in the gastrointestinal tract of mice in the M24 and M24-PEG groups were <20 ppm, reaching the standard of non-toxic food. Our results indicate that the combination of M24 (or M24-PEG) with EP-B2 may be a promising candidate for celiac disease, and the strategies developed in this study provide a paradigm for the design of SC PEP stability mutants.
Affiliation(s)
- Bin Xiao
  - Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China
- Chun Zhang
  - Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China
- Junxiu Zhou
  - Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China
- Sa Wang
  - Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China
- Huan Meng
  - Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China
- Miao Wu
  - Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China
- Yongxiang Zheng
  - Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China.
- Rong Yu
  - Department of Biopharmaceutics, West China School of Pharmacy, Sichuan University, Chengdu 610041, PR China; Key Laboratory of Drug-Targeting and Drug Delivery System of the Education Ministry, Sichuan Engineering Laboratory for Plant-Sourced Drug and Sichuan Research Center for Drug Precision Industrial Technology, West China School of Pharmacy Sichuan University, Chengdu 610041, PR China.
|
18
|
Guo L, Qiu T, Wang J. ViTScore: A Novel Three-Dimensional Vision Transformer Method for Accurate Prediction of Protein-Ligand Docking Poses. IEEE Trans Nanobioscience 2023; 22:734-743. [PMID: 37159314 DOI: 10.1109/tnb.2023.3274640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/11/2023]
Abstract
Protein-ligand interactions (PLIs) are essential for cellular activities and drug discovery, and due to the complexity and high cost of experimental methods, there is a great demand for computational approaches, such as protein-ligand docking, to decipher PLI patterns. One of the most challenging aspects of protein-ligand docking is to identify near-native conformations from a set of poses, but traditional scoring functions still have limited accuracy. Therefore, new scoring methods are urgently needed for methodological and/or practical implications. We present a novel deep learning-based scoring function for ranking protein-ligand docking poses based on Vision Transformer (ViT), named ViTScore. To recognize near-native poses from a set of poses, ViTScore voxelizes the protein-ligand interactional pocket into a 3D grid labeled by the occupancy contribution of atoms in different physicochemical classes. This allows ViTScore to capture the subtle differences between spatially and energetically favorable near-native poses and unfavorable non-native poses without needing extra information. ViTScore then outputs a prediction of the root mean square deviation (RMSD) of a docking pose with reference to the native binding pose. ViTScore is extensively evaluated on diverse test sets including PDBbind2019 and CASF2016, and obtains significant improvements over existing methods in terms of RMSE, R and docking power. Moreover, the results indicate that ViTScore is a promising scoring function for protein-ligand docking that can be used to accurately identify near-native poses from a set of poses. Additionally, ViTScore can be used to identify potential drug targets and to design new drugs with improved efficacy and safety.
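The two ingredients named in this abstract, a voxelized pocket and an RMSD target, can be sketched in a few lines. This is a minimal illustration under stated assumptions, not ViTScore's implementation: the box size, resolution, atom-class labels, simple per-voxel atom counting, and the function names `voxelize_pocket` and `rmsd` are all hypothetical choices made here.

```python
import math
from collections import Counter


def voxelize_pocket(atoms, center, box_size=24.0, resolution=1.0):
    """Count atoms per (atom_class, i, j, k) voxel inside a cubic box.

    atoms: iterable of (x, y, z, atom_class) tuples; atom_class is a
    hypothetical physicochemical label such as "hydrophobic" or "donor".
    Returns a Counter, used here as a sparse stand-in for a dense
    multi-channel 3D grid. Atoms outside the box are ignored.
    """
    n = int(round(box_size / resolution))  # voxels per box edge
    origin = [c - box_size / 2.0 for c in center]  # lower corner of the box
    grid = Counter()
    for x, y, z, atom_class in atoms:
        idx = tuple(
            math.floor((p - o) / resolution) for p, o in zip((x, y, z), origin)
        )
        if all(0 <= i < n for i in idx):
            grid[(atom_class, *idx)] += 1
    return grid


def rmsd(pose, native):
    """Root mean square deviation between two equally ordered coordinate lists."""
    if len(pose) != len(native):
        raise ValueError("pose and native must contain the same number of atoms")
    squared = sum(
        (a - b) ** 2 for p, q in zip(pose, native) for a, b in zip(p, q)
    )
    return math.sqrt(squared / len(pose))


# Toy input: two atoms near the box center and one far outside it.
atoms = [
    (0.0, 0.0, 0.0, "hydrophobic"),
    (0.2, 0.1, 0.3, "donor"),
    (100.0, 0.0, 0.0, "donor"),
]
grid = voxelize_pocket(atoms, center=(0.0, 0.0, 0.0), box_size=4.0, resolution=1.0)
deviation = rmsd([(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)], [(0.0, 0.0, 0.0), (1.0, 2.0, 0.0)])
```

Keeping one channel per atom class is what lets a downstream model tell, for example, a hydrophobic contact from a hydrogen-bond donor at the same grid position; a real pipeline would typically use a dense array and a smoother occupancy function than raw counts.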
|
19
|
Sieg J, Rarey M. Searching similar local 3D micro-environments in protein structure databases with MicroMiner. Brief Bioinform 2023; 24:bbad357. [PMID: 37833838 DOI: 10.1093/bib/bbad357] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2023] [Revised: 08/28/2023] [Accepted: 09/18/2023] [Indexed: 10/15/2023] Open
Abstract
The available protein structure data are rapidly increasing. Within these structures, numerous local structural sites depict the details characterizing structure and function. However, searching and analyzing these sites extensively and at scale poses a challenge. We present a new method to search local sites in protein structure databases using residue-defined local 3D micro-environments. We implemented the method in a new tool called MicroMiner and demonstrate the capabilities of residue micro-environment search on the example of structural mutation analysis. Usually, experimental structures for both the wild-type and the mutant are unavailable for comparison. With MicroMiner, we extracted $>255 \times 10^{6}$ amino acid pairs in protein structures from the PDB, exemplifying single mutations' local structural changes for single chains and $>45 \times 10^{6}$ pairs for protein-protein interfaces. We further annotate existing data sets of experimentally measured mutation effects, like $\Delta \Delta G$ measurements, with the extracted structure pairs to combine the mutation effect measurement with the structural change upon mutation. In addition, we show how MicroMiner can bridge the gap between mutation analysis and structure-based drug design tools. MicroMiner is available as a command line tool and interactively on the https://proteins.plus/ webserver.
Affiliation(s)
- Jochen Sieg
  - Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
- Matthias Rarey
  - Universität Hamburg, ZBH - Center for Bioinformatics, Bundesstraße 43, 20146 Hamburg, Germany
|
20
|
Sinha K, Ghosh N, Sil PC. A Review on the Recent Applications of Deep Learning in Predictive Drug Toxicological Studies. Chem Res Toxicol 2023; 36:1174-1205. [PMID: 37561655 DOI: 10.1021/acs.chemrestox.2c00375] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/12/2023]
Abstract
Drug toxicity prediction is an important step in ensuring patient safety during drug design studies. While traditional preclinical studies have historically relied on animal models to evaluate toxicity, recent advances in deep-learning approaches have shown great promise in advancing drug safety science and reducing animal use in preclinical studies. However, deep-learning-based approaches also face challenges in handling large biological data sets, model interpretability, and regulatory acceptance. In this review, we provide an overview of recent developments in deep-learning-based approaches for predicting drug toxicity, highlighting their potential advantages over traditional methods and the need to address their limitations. Deep-learning models have demonstrated excellent performance in predicting toxicity outcomes from various data sources such as chemical structures, genomic data, and high-throughput screening assays. The potential of deep learning for automated feature engineering is also discussed. This review emphasizes the need to address ethical concerns related to the use of deep learning in drug toxicity studies, including the reduction of animal use and ensuring regulatory acceptance. Furthermore, emerging applications of deep learning in drug toxicity prediction, such as predicting drug-drug interactions and toxicity in rare subpopulations, are highlighted. The integration of deep-learning-based approaches with traditional methods is discussed as a way to develop more reliable and efficient predictive models for drug safety assessment, paving the way for safer and more effective drug discovery and development. Overall, this review highlights the critical role of deep learning in predictive toxicology and drug safety evaluation, emphasizing the need for continued research and development in this rapidly evolving field. 
By addressing the limitations of traditional methods, leveraging the potential of deep learning for automated feature engineering, and addressing ethical concerns, deep-learning-based approaches have the potential to revolutionize drug toxicity prediction and improve patient safety in drug discovery and development.
Affiliation(s)
- Krishnendu Sinha
  - Department of Zoology, Jhargram Raj College, Jhargram 721507, West Bengal, India
- Nabanita Ghosh
  - Department of Zoology, Maulana Azad College, Kolkata 700013, West Bengal, India
- Parames C Sil
  - Division of Molecular Medicine, Bose Institute, Kolkata 700054, West Bengal, India
|
21
|
Kulikova AV, Diaz DJ, Chen T, Cole TJ, Ellington AD, Wilke CO. Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry. Sci Rep 2023; 13:13280. [PMID: 37587128 PMCID: PMC10432456 DOI: 10.1038/s41598-023-40247-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2023] [Accepted: 08/07/2023] [Indexed: 08/18/2023] Open
Abstract
Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
Affiliation(s)
- Anastasiya V Kulikova
  - Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA
  - The Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, TX, USA
- Daniel J Diaz
  - Department of Chemistry, The University of Texas at Austin, Austin, TX, USA
  - The Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, TX, USA
  - Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA
- Tianlong Chen
  - Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- T Jeffrey Cole
  - Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA
- Andrew D Ellington
  - The Department of Molecular Biosciences, Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, TX, USA
- Claus O Wilke
  - Department of Integrative Biology, University of Texas at Austin, Austin, TX, USA.
|
22
|
Niazi SK, Mariam Z. Recent Advances in Machine-Learning-Based Chemoinformatics: A Comprehensive Review. Int J Mol Sci 2023; 24:11488. [PMID: 37511247 PMCID: PMC10380192 DOI: 10.3390/ijms241411488] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2023] [Revised: 06/30/2023] [Accepted: 07/12/2023] [Indexed: 07/30/2023] Open
Abstract
In modern drug discovery, the combination of chemoinformatics and quantitative structure-activity relationship (QSAR) modeling has emerged as a formidable alliance, enabling researchers to harness the vast potential of machine learning (ML) techniques for predictive molecular design and analysis. This review delves into the fundamental aspects of chemoinformatics, elucidating the intricate nature of chemical data and the crucial role of molecular descriptors in unveiling the underlying molecular properties. Molecular descriptors, including 2D fingerprints and topological indices, in conjunction with the structure-activity relationships (SARs), are pivotal in unlocking the pathway to small-molecule drug discovery. Technical intricacies of developing robust ML-QSAR models, including feature selection, model validation, and performance evaluation, are discussed herein. Various ML algorithms, such as regression analysis and support vector machines, are showcased in the text for their ability to predict and comprehend the relationships between molecular structures and biological activities. This review serves as a comprehensive guide for researchers, providing an understanding of the synergy between chemoinformatics, QSAR, and ML. By embracing these cutting-edge technologies, predictive molecular analysis holds promise for expediting the discovery of novel therapeutic agents in the pharmaceutical sciences.
Affiliation(s)
- Sarfaraz K Niazi
  - College of Pharmacy, University of Illinois, Chicago, IL 61820, USA
- Zamara Mariam
  - School of Interdisciplinary Engineering & Sciences (SINES), National University of Sciences & Technology (NUST), Islamabad 24090, Pakistan
|
23
|
Kulikova AV, Diaz DJ, Chen T, Jeffrey Cole T, Ellington AD, Wilke CO. Two sequence- and two structure-based ML models have learned different aspects of protein biochemistry. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.20.533508. [PMID: 36993648 PMCID: PMC10055221 DOI: 10.1101/2023.03.20.533508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 03/31/2023]
Abstract
Deep learning models are seeing increased use as methods to predict mutational effects or allowed mutations in proteins. The models commonly used for these purposes include large language models (LLMs) and 3D Convolutional Neural Networks (CNNs). These two model types have very different architectures and are commonly trained on different representations of proteins. LLMs make use of the transformer architecture and are trained purely on protein sequences whereas 3D CNNs are trained on voxelized representations of local protein structure. While comparable overall prediction accuracies have been reported for both types of models, it is not known to what extent these models make comparable specific predictions and/or generalize protein biochemistry in similar ways. Here, we perform a systematic comparison of two LLMs and two structure-based models (CNNs) and show that the different model types have distinct strengths and weaknesses. The overall prediction accuracies are largely uncorrelated between the sequence- and structure-based models. Overall, the two structure-based models are better at predicting buried aliphatic and hydrophobic residues whereas the two LLMs are better at predicting solvent-exposed polar and charged amino acids. Finally, we find that a combined model that takes the individual model predictions as input can leverage these individual model strengths and results in significantly improved overall prediction accuracy.
Affiliation(s)
- Anastasiya V. Kulikova
  - Department of Integrative Biology, University of Texas at Austin, Austin, Texas, USA
  - Center for Systems and Synthetic Biology, The Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
- Daniel J. Diaz
  - Department of Chemistry, The University of Texas at Austin, Austin, TX, USA
  - Center for Systems and Synthetic Biology, The Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
  - Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA
- Tianlong Chen
  - Institute for Foundations of Machine Learning (IFML), The University of Texas at Austin, Austin, TX, USA
  - Department of Electrical and Computer Engineering, The University of Texas at Austin, Austin, TX, USA
- T. Jeffrey Cole
  - Department of Integrative Biology, University of Texas at Austin, Austin, Texas, USA
- Andrew D. Ellington
  - Center for Systems and Synthetic Biology, The Department of Molecular Biosciences, The University of Texas at Austin, Austin, TX, USA
- Claus O. Wilke
  - Department of Integrative Biology, University of Texas at Austin, Austin, Texas, USA
|
24
|
Ramakrishnan G, Baakman C, Heijl S, Vroling B, van Horck R, Hiraki J, Xue LC, Huynen MA. Understanding structure-guided variant effect predictions using 3D convolutional neural networks. Front Mol Biosci 2023; 10:1204157. [PMID: 37475887 PMCID: PMC10354367 DOI: 10.3389/fmolb.2023.1204157] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2023] [Accepted: 06/22/2023] [Indexed: 07/22/2023] Open
Abstract
Predicting pathogenicity of missense variants in molecular diagnostics remains a challenge despite the available wealth of data, such as evolutionary information, and the wealth of tools to integrate that data. We describe DeepRank-Mut, a configurable framework designed to extract and learn from physicochemically relevant features of amino acids surrounding missense variants in 3D space. For each variant, various atomic and residue-level features are extracted from its structural environment, including sequence conservation scores of the surrounding amino acids, and stored in multi-channel 3D voxel grids which are then used to train a 3D convolutional neural network (3D-CNN). The resultant model gives a probabilistic estimate of whether a given input variant is disease-causing or benign. We find that the performance of our 3D-CNN model, on independent test datasets, is comparable to other widely used resources which also combine sequence and structural features. Based on the 10-fold cross-validation experiments, we achieve an average accuracy of 0.77 on the independent test datasets. We discuss the contribution of the variant neighborhood in the model's predictive power, in addition to the impact of individual features on the model's performance. Two key features, the evolutionary information of residues in the variant neighborhood and their solvent accessibilities, were observed to influence the predictions. We also highlight how predictions are impacted by the underlying disease mechanisms of missense mutations and offer insights into understanding these to improve pathogenicity predictions. Our study presents aspects to take into consideration when adopting deep learning approaches for protein structure-guided pathogenicity predictions.
Affiliation(s)
- Gayatri Ramakrishnan
  - Department of Medical Biosciences, Radboud University Medical Center, Nijmegen, Netherlands
- Coos Baakman
  - Department of Medical Biosciences, Radboud University Medical Center, Nijmegen, Netherlands
- Li C. Xue
  - Department of Medical Biosciences, Radboud University Medical Center, Nijmegen, Netherlands
- Martijn A. Huynen
  - Department of Medical Biosciences, Radboud University Medical Center, Nijmegen, Netherlands
|
25
|
Dürr SL, Levy A, Rothlisberger U. Metal3D: a general deep learning framework for accurate metal ion location prediction in proteins. Nat Commun 2023; 14:2713. [PMID: 37169763 PMCID: PMC10175565 DOI: 10.1038/s41467-023-37870-6] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 08/30/2022] [Accepted: 03/29/2023] [Indexed: 05/13/2023] Open
Abstract
Metal ions are essential cofactors for many proteins and play a crucial role in many applications such as enzyme design or design of protein-protein interactions because they are biologically abundant, tether to the protein using strong interactions, and have favorable catalytic properties. Computational design of metalloproteins is, however, hampered by the complex electronic structure of many biologically relevant metals such as zinc. In this work, we develop two tools, Metal3D (based on 3D convolutional neural networks) and Metal1D (solely based on geometric criteria), to improve the location prediction of zinc ions in protein structures. Comparison with other currently available tools shows that Metal3D is the most accurate zinc ion location predictor to date, with predictions within 0.70 ± 0.64 Å of experimental locations. Metal3D outputs a confidence metric for each predicted site and works on proteins with few homologs in the Protein Data Bank. Metal3D predicts a global zinc density that can be used for annotation of computationally predicted structures and a per-residue zinc density that can be used in protein design workflows. Currently trained on zinc, the framework of Metal3D is readily extensible to other metals by modifying the training data.
Affiliation(s)
- Simon L Dürr
  - Laboratory of Computational Chemistry and Biochemistry, Institute of Chemical Sciences and Engineering, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland
- Andrea Levy
  - Laboratory of Computational Chemistry and Biochemistry, Institute of Chemical Sciences and Engineering, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland
- Ursula Rothlisberger
  - Laboratory of Computational Chemistry and Biochemistry, Institute of Chemical Sciences and Engineering, Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland
|
26
|
Xu G, Wang Q, Ma J. OPUS-Mut: Studying the Effect of Protein Mutation through Side-Chain Modeling. J Chem Theory Comput 2023; 19:1629-1640. [PMID: 36813264 PMCID: PMC10018731 DOI: 10.1021/acs.jctc.2c00847] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Indexed: 02/24/2023]
Abstract
Predicting the effect of protein mutation is crucial in many applications such as protein design, protein evolution, and genetic disease analysis. Structurally, mutation is basically the replacement of the side chain of a particular residue. Therefore, accurate side-chain modeling is useful in studying the effect of mutation. Here, we propose a computational method, namely, OPUS-Mut, which significantly outperforms other backbone-dependent side-chain modeling methods including our previous method OPUS-Rota4. We evaluate OPUS-Mut by four case studies on Myoglobin, p53, HIV-1 protease, and T4 lysozyme. The results show that the predicted side-chain structures of different mutants are consistent with their experimentally determined results. In addition, when the residues with significant structural shifts upon mutation are considered, the extent of the predicted structural shift of these affected residues correlates reasonably well with the experimentally measured functional changes of the mutant. OPUS-Mut can also help to identify harmful and benign mutations and thus may guide the construction of a protein with relatively low sequence homology but with a similar structure.
Affiliation(s)
- Gang Xu
  - Multiscale Research Institute of Complex Systems, Fudan University, Shanghai 200433, China; Zhangjiang Fudan International Innovation Center, Fudan University, Shanghai 201210, China; Shanghai AI Laboratory, Shanghai 200030, China
- Qinghua Wang
  - Center for Biomolecular Innovation, Harcam Biomedicines, Shanghai 200131, China
- Jianpeng Ma
  - Multiscale Research Institute of Complex Systems, Fudan University, Shanghai 200433, China; Zhangjiang Fudan International Innovation Center, Fudan University, Shanghai 201210, China; Shanghai AI Laboratory, Shanghai 200030, China
|
27
|
Rappoport D, Jinich A. Enzyme Substrate Prediction from Three-Dimensional Feature Representations Using Space-Filling Curves. J Chem Inf Model 2023; 63:1637-1648. [PMID: 36802628 DOI: 10.1021/acs.jcim.3c00005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/22/2023]
Abstract
Compact and interpretable structural feature representations are required for accurately predicting properties and function of proteins. In this work, we construct and evaluate three-dimensional feature representations of protein structures based on space-filling curves (SFCs). We focus on the problem of enzyme substrate prediction, using two ubiquitous enzyme families as case studies: the short-chain dehydrogenase/reductases (SDRs) and the S-adenosylmethionine-dependent methyltransferases (SAM-MTases). Space-filling curves such as the Hilbert curve and the Morton curve generate a reversible mapping from discretized three-dimensional to one-dimensional representations and thus help to encode three-dimensional molecular structures in a system-independent way and with only a few adjustable parameters. Using three-dimensional structures of SDRs and SAM-MTases generated using AlphaFold2, we assess the performance of the SFC-based feature representations in predictions on a new benchmark database of enzyme classification tasks including their cofactor and substrate selectivity. Gradient-boosted tree classifiers yield binary prediction accuracy of 0.77-0.91 and area under curve (AUC) characteristics of 0.83-0.92 for the classification tasks. We investigate the effects of amino acid encoding, spatial orientation, and (the few) parameters of SFC-based encodings on the accuracy of the predictions. Our results suggest that geometry-based approaches such as SFCs are promising for generating protein structural representations and are complementary to the existing protein feature representations such as evolutionary scale modeling (ESM) sequence embeddings.
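The reversible 3D-to-1D mapping mentioned in the abstract can be illustrated with the Morton (Z-order) curve, which interleaves the bits of the three voxel indices. This is a generic sketch of the standard Morton encoding, not the authors' implementation; the function names, the bit ordering (x in the lowest bit), and the 10-bit default are assumptions made for this example.

```python
from typing import Tuple


def morton_encode(x: int, y: int, z: int, bits: int = 10) -> int:
    """Interleave the lowest `bits` bits of x, y, z into one integer."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (3 * i)      # x occupies bit 3i
        code |= ((y >> i) & 1) << (3 * i + 1)  # y occupies bit 3i + 1
        code |= ((z >> i) & 1) << (3 * i + 2)  # z occupies bit 3i + 2
    return code


def morton_decode(code: int, bits: int = 10) -> Tuple[int, int, int]:
    """Invert morton_encode by de-interleaving the bits."""
    x = y = z = 0
    for i in range(bits):
        x |= ((code >> (3 * i)) & 1) << i
        y |= ((code >> (3 * i + 1)) & 1) << i
        z |= ((code >> (3 * i + 2)) & 1) << i
    return x, y, z


# Order a few discretized voxel positions along the curve.
voxels = [(1, 1, 0), (0, 0, 1), (1, 0, 0)]
ordered = sorted(voxels, key=lambda v: morton_encode(*v))
```

Sorting occupied voxels by their Morton code yields a one-dimensional sequence in which voxels that are close on the curve are also close in space, which is what makes such encodings usable as input to sequence-style or tabular models. The Hilbert curve serves the same purpose with better locality but a more involved construction.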
Affiliation(s)
- Dmitrij Rappoport
  - Department of Chemistry, University of California, Irvine, 1102 Natural Sciences 2, Irvine, California 92697, United States
- Adrian Jinich
  - Weill Cornell Medicine, 1300 York Avenue, Box 65, New York, New York 10065, United States
|
28
|
Meller A, Ward M, Borowsky J, Kshirsagar M, Lotthammer JM, Oviedo F, Ferres JL, Bowman GR. Predicting locations of cryptic pockets from single protein structures using the PocketMiner graph neural network. Nat Commun 2023; 14:1177. [PMID: 36859488 PMCID: PMC9977097 DOI: 10.1038/s41467-023-36699-3] [Citation(s) in RCA: 58] [Impact Index Per Article: 29.0] [Received: 07/21/2022] [Accepted: 02/09/2023] [Indexed: 03/03/2023] Open
Abstract
Cryptic pockets expand the scope of drug discovery by enabling targeting of proteins currently considered undruggable because they lack pockets in their ground state structures. However, identifying cryptic pockets is labor-intensive and slow. The ability to accurately and rapidly predict if and where cryptic pockets are likely to form from a structure would greatly accelerate the search for druggable pockets. Here, we present PocketMiner, a graph neural network trained to predict where pockets are likely to open in molecular dynamics simulations. Applying PocketMiner to single structures from a newly curated dataset of 39 experimentally confirmed cryptic pockets demonstrates that it accurately identifies cryptic pockets (ROC-AUC: 0.87) >1,000-fold faster than existing methods. We apply PocketMiner across the human proteome and show that predicted pockets open in simulations, suggesting that over half of proteins thought to lack pockets based on available structures likely contain cryptic pockets, vastly expanding the potentially druggable proteome.
Affiliation(s)
- Artur Meller
  - Department of Biochemistry and Molecular Biophysics, Washington University in St. Louis, 660 S. Euclid Ave., Box 8231, St. Louis, MO, 63110, USA
  - Medical Scientist Training Program, Washington University in St. Louis, 660 S. Euclid Ave., St. Louis, MO, 63110, USA
- Michael Ward
  - Department of Biochemistry and Molecular Biophysics, Washington University in St. Louis, 660 S. Euclid Ave., Box 8231, St. Louis, MO, 63110, USA
- Jonathan Borowsky
  - Department of Biochemistry and Molecular Biophysics, Washington University in St. Louis, 660 S. Euclid Ave., Box 8231, St. Louis, MO, 63110, USA
- Jeffrey M Lotthammer
  - Department of Biochemistry and Molecular Biophysics, Washington University in St. Louis, 660 S. Euclid Ave., Box 8231, St. Louis, MO, 63110, USA
- Felipe Oviedo
  - AI for Good Research Lab, Microsoft, Redmond, WA, USA
- Gregory R Bowman
  - Department of Biochemistry and Molecular Biophysics, Washington University in St. Louis, 660 S. Euclid Ave., Box 8231, St. Louis, MO, 63110, USA
  - Department of Biochemistry and Molecular Biophysics, University of Pennsylvania, 3620 Hamilton Walk, Philadelphia, PA, 19104, USA
|
29
|
Oostrom M, Akers S, Garrett N, Hanson E, Shaw W, Laureanti JA. Classifying metal-binding sites with neural networks. Protein Sci 2023; 32:e4591. [PMID: 36775934 PMCID: PMC9951193 DOI: 10.1002/pro.4591] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2022] [Revised: 01/30/2023] [Accepted: 02/07/2023] [Indexed: 02/14/2023]
Abstract
To advance our ability to predict impacts of the protein scaffold on catalysis, robust classification schemes to define features of proteins that will influence reactivity are needed. One of these features is a protein's metal-binding ability, as metals are critical to catalytic conversion by metalloenzymes. As a step toward realizing this goal, we used convolutional neural networks (CNNs) to enable the classification of a metal cofactor binding pocket within a protein scaffold. CNNs enable images to be classified based on multiple levels of detail in the image, from edges and corners to entire objects, and can provide rapid classification. First, six CNN models were fine-tuned to classify the 20 standard amino acids to choose a performant model for amino acid classification. This model was then trained in two parallel efforts: to classify a 2D image of the environment within a given radius of the central metal binding site, either an Fe ion or a [2Fe-2S] cofactor, with the metal visible (effort 1) or the metal hidden (effort 2). We further used two sub-classifications of the [2Fe-2S] cofactor: (1) a standard [2Fe-2S] cofactor and (2) a Rieske [2Fe-2S] cofactor. The accuracy for the model correctly identifying all three defined features was >95%, despite our perception of the increased challenge of the metalloenzyme identification. This demonstrates that machine learning methodology to classify and distinguish similar metal-binding sites, even in the absence of a visible cofactor, is indeed possible and offers an additional tool for metal-binding site identification in proteins.
Affiliation(s)
- Marjolein Oostrom
  - National Security Directorate, Pacific Northwest National Laboratory, Richland, Washington, USA
- Sarah Akers
  - National Security Directorate, Pacific Northwest National Laboratory, Richland, Washington, USA
- Noah Garrett
  - Physical and Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington, USA
- Emma Hanson
  - Physical and Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington, USA
- Wendy Shaw
  - Physical and Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington, USA
- Joseph A Laureanti
  - Physical and Computational Sciences Directorate, Pacific Northwest National Laboratory, Richland, Washington, USA
|
30
|
Diaz DJ, Kulikova AV, Ellington AD, Wilke CO. Using machine learning to predict the effects and consequences of mutations in proteins. Curr Opin Struct Biol 2023; 78:102518. [PMID: 36603229 PMCID: PMC9908841 DOI: 10.1016/j.sbi.2022.102518] [Citation(s) in RCA: 26] [Impact Index Per Article: 13.0] [Received: 08/09/2022] [Revised: 11/07/2022] [Accepted: 11/20/2022] [Indexed: 01/05/2023]
Abstract
Machine and deep learning approaches can leverage the increasingly available massive datasets of protein sequences, structures, and mutational effects to predict variants with improved fitness. Many different approaches are being developed, but systematic benchmarking studies indicate that even though the specifics of the machine learning algorithms matter, the more important constraint comes from the data availability and quality utilized during training. In cases where little experimental data are available, unsupervised and self-supervised pre-training with generic protein datasets can still perform well after subsequent refinement via hybrid or transfer learning approaches. Overall, recent progress in this field has been staggering, and machine learning approaches will likely play a major role in future breakthroughs in protein biochemistry and engineering.
Affiliation(s)
- Daniel J Diaz
  - Department of Chemistry, The University of Texas at Austin, 105 E 24TH St., Austin, 78712, Texas, USA; Department of Molecular Biosciences, The University of Texas at Austin, 100 East 24th St., Stop A5000, Austin, 78712, Texas, USA. https://twitter.com/aiproteins
- Anastasiya V Kulikova
  - Department of Integrative Biology, The University of Texas at Austin, 2415 Speedway, Stop C0930, Austin, 78712, Texas, USA
- Andrew D Ellington
  - Department of Molecular Biosciences, The University of Texas at Austin, 100 East 24th St., Stop A5000, Austin, 78712, Texas, USA. https://twitter.com/CSSBatUT
- Claus O Wilke
  - Department of Integrative Biology, The University of Texas at Austin, 2415 Speedway, Stop C0930, Austin, 78712, Texas, USA
|
31
|
Zhang S, Yang K, Liu Z, Lai X, Yang Z, Zeng J, Li S. DrugAI: a multi-view deep learning model for predicting drug-target activating/inhibiting mechanisms. Brief Bioinform 2023; 24:6918762. [PMID: 36527428 DOI: 10.1093/bib/bbac526] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 08/26/2022] [Revised: 10/17/2022] [Accepted: 11/04/2022] [Indexed: 12/23/2022] Open
Abstract
Understanding the mechanisms of candidate drugs plays an important role in drug discovery. The activating/inhibiting mechanisms between drugs and targets are major types of drug mechanisms. Owing to the complexity of drug-target (DT) mechanisms and data scarcity, modelling this problem with deep learning methods to accurately predict DT activating/inhibiting mechanisms remains a considerable challenge. Here, by considering network pharmacology, we propose a multi-view deep learning model, DrugAI, which combines four modules, i.e. a graph neural network for drugs, a convolutional neural network for targets, a network embedding module for drugs and targets and a deep neural network for predicting activating/inhibiting mechanisms between drugs and targets. Computational experiments show that DrugAI performs better than state-of-the-art methods and has good robustness and generalization. To demonstrate the reliability of the predictive results of DrugAI, bioassay experiments are conducted to validate two drugs (notopterol and alpha-asarone) predicted to activate TRPV1. Moreover, external validation based on independent literature and PubChem bioassays bears out 61 pairs of mechanism relationships between natural products and their targets predicted by DrugAI. DrugAI, for the first time, provides a powerful multi-view deep learning framework for robust prediction of DT activating/inhibiting mechanisms.
Affiliation(s)
- Siqin Zhang
  - Institute for TCM-X, MOE Key Laboratory of Bioinformatics/Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Kuo Yang
  - Institute for TCM-X, MOE Key Laboratory of Bioinformatics/Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
- Zhenhong Liu
  - Institute for Brain Disorders, Dongzhimen Hospital, Beijing University of Chinese Medicine, Beijing 100700, China
- Xinxing Lai
  - Institute for Brain Disorders, Dongzhimen Hospital, Beijing University of Chinese Medicine, Beijing 100700, China
- Zhen Yang
  - School of Traditional Chinese Medicine, Beijing University of Chinese Medicine, Beijing 100029, China
- Jianyang Zeng
  - Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing 100084, China
- Shao Li
  - Institute for TCM-X, MOE Key Laboratory of Bioinformatics/Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, Beijing 100084, China
|
32
|
Paik I, Ngo PHT, Shroff R, Diaz DJ, Maranhao AC, Walker DJ, Bhadra S, Ellington AD. Improved Bst DNA Polymerase Variants Derived via a Machine Learning Approach. Biochemistry 2023; 62:410-418. [PMID: 34762799 PMCID: PMC9514386 DOI: 10.1021/acs.biochem.1c00451] [Citation(s) in RCA: 25] [Impact Index Per Article: 12.5] [Indexed: 02/02/2023]
Abstract
The DNA polymerase I from Geobacillus stearothermophilus (also known as Bst DNAP) is widely used in isothermal amplification reactions, where its strand displacement ability is prized. More robust versions of this enzyme should be enabled for diagnostic applications, especially for carrying out higher temperature reactions that might proceed more quickly. To this end, we appended a short fusion domain from the actin-binding protein villin that improved both stability and purification of the enzyme. In parallel, we have developed a machine learning algorithm that assesses the relative fit of individual amino acids to their chemical microenvironments at any position in a protein and applied this algorithm to predict sequence substitutions in Bst DNAP. The top predicted variants had greatly improved thermotolerance (heating prior to assay), and upon combination, the mutations showed additive thermostability, with denaturation temperatures up to 2.5 °C higher than the parental enzyme. The increased thermostability of the enzyme allowed faster loop-mediated isothermal amplification assays to be carried out at 73 °C, where both Bst DNAP and its improved commercial counterpart Bst 2.0 are inactivated. Overall, this is one of the first examples of the application of machine learning approaches to the thermostabilization of an enzyme.
Affiliation(s)
- Inyup Paik
  - Department of Molecular Biosciences, College of Natural Sciences, The University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712, United States
- Phuoc H. T. Ngo
  - Department of Molecular Biosciences, College of Natural Sciences, The University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology and Department of Chemistry, College of Natural Sciences, The University of Texas at Austin, Austin, Texas 78712, United States
- Raghav Shroff
  - Department of Molecular Biosciences, College of Natural Sciences, The University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712, United States; CCDC Army Research Lab-South, Austin, Texas 78712, United States
- Daniel J. Diaz
  - Center for Systems and Synthetic Biology and Department of Chemistry, College of Natural Sciences, The University of Texas at Austin, Austin, Texas 78712, United States
- Andre C. Maranhao
  - Department of Molecular Biosciences, College of Natural Sciences, The University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712, United States
- David J.F. Walker
  - Department of Molecular Biosciences, College of Natural Sciences, The University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712, United States
- Sanchita Bhadra
  - Department of Molecular Biosciences, College of Natural Sciences, The University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712, United States
- Andrew D. Ellington
  - Department of Molecular Biosciences, College of Natural Sciences, The University of Texas at Austin, Austin, Texas 78712, United States; Center for Systems and Synthetic Biology, The University of Texas at Austin, Austin, Texas 78712, United States
|
33
|
Kosonocky CW, Ellington AD. Evolving to Evolve, Dan Tawfik's Insights into Protein Engineering. Biochemistry 2023; 62:145-147. [PMID: 36647679 DOI: 10.1021/acs.biochem.2c00668] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/18/2023]
|
34
|
Gupta S, Baudry J, Menon V. Big Data analytics for improved prediction of ligand binding and conformational selection. Front Mol Biosci 2023; 9:953984. [PMID: 36710883 PMCID: PMC9878559 DOI: 10.3389/fmolb.2022.953984] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2022] [Accepted: 12/16/2022] [Indexed: 01/15/2023] Open
Abstract
This research introduces new machine learning and deep learning approaches, collectively referred to as Big Data analytics techniques, that address the protein conformational selection mechanism for protein:ligand complexes. The novel Big Data analytics techniques presented in this work enable efficient data processing of a large number of protein:ligand complexes and provide better identification of specific protein properties that are responsible for a high probability of correct prediction of protein:ligand binding. The GPCR proteins ADORA2A (Adenosine A2a Receptor), ADRB2 (Adrenoceptor Beta 2), OPRD1 (Opioid Receptor Delta 1) and OPRK1 (Opioid Receptor Kappa 1) are examined in this study using Big Data analytics techniques, which can efficiently process a huge ensemble of protein conformations and predict binding protein conformations (i.e., the protein conformations that will be selected by the ligands for binding) about 10-38 times better than random selection. In addition to providing a Big Data approach to the conformational selection mechanism, this also opens the door to the systematic identification of such "binding conformations" for proteins. The physico-chemical features that are useful in predicting the "binding conformations" are largely, but not entirely, shared among the test proteins, indicating that the biophysical properties that drive the conformation selection mechanism may, to an extent, be protein-specific for the protein properties used in this work.
Affiliation(s)
- Shivangi Gupta
  - Department of Computer Science, The University of Alabama in Huntsville, Huntsville, AL, United States
- Jerome Baudry (corresponding author)
  - Department of Biological Sciences, The University of Alabama in Huntsville, Huntsville, AL, United States
- Vineetha Menon (corresponding author)
  - Department of Computer Science, The University of Alabama in Huntsville, Huntsville, AL, United States
|
35
|
Abstract
This chapter outlines the myriad applications of machine learning (ML) in synthetic biology, specifically in engineering cell and protein activity, and metabolic pathways. Though by no means comprehensive, the chapter highlights several prominent computational tools applied in the field and their potential use cases. The examples detailed reinforce how ML algorithms can enhance synthetic biology research by providing data-driven insights into the behavior of living systems, even without detailed knowledge of their underlying mechanisms. By doing so, ML promises to increase the efficiency of research projects by modeling hypotheses in silico that can then be tested through experiments. While challenges related to training dataset generation and computational costs remain, ongoing improvements in ML tools are paving the way for smarter and more streamlined synthetic biology workflows that can be readily employed to address grand challenges across manufacturing, medicine, engineering, agriculture, and beyond.
Affiliation(s)
- Brendan Fu-Long Sieow
  - NUS Synthetic Biology for Clinical and Technological Innovation (SynCTI), National University of Singapore, Singapore, Singapore
  - Synthetic Biology Translational Research Programme, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
  - NUS Graduate School for Integrative Sciences and Engineering Programme, National University of Singapore, Singapore, Singapore
- Ryan De Sotto
  - NUS Synthetic Biology for Clinical and Technological Innovation (SynCTI), National University of Singapore, Singapore, Singapore
  - Synthetic Biology Translational Research Programme, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Zhi Ren Darren Seet
  - NUS Synthetic Biology for Clinical and Technological Innovation (SynCTI), National University of Singapore, Singapore, Singapore
  - Synthetic Biology Translational Research Programme, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- In Young Hwang
  - NUS Synthetic Biology for Clinical and Technological Innovation (SynCTI), National University of Singapore, Singapore, Singapore
  - Synthetic Biology Translational Research Programme, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
- Matthew Wook Chang
  - NUS Synthetic Biology for Clinical and Technological Innovation (SynCTI), National University of Singapore, Singapore, Singapore
  - Synthetic Biology Translational Research Programme, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
|
36
|
Molecular Dynamics Simulation of Poly(Ether Ether Ketone) (PEEK) Polymer to Analyze Intermolecular Ordering by Low Wavenumber Raman Spectroscopy and X-ray Diffraction. Polymers (Basel) 2022; 14:polym14245406. [PMID: 36559773 PMCID: PMC9786246 DOI: 10.3390/polym14245406] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/11/2022] [Revised: 12/05/2022] [Accepted: 12/07/2022] [Indexed: 12/14/2022] Open
Abstract
Poly(ether ether ketone) (PEEK) is an important engineering plastic and evaluation of its local crystallinity in composites is critical for producing strong and reliable mechanical parts. Low wavenumber Raman spectroscopy and X-ray diffraction are promising techniques for the analysis of crystal ordering but a detailed understanding of the spectra has not been established. Here, we use molecular dynamics combined with a newly developed approximation to simulate local vibrational features to understand the effect of intermolecular ordering in the Raman spectra. We found that intermolecular ordering does affect the low wavenumber Raman spectra and the X-ray diffraction as observed in the experiment. Raman spectroscopy of intermolecular vibration modes is a promising technique to evaluate the local crystallinity of PEEK and other engineering plastics, and the present technique offers an estimation without requiring heavy computational resources.
|
37
|
Liu H, Chen Q. Computational protein design with data-driven approaches: Recent developments and perspectives. WIREs Computational Molecular Science 2022. [DOI: 10.1002/wcms.1646] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Affiliation(s)
- Haiyan Liu
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
  - Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China
  - School of Data Science, University of Science and Technology of China, Hefei, Anhui, China
- Quan Chen
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
  - Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China
|
38
|
Zhao Y, Shao J, Asmann YW. Assessment and Optimization of Explainable Machine Learning Models Applied to Transcriptomic Data. Genomics, Proteomics & Bioinformatics 2022; 20:899-911. [PMID: 35931322 PMCID: PMC10025763 DOI: 10.1016/j.gpb.2022.07.003] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/10/2021] [Revised: 06/05/2022] [Accepted: 07/25/2022] [Indexed: 01/12/2023]
Abstract
Explainable artificial intelligence aims to interpret how machine learning models make decisions, and many model explainers have been developed in the computer vision field. However, understanding of the applicability of these model explainers to biological data is still lacking. In this study, we comprehensively evaluated multiple explainers by interpreting pre-trained models for predicting tissue types from transcriptomic data and by identifying the top contributing genes from each sample with the greatest impacts on model prediction. To improve the reproducibility and interpretability of results generated by model explainers, we proposed a series of optimization strategies for each explainer on two different model architectures of multilayer perceptron (MLP) and convolutional neural network (CNN). We observed three groups of explainer and model architecture combinations with high reproducibility. Group II, which contains three model explainers on aggregated MLP models, identified top contributing genes in different tissues that exhibited tissue-specific manifestation and were potential cancer biomarkers. In summary, our work provides novel insights and guidance for exploring biological mechanisms using explainable machine learning models.
Affiliation(s)
- Yongbing Zhao
  - Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, FL 32224, USA
- Jinfeng Shao
  - The Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Rockville, MD 20852, USA
- Yan W Asmann
  - Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, FL 32224, USA
39
Wilman W, Wróbel S, Bielska W, Deszynski P, Dudzic P, Jaszczyszyn I, Kaniewski J, Młokosiewicz J, Rouyan A, Satława T, Kumar S, Greiff V, Krawczyk K. Machine-designed biotherapeutics: opportunities, feasibility and advantages of deep learning in computational antibody discovery. Brief Bioinform 2022; 23:bbac267. [PMID: 35830864 PMCID: PMC9294429 DOI: 10.1093/bib/bbac267] [Citation(s) in RCA: 40] [Impact Index Per Article: 13.3] [Received: 03/14/2022] [Revised: 05/09/2022] [Accepted: 06/07/2022] [Indexed: 11/13/2022]
Abstract
Antibodies are versatile molecular binders with an established and growing role as therapeutics. Computational approaches to developing and designing these molecules are being increasingly used to complement traditional lab-based processes. Nowadays, in silico methods fill multiple elements of the discovery stage, such as characterizing antibody-antigen interactions and identifying developability liabilities. Recently, computational methods tackling such problems have begun to follow machine learning paradigms, in many cases deep learning specifically. This paradigm shift offers improvements in established areas such as structure or binding prediction and opens up new possibilities such as language-based modeling of antibody repertoires or machine-learning-based generation of novel sequences. In this review, we critically examine the recent developments in (deep) machine learning approaches to therapeutic antibody design with implications for fully computational antibody design.
40
Liu Y, Zhang L, Wang W, Zhu M, Wang C, Li F, Zhang J, Li H, Chen Q, Liu H. Rotamer-free protein sequence design based on deep learning and self-consistency. Nature Computational Science 2022; 2:451-462. [PMID: 38177863 DOI: 10.1038/s43588-022-00273-6] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 12/27/2021] [Accepted: 06/07/2022] [Indexed: 01/06/2024]
Abstract
Several previously proposed deep learning methods to design amino acid sequences that autonomously fold into a given protein backbone yielded promising results in computational tests but did not outperform conventional energy function-based methods in wet experiments. Here we present the ABACUS-R method, which uses an encoder-decoder network trained using a multitask learning strategy to predict the sidechain type of a central residue from its three-dimensional local environment, which includes, besides other features, the types but not the conformations of the surrounding sidechains. This eliminates the need to reconstruct and optimize sidechain structures, and drastically simplifies the sequence design process. Thus iteratively applying the encoder-decoder to different central residues is able to produce self-consistent overall sequences for a target backbone. Results of wet experiments, including five structures solved by X-ray crystallography, show that ABACUS-R outperforms state-of-the-art energy function-based methods in success rate and design precision.
Affiliation(s)
- Yufeng Liu
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
- Lu Zhang
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
- Weilun Wang
  - CAS Key Laboratory of GIPAS, School of Information Science and Technology, Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, Anhui, China
- Min Zhu
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
- Chenchen Wang
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
- Fudong Li
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
  - Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China
- Jiahai Zhang
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
  - Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China
- Houqiang Li
  - CAS Key Laboratory of GIPAS, School of Information Science and Technology, Department of Electronic Engineering and Information Science, University of Science and Technology of China, Hefei, Anhui, China
- Quan Chen
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
  - Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China
- Haiyan Liu
  - MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
  - Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China
  - School of Data Science, University of Science and Technology of China, Hefei, Anhui, China
41
Misiura M, Shroff R, Thyer R, Kolomeisky AB. DLPacker: Deep learning for prediction of amino acid side chain conformations in proteins. Proteins 2022; 90:1278-1290. [PMID: 35122328 DOI: 10.1002/prot.26311] [Citation(s) in RCA: 23] [Impact Index Per Article: 7.7] [Received: 08/04/2021] [Revised: 12/03/2021] [Accepted: 12/07/2021] [Indexed: 12/20/2022]
Abstract
Prediction of side chain conformations of amino acids in proteins (also termed "packing") is an important and challenging part of protein structure prediction with many interesting applications in protein design. A variety of methods for packing have been developed but more accurate ones are still needed. Machine learning (ML) methods have recently become a powerful tool for solving various problems in diverse areas of science, including structural biology. In this study, we evaluate the potential of deep neural networks (DNNs) for prediction of amino acid side chain conformations. We formulate the problem as image-to-image transformation and train a U-net style DNN to solve the problem. We show that our method outperforms other physics-based methods by a significant margin: reconstruction RMSDs for most amino acids are about 20% smaller compared to SCWRL4 and Rosetta Packer with RMSDs for bulky hydrophobic amino acids Phe, Tyr, and Trp being up to 50% smaller.
Affiliation(s)
- Mikita Misiura
  - Department of Chemistry, Center for Theoretical Biological Physics, Rice University, Houston, Texas, USA
- Ross Thyer
  - Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas, USA
- Anatoly B Kolomeisky
  - Department of Chemistry, Center for Theoretical Biological Physics, Rice University, Houston, Texas, USA
  - Department of Chemical and Biomolecular Engineering, Rice University, Houston, Texas, USA
  - Department of Physics and Astronomy, Center for Theoretical Biological Physics, Rice University, Houston, Texas, USA
42
Horne J, Shukla D. Recent Advances in Machine Learning Variant Effect Prediction Tools for Protein Engineering. Ind Eng Chem Res 2022; 61:6235-6245. [PMID: 36051311 PMCID: PMC9432854 DOI: 10.1021/acs.iecr.1c04943] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 01/28/2023]
Abstract
Proteins are Nature's molecular machinery and comprise diverse roles while consisting of chemically similar building blocks. In recent years, protein engineering and design have become important research areas, with many applications in the pharmaceutical, energy, and biocatalysis fields, among others, where the aim is to ultimately create a protein given desired structural and functional properties. It is often critical to model the relationship between a protein's sequence, folded structure, and biological function to assist in such protein engineering pursuits. However, significant challenges remain in concretely mapping an amino acid sequence to specific protein properties and biological activities. Mutations may enhance or diminish molecular protein function, and the epistatic interactions between mutations result in an inherently complex mapping between genetic modifications and protein function. Therefore, estimating the quantitative effects of mutations on protein function(s) remains a grand challenge of biology, bioinformatics, and many related fields and would rapidly accelerate protein engineering tasks when successful. Such estimation is often known as variant effect prediction (VEP). However, progress has been demonstrated in recent years with the development of machine learning (ML) methods in modeling the relationship between mutations and protein function. In this Review, recent advances in variant effect prediction (VEP) are discussed as tools for protein engineering, focusing on techniques incorporating gains from the broader ML community and challenges in estimating biomolecular functional differences. Primary developments highlighted include convolutional neural networks, graph neural networks, and natural language embeddings for protein sequences.
Affiliation(s)
- Jesse Horne
  - Department of Chemical and Biomolecular Engineering, University of Illinois Urbana-Champaign, Champaign, Illinois 61801, United States
- Diwakar Shukla
  - Department of Chemical and Biomolecular Engineering and Department of Bioengineering, University of Illinois Urbana-Champaign, Champaign, Illinois 61801, United States
  - Department of Plant Biology, Cancer Center at Illinois, and Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Champaign, Illinois 61801, United States
43
Gupta AK, Raghavachari K. Three-Dimensional Convolutional Neural Networks Utilizing Molecular Topological Features for Accurate Atomization Energy Predictions. J Chem Theory Comput 2022; 18:2132-2143. [PMID: 35226496 DOI: 10.1021/acs.jctc.1c00504] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/30/2022]
Abstract
Deep learning methods provide a novel way to establish a correlation between two quantities. In this context, computer vision techniques such as three-dimensional (3D)-convolutional neural networks become a natural choice to associate a molecular property with its structure due to the inherent 3D nature of a molecule. However, traditional 3D input data structures are intrinsically sparse in nature, which tend to induce instabilities during the learning process, which in turn may lead to underfitted results. To address this deficiency, in this project, we propose to use quantum-chemically derived molecular topological features, namely, localized orbital locator and electron localization function, as molecular descriptors, which provide a relatively denser input representation in a 3D space. Such topological features provide a detailed picture of the atomic and electronic configuration and interatomic interactions in the molecule and hence are ideal for predicting properties that are highly dependent on the physical or electronic structure of the molecule. Herein, we demonstrate the efficacy of our proposed model by applying it to the task of predicting atomization energies for the QM9-G4MP2 data set, which contains ∼134k molecules. Furthermore, we incorporated the Δ-machine learning approach into our model, which enabled us to reach beyond benchmark accuracy levels (∼1.0 kJ mol⁻¹). As a result, we consistently obtain impressive mean absolute errors of the order 0.1 kcal mol⁻¹ (∼0.42 kJ mol⁻¹) versus the G4(MP2) theory using relatively modest models, which could potentially be improved further in a systematic manner using additional compute resources.
Affiliation(s)
- Ankur Kumar Gupta
  - Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States
- Krishnan Raghavachari
  - Department of Chemistry, Indiana University, Bloomington, Indiana 47405, United States
44
Guo L, He J, Lin P, Huang SY, Wang J. TRScore: a three-dimensional RepVGG-based scoring method for ranking protein docking models. Bioinformatics 2022; 38:2444-2451. [PMID: 35199137 DOI: 10.1093/bioinformatics/btac120] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/22/2021] [Revised: 01/19/2022] [Accepted: 02/21/2022] [Indexed: 11/13/2022]
Abstract
MOTIVATION: Protein-protein interactions (PPI) play important roles in cellular activities. Due to the technical difficulty and high cost of experimental methods, there are considerable interests towards the development of computational approaches, such as protein docking, to decipher PPI patterns. One of the important and difficult aspects in protein docking is recognizing near-native conformations from a set of decoys, but unfortunately traditional scoring functions still suffer from limited accuracy. Therefore, new scoring methods are pressingly needed in methodological and/or practical implications. RESULTS: We present a new deep learning-based scoring method for ranking protein-protein docking models based on a three-dimensional (3D) RepVGG network, named TRScore. To recognize near-native conformations from a set of decoys, TRScore voxelizes the protein-protein interface into a 3D grid labeled by the number of atoms in different physicochemical classes. Benefiting from the deep convolutional RepVGG architecture, TRScore can effectively capture the subtle differences between energetically favorable near-native models and unfavorable non-native decoys without needing extra information. TRScore was extensively evaluated on diverse test sets including protein-protein docking benchmark 5.0 update set, DockGround decoy set, as well as realistic CAPRI decoy set, and overall obtained a significant improvement over existing methods in cross validation and independent evaluations. AVAILABILITY: Codes available at: https://github.com/BioinformaticsCSU/TRScore.
Affiliation(s)
- Linyuan Guo
  - School of Computer Science, Central South University, Changsha, Hunan 410083, China
- Jiahua He
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Peicong Lin
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Sheng-You Huang
  - School of Physics, Huazhong University of Science and Technology, Wuhan, Hubei 430074, China
- Jianxin Wang
  - School of Computer Science, Central South University, Changsha, Hunan 410083, China
45
Anand N, Eguchi R, Mathews II, Perez CP, Derry A, Altman RB, Huang PS. Protein sequence design with a learned potential. Nat Commun 2022; 13:746. [PMID: 35136054 PMCID: PMC8826426 DOI: 10.1038/s41467-022-28313-9] [Citation(s) in RCA: 77] [Impact Index Per Article: 25.7] [Received: 08/14/2021] [Accepted: 01/08/2022] [Indexed: 11/08/2022]
Abstract
The task of protein sequence design is central to nearly all rational protein engineering problems, and enormous effort has gone into the development of energy functions to guide design. Here, we investigate the capability of a deep neural network model to automate design of sequences onto protein backbones, having learned directly from crystal structure data and without any human-specified priors. The model generalizes to native topologies not seen during training, producing experimentally stable designs. We evaluate the generalizability of our method to a de novo TIM-barrel scaffold. The model produces novel sequences, and high-resolution crystal structures of two designs show excellent agreement with in silico models. Our findings demonstrate the tractability of an entirely learned method for protein sequence design.
Affiliation(s)
- Namrata Anand
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
- Raphael Eguchi
  - Department of Biochemistry, Stanford University, Stanford, CA, USA
- Irimpan I Mathews
  - Stanford Synchrotron Radiation Lightsource, Menlo Park, CA, 94025, USA
- Carla P Perez
  - Biophysics Program, Stanford University, Stanford, CA, USA
- Alexander Derry
  - Biomedical Informatics Training Program, Stanford University, Stanford, CA, USA
- Russ B Altman
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
  - Departments of Genetics and Medicine, Stanford University, Stanford, CA, USA
- Po-Ssu Huang
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
46
Ovchinnikov S, Huang PS. Structure-based protein design with deep learning. Curr Opin Chem Biol 2021; 65:136-144. [PMID: 34547592 PMCID: PMC8671290 DOI: 10.1016/j.cbpa.2021.08.004] [Citation(s) in RCA: 43] [Impact Index Per Article: 10.8] [Received: 06/07/2021] [Accepted: 08/13/2021] [Indexed: 12/11/2022]
Abstract
Since the first revelation of proteins functioning as macromolecular machines through their three dimensional structures, researchers have been intrigued by the marvelous ways the biochemical processes are carried out by proteins. The aspiration to understand protein structures has fueled extensive efforts across different scientific disciplines. In recent years, it has been demonstrated that proteins with new functionality or shapes can be designed via structure-based modeling methods, and the design strategies have combined all available information - but largely piece-by-piece - from sequence derived statistics to the detailed atomic-level modeling of chemical interactions. Despite the significant progress, incorporating data-derived approaches through the use of deep learning methods can be a game changer. In this review, we summarize current progress, compare the arc of developing the deep learning approaches with the conventional methods, and describe the motivation and concepts behind current strategies that may lead to potential future opportunities.
Affiliation(s)
- Sergey Ovchinnikov
  - John Harvard Distinguished Science Fellowship Program, Harvard University, Cambridge, MA, 02138, USA
- Po-Ssu Huang
  - Department of Bioengineering, Stanford University, Stanford, CA, 94305, USA
47
Tong X, Liu S, Gu J, Wu C, Liang Y, Shi X. Amino acid environment affinity model based on graph attention network. J Bioinform Comput Biol 2021; 20:2150032. [PMID: 34775920 DOI: 10.1142/s0219720021500323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Proteins are engines involved in almost all functions of life. They have specific spatial structures formed by twisting and folding of one or more polypeptide chains composed of amino acids. Protein sites are protein structure microenvironments that can be identified by three-dimensional locations and local neighborhoods in which the structure or function exists. Understanding the amino acid environment affinity is essential for additional protein structural or functional studies, such as mutation analysis and functional site detection. In this study, an amino acid environment affinity model based on the graph attention network was developed. Initially, we constructed a protein graph according to the distance between amino acid pairs. Then, we extracted a set of structural features for each node. Finally, the protein graph and the associated node feature set were set to input the graph attention network model and to obtain the amino acid affinities. Numerical results show that our proposed method significantly outperforms a recent 3DCNN-based method by almost 30%.
Affiliation(s)
- Xueheng Tong
  - College of Computer Science and Technology, Jilin University, Qianjing Street 2699, Changchun, Jilin 130012, China
- Shuqi Liu
  - College of Computer Science and Technology, Jilin University, Qianjing Street 2699, Changchun, Jilin 130012, China
- Jiawei Gu
  - College of Computer Science and Technology, Jilin University, Qianjing Street 2699, Changchun, Jilin 130012, China
- Chunguo Wu
  - College of Computer Science and Technology, Jilin University, Qianjing Street 2699, Changchun, Jilin 130012, China
- Yanchun Liang
  - School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, Guangdong 519041, China
- Xiaohu Shi
  - College of Computer Science and Technology, Jilin University, Qianjing Street 2699, Changchun, Jilin 130012, China
  - School of Computer Science, Zhuhai College of Science and Technology, Zhuhai, Guangdong 519041, China
48
Learning the local landscape of protein structures with convolutional neural networks. J Biol Phys 2021; 47:435-454. [PMID: 34751854 DOI: 10.1007/s10867-021-09593-6] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 08/17/2021] [Accepted: 10/18/2021] [Indexed: 10/19/2022]
Abstract
One fundamental problem of protein biochemistry is to predict protein structure from amino acid sequence. The inverse problem, predicting either entire sequences or individual mutations that are consistent with a given protein structure, has received much less attention even though it has important applications in both protein engineering and evolutionary biology. Here, we ask whether 3D convolutional neural networks (3D CNNs) can learn the local fitness landscape of protein structure to reliably predict either the wild-type amino acid or the consensus in a multiple sequence alignment from the local structural context surrounding site of interest. We find that the network can predict wild type with good accuracy, and that network confidence is a reliable measure of whether a given prediction is likely going to be correct or not. Predictions of consensus are less accurate and are primarily driven by whether or not the consensus matches the wild type. Our work suggests that high-confidence mis-predictions of the wild type may identify sites that are primed for mutation and likely targets for protein engineering.
49
Wang G, Zhai YJ, Xue ZZ, Xu YY. Improving Protein Subcellular Location Classification by Incorporating Three-Dimensional Structure Information. Biomolecules 2021; 11:1607. [PMID: 34827605 PMCID: PMC8615982 DOI: 10.3390/biom11111607] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/30/2021] [Revised: 10/27/2021] [Accepted: 10/27/2021] [Indexed: 12/12/2022]
Abstract
The subcellular locations of proteins are closely related to their functions. In the past few decades, the application of machine learning algorithms to predict protein subcellular locations has been an important topic in proteomics. However, most studies in this field used only amino acid sequences as the data source. Only a few works focused on other protein data types. For example, three-dimensional structures, which contain far more functional protein information than sequences, remain to be explored. In this work, we extracted various handcrafted features to describe the protein structures from physical, chemical, and topological aspects, as well as the learned features obtained by deep neural networks. We then used these features to classify the protein subcellular locations. Our experimental results demonstrated that some of these structural features have a certain effect on the protein location classification, and can help improve the performance of sequence-based location predictors. Our method provides a new view for the analysis of protein spatial distribution, and is anticipated to be used in revealing the relationships between protein structures and functions.
Affiliation(s)
- Ge Wang
  - School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China
  - Guangdong Provincial Key Laboratory of Medical Imaging Processing, Southern Medical University, Guangzhou 510515, China
  - Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China
- Yu-Jia Zhai
  - Guangzhou Women and Children’s Medical Center, Department of Pharmacy, Guangzhou Medical University, Guangzhou 510623, China
- Zhen-Zhen Xue
  - School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China
  - Guangdong Provincial Key Laboratory of Medical Imaging Processing, Southern Medical University, Guangzhou 510515, China
  - Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China
  - Paul C. Lauterbur Research Center for Biomedical Imaging, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
- Ying-Ying Xu
  - School of Biomedical Engineering, Southern Medical University, Guangzhou 510515, China
  - Guangdong Provincial Key Laboratory of Medical Imaging Processing, Southern Medical University, Guangzhou 510515, China
  - Guangdong Province Engineering Laboratory for Medical Imaging and Diagnostic Technology, Southern Medical University, Guangzhou 510515, China
50
Yu Y, Xu T, Li J, Qiu Y, Rong Y, Gong Z, Cheng X, Dong L, Liu W, Li J, Dou D, Huang J. A Novel Scalarized Scaffold Hopping Algorithm with Graph-Based Variational Autoencoder for Discovery of JAK1 Inhibitors. ACS Omega 2021; 6:22945-22954. [PMID: 34514265 PMCID: PMC8427782 DOI: 10.1021/acsomega.1c03613] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 07/08/2021] [Accepted: 08/09/2021] [Indexed: 06/13/2023]
Abstract
We have developed a graph-based Variational Autoencoder with Gaussian Mixture hidden space (GraphGMVAE), a deep learning approach for controllable magnitude of scaffold hopping in generative chemistry. It can effectively and accurately generate molecules from a given reference compound, with excellent scaffold novelty against known molecules in the literature or patents (97.9% are novel scaffolds). Moreover, a pipeline for prioritizing the generated compounds was also proposed to narrow down our validation focus. In this work, GraphGMVAE was validated by rapidly hopping the scaffold from FDA-approved upadacitinib, which is an inhibitor of human Janus kinase 1 (JAK1), to generate more potent molecules with novel chemical scaffolds. Seven compounds were synthesized and tested to be active in biochemical assays. The most potent molecule has 5.0 nM activity against JAK1 kinase, which shows that the GraphGMVAE model can design molecules like how a human expert does but with high efficiency and accuracy.
Affiliation(s)
- Yang Yu
  - Tencent AI Lab, Tencent, Shenzhen 518057, P. R. China
- Tingyang Xu
  - Tencent AI Lab, Tencent, Shenzhen 518057, P. R. China
- Jiawen Li
  - HitGen Inc., Tianfu International Bio-Town, Chengdu 610200, Sichuan, P. R. China
- Yaping Qiu
  - HitGen Inc., Tianfu International Bio-Town, Chengdu 610200, Sichuan, P. R. China
- Yu Rong
  - Tencent AI Lab, Tencent, Shenzhen 518057, P. R. China
- Zhen Gong
  - HitGen Inc., Tianfu International Bio-Town, Chengdu 610200, Sichuan, P. R. China
- Xuemin Cheng
  - HitGen Inc., Tianfu International Bio-Town, Chengdu 610200, Sichuan, P. R. China
- Liming Dong
  - HitGen Inc., Tianfu International Bio-Town, Chengdu 610200, Sichuan, P. R. China
- Wei Liu
  - Tencent AI Lab, Tencent, Shenzhen 518057, P. R. China
- Jin Li
  - HitGen Inc., Tianfu International Bio-Town, Chengdu 610200, Sichuan, P. R. China
- Dengfeng Dou
  - HitGen Inc., Tianfu International Bio-Town, Chengdu 610200, Sichuan, P. R. China
- Junzhou Huang
  - Tencent AI Lab, Tencent, Shenzhen 518057, P. R. China