1
Gu C, Ghasemi SM, Cai Y, Fahrmann JF, Long JP, Katayama H, Wu C, Vykoukal J, Dennison JB, Hanash S, Do KA, Irajizad E. Grape-Pi: graph-based neural networks for enhanced protein identification in proteomics pipelines. BIOINFORMATICS ADVANCES 2025; 5:vbaf095. [PMID: 40406669 PMCID: PMC12096076 DOI: 10.1093/bioadv/vbaf095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2024] [Revised: 04/02/2025] [Accepted: 04/24/2025] [Indexed: 05/26/2025]
Abstract
Motivation: Protein identification via mass spectrometry (MS) is the primary method for untargeted protein detection. However, the identification process is challenging due to data complexity and the need to control false discovery rates (FDR) of protein identification. To address these challenges, we developed a graph neural network (GNN)-based model, Graph Neural Network using Protein-Protein Interaction for Enhancing Protein Identification (Grape-Pi), which is applicable to all proteomics pipelines. This model leverages protein-protein interaction (PPI) data and employs two types of message-passing layers to integrate evidence from both the target protein and its interactors, thereby improving identification accuracy. Results: Grape-Pi achieved significant improvements in area under receiver-operating characteristic curve (AUC) in differentiating present and absent proteins: 18% and 7% in two yeast samples and 9% in gastric samples over traditional methods in the test dataset. Additionally, proteins identified via Grape-Pi in gastric samples demonstrated a high correlation with mRNA data and identified gastric cancer proteins, like MAP4K4, missed by conventional methods. Availability and implementation: Grape-Pi is freely available at https://zenodo.org/records/11310518 and https://github.com/FDUguchunhui/GrapePi.
Affiliation(s)
- Chunhui Gu
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
| | - Seyyed Mahmood Ghasemi
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
| | - Yining Cai
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
| | - Johannes F Fahrmann
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
| | - James P Long
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
| | - Hiroyuki Katayama
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
| | - Chong Wu
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
| | - Jody Vykoukal
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
| | - Jennifer B Dennison
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
| | - Samir Hanash
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
| | - Kim-Anh Do
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
| | - Ehsan Irajizad
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
- Department of Clinical Cancer Prevention, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, United States
2
Dauparas J, Lee GR, Pecoraro R, An L, Anishchenko I, Glasscock C, Baker D. Atomic context-conditioned protein sequence design using LigandMPNN. Nat Methods 2025; 22:717-723. [PMID: 40155723 PMCID: PMC11978504 DOI: 10.1038/s41592-025-02626-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2023] [Accepted: 02/10/2025] [Indexed: 04/01/2025]
Abstract
Protein sequence design in the context of small molecules, nucleotides and metals is critical to enzyme and small-molecule binder and sensor design, but current state-of-the-art deep-learning-based sequence design methods are unable to model nonprotein atoms and molecules. Here we describe a deep-learning-based protein sequence design method called LigandMPNN that explicitly models all nonprotein components of biomolecular systems. LigandMPNN significantly outperforms Rosetta and ProteinMPNN on native backbone sequence recovery for residues interacting with small molecules (63.3% versus 50.4% and 50.5%), nucleotides (50.5% versus 35.2% and 34.0%) and metals (77.5% versus 36.0% and 40.6%). LigandMPNN generates not only sequences but also sidechain conformations to allow detailed evaluation of binding interactions. LigandMPNN has been used to design over 100 experimentally validated small-molecule and DNA-binding proteins with high affinity and high structural accuracy (as indicated by four X-ray crystal structures), and redesign of Rosetta small-molecule binder designs has increased binding affinity by as much as 100-fold. We anticipate that LigandMPNN will be widely useful for designing new binding proteins, sensors and enzymes.
Affiliation(s)
- Justas Dauparas
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Gyu Rie Lee
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | - Robert Pecoraro
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
- Department of Physics, University of Washington, Seattle, WA, USA
| | - Linna An
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Ivan Anishchenko
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - Cameron Glasscock
- Department of Biochemistry, University of Washington, Seattle, WA, USA
- Institute for Protein Design, University of Washington, Seattle, WA, USA
| | - David Baker
- Department of Biochemistry, University of Washington, Seattle, WA, USA.
- Institute for Protein Design, University of Washington, Seattle, WA, USA.
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA.
3
Høie MH, Hummer AM, Olsen TH, Aguilar-Sanjuan B, Nielsen M, Deane CM. AntiFold: improved structure-based antibody design using inverse folding. BIOINFORMATICS ADVANCES 2025; 5:vbae202. [PMID: 40170886 PMCID: PMC11961221 DOI: 10.1093/bioadv/vbae202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2024] [Revised: 11/28/2024] [Accepted: 03/19/2025] [Indexed: 04/03/2025]
Abstract
Summary: The design and optimization of antibodies requires an intricate balance across multiple properties. Protein inverse folding models, capable of generating diverse sequences folding into the same structure, are promising tools for maintaining structural integrity during antibody design. Here, we present AntiFold, an antibody-specific inverse folding model, fine-tuned from ESM-IF1 on solved and predicted antibody structures. AntiFold outperforms existing inverse folding tools on sequence recovery across complementarity-determining regions, with designed sequences showing high structural similarity to their solved counterpart. It additionally achieves stronger correlations when predicting antibody-antigen binding affinity in a zero-shot manner. AntiFold assigns low probabilities to mutations that disrupt antigen binding, synergizing with protein language model residue probabilities, and demonstrates promise for guiding antibody optimization while retaining structure-related properties. Availability and implementation: AntiFold is freely available under the BSD 3-Clause as a web server (https://opig.stats.ox.ac.uk/webapps/antifold/) and pip-installable package (https://github.com/oxpig/AntiFold).
Affiliation(s)
- Magnus Haraldson Høie
- Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Lyngby DK-2800, Denmark
| | - Alissa M Hummer
- Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
| | - Tobias H Olsen
- Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
| | | | - Morten Nielsen
- Section for Bioinformatics, Department of Health Technology, Technical University of Denmark, Lyngby DK-2800, Denmark
| | - Charlotte M Deane
- Department of Statistics, University of Oxford, Oxford OX1 3LB, United Kingdom
4
Zhong J, Zou Z, Qiu J, Wang S. ScFold: a GNN-based model for efficient inverse folding of short-chain proteins via spatial reduction. Brief Bioinform 2025; 26:bbaf156. [PMID: 40205854 PMCID: PMC11982017 DOI: 10.1093/bib/bbaf156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2024] [Revised: 02/24/2025] [Accepted: 03/19/2025] [Indexed: 04/11/2025]
Abstract
In the realm of protein design, the efficient construction of protein sequences that accurately fold into predefined structures has become an important area of research. Although advancements have been made in the study of long-chain proteins, the design of short-chain proteins requires equal consideration. The structural information inherent in short and single chains is typically less comprehensive than that of full-length chains, which can negatively impact design performance on them. To address this challenge, we introduce ScFold, a novel model that incorporates an innovative node module. This module utilizes spatial dimensionality reduction and positional encoding mechanisms to enhance the extraction of structural features. Experimental results indicate that ScFold achieves a recovery rate of 52.22% on the CATH4.2 dataset, demonstrating notable efficacy for short-chain proteins, with a recovery rate of 41.6%. ScFold also exhibits enhanced recovery rates of 59.32% and 61.59% on the TS50 and TS500 datasets, respectively, demonstrating its effectiveness across diverse protein types. In addition, we performed protein length stratification on the TS500 and CATH4.2 datasets and tested ScFold on length-specific sub-datasets. The results confirm the model's superiority in handling short-chain proteins. Finally, we selected several protein sequence groups from the CATH4.2 dataset for structural visualization analysis and provided comparisons between the model-generated sequences and the target sequences.
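The recovery rates quoted in this abstract are the usual inverse-folding metric: the share of positions at which the designed sequence reproduces the native residue. A minimal Python sketch of that metric follows, assuming two equal-length, already aligned sequences; the function name `sequence_recovery` and the example sequences are hypothetical illustrations and are not taken from ScFold or its benchmarks.

```python
def sequence_recovery(native: str, designed: str) -> float:
    """Return the fraction of positions where the designed residue matches the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have the same length")
    if not native:
        raise ValueError("sequences must be non-empty")
    # Count position-wise identities between the two sequences.
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)


# Hypothetical 10-residue example: positions 3 and 6 differ.
native = "MKTAYIAKQR"
designed = "MKSAYLAKQR"
recovery = sequence_recovery(native, designed)
print(f"{recovery:.1%}")  # prints "80.0%"
```

Benchmark figures are typically averages of this per-protein value over a test set; how ScFold aggregates them is not stated in the abstract.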
Affiliation(s)
- Jiancheng Zhong
- College of Information Science and Engineering, Hunan Normal University, 36 Lushan Road, Yuelu District, Changsha 410081, Hunan, China
| | - Zhiwei Zou
- College of Information Science and Engineering, Hunan Normal University, 36 Lushan Road, Yuelu District, Changsha 410081, Hunan, China
| | - Jie Qiu
- College of Information Science and Engineering, Hunan Normal University, 36 Lushan Road, Yuelu District, Changsha 410081, Hunan, China
| | - Shaokai Wang
- Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China
5
Turina P, Petrosino M, Enriquez Sandoval CA, Novak L, Pasquo A, Alexov E, Alladin MA, Ascher DB, Babbi G, Bakolitsa C, Casadio R, Cheng J, Fariselli P, Folkman L, Kamandula A, Katsonis P, Li M, Li D, Lichtarge O, Mahmud S, Martelli PL, Pal D, Panday SK, Pires DEV, Portelli S, Pucci F, Rodrigues CHM, Rooman M, Savojardo C, Schwersensky M, Shen Y, Strokach AV, Sun Y, Woo J, Radivojac P, Brenner SE, Chiaraluce R, Consalvi V, Capriotti E. Assessing the predicted impact of single amino acid substitutions in MAPK proteins for CAGI6 challenges. Hum Genet 2025; 144:265-280. [PMID: 39976676 PMCID: PMC11975483 DOI: 10.1007/s00439-024-02724-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/30/2024] [Accepted: 12/27/2024] [Indexed: 03/05/2025]
Abstract
New thermodynamic and functional studies have recently been conducted to evaluate the impact of amino acid substitutions on the Mitogen Activated Protein Kinases 1 and 3 (MAPK1/3). The Critical Assessment of Genome Interpretation (CAGI) data provider, at Sapienza University of Rome, measured the unfolding free energy and the enzymatic activity of a set of variants (MAPK challenge dataset). Thermodynamic measurements for the denaturant-induced equilibrium unfolding of the phosphorylated and unphosphorylated forms of the MAPKs were obtained by monitoring the far-UV circular dichroism and intrinsic fluorescence changes as a function of denaturant concentration. These values have been used to calculate the change in unfolding free energy between the variant and wild-type proteins at zero concentration of denaturant (ΔΔG(H2O)). The enzymatic activity of the phosphorylated MAPK variants was also measured using Chelation-Enhanced Fluorescence to monitor the phosphorylation of a peptide substrate. The MAPK challenge dataset, composed of a total of 23 single amino acid substitutions (11 and 12 for MAPK1 and MAPK3, respectively), was used to assess the effectiveness of the computational methods in predicting the ΔΔG(H2O) values associated with the variants and in categorizing them as destabilizing or not destabilizing. The data on the enzymatic activity of the MAPK mutants were used to assess the performance of the methods for predicting the functional impact of the variants. For the sixth edition of CAGI, thirteen independent research groups from four continents (Asia, Australia, Europe and North America) submitted more than 80 sets of predictions, obtained from different approaches. In this manuscript, we summarize the results of our assessment to highlight the possible limitations of the available algorithms.
Affiliation(s)
- Paola Turina
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Maria Petrosino
- Department of Biochemical Sciences "A. Rossi Fanelli", Sapienza University of Roma, 00185, Rome, Italy
| | | | - Leonore Novak
- Department of Biochemical Sciences "A. Rossi Fanelli", Sapienza University of Roma, 00185, Rome, Italy
| | - Alessandra Pasquo
- Diagnostics and Metrology Laboratory FSN-TECFIS-DIM, ENEA CR Frascati, 00044, Frascati, Italy
| | - Emil Alexov
- Department of Physics and Astronomy, Clemson University, Clemson, SC, 29634, USA
| | - Muttaqi Ahmad Alladin
- Department of Computational and Data Sciences, Indian Institute of Science, Bangaluru, 560012, India
| | - David B Ascher
- Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC, 3004, Australia
- School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, University of Queensland, St Lucia, QLD, 4072, Australia
| | - Giulia Babbi
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Constantina Bakolitsa
- Department of Plant and Microbial Biology and Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
| | - Rita Casadio
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Jianlin Cheng
- Department of Electrical Engineering and Computer Science, NextGen Precision Health Institute, University of Missouri, Columbia, MO, 65211, USA
| | - Piero Fariselli
- Department of Medical Sciences, University of Torino, 10126, Torino, Italy
| | - Lukas Folkman
- Institute for Integrated and Intelligent Systems, Griffith University, Southport, QLD, 4222, Australia
| | - Akash Kamandula
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, 02115, USA
| | - Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Minghui Li
- School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou, 215123, Jiangsu, China
| | - Dong Li
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050, Brussels, Belgium
| | - Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Sajid Mahmud
- Department of Electrical Engineering and Computer Science, NextGen Precision Health Institute, University of Missouri, Columbia, MO, 65211, USA
| | - Pier Luigi Martelli
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Debnath Pal
- Department of Computational and Data Sciences, Indian Institute of Science, Bangaluru, 560012, India
| | | | - Douglas E V Pires
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, 3053, Australia
| | - Stephanie Portelli
- Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC, 3004, Australia
- School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, University of Queensland, St Lucia, QLD, 4072, Australia
| | - Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050, Brussels, Belgium
| | - Carlos H M Rodrigues
- Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC, 3004, Australia
| | - Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050, Brussels, Belgium
| | - Castrense Savojardo
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Martin Schwersensky
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 1050, Brussels, Belgium
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
| | - Alexey V Strokach
- Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
| | - Yuanfei Sun
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
| | | | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, 02115, USA
| | - Steven E Brenner
- Department of Plant and Microbial Biology and Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
- Biophysics Graduate Group, University of California, Berkeley, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, 94720, USA
| | - Roberta Chiaraluce
- Department of Biochemical Sciences "A. Rossi Fanelli", Sapienza University of Roma, 00185, Rome, Italy.
| | - Valerio Consalvi
- Department of Biochemical Sciences "A. Rossi Fanelli", Sapienza University of Roma, 00185, Rome, Italy.
| | - Emidio Capriotti
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy.
- Computational Genomics Platform, IRCCS University Hospital of Bologna, 40138, Bologna, Italy.
6
Turina P, Dal Cortivo G, Enriquez Sandoval CA, Alexov E, Ascher DB, Babbi G, Bakolitsa C, Casadio R, Fariselli P, Folkman L, Kamandula A, Katsonis P, Li D, Lichtarge O, Martelli PL, Panday SK, Pires DEV, Portelli S, Pucci F, Rodrigues CHM, Rooman M, Savojardo C, Schwersensky M, Shen Y, Strokach AV, Sun Y, Woo J, Radivojac P, Brenner SE, Dell'Orco D, Capriotti E. Assessing the predicted impact of single amino acid substitutions in calmodulin for CAGI6 challenges. Hum Genet 2025; 144:113-125. [PMID: 39714488 PMCID: PMC11975486 DOI: 10.1007/s00439-024-02720-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2024] [Accepted: 12/02/2024] [Indexed: 12/24/2024]
Abstract
Recent thermodynamic and functional studies have been conducted to evaluate the impact of amino acid substitutions on Calmodulin (CaM). The Critical Assessment of Genome Interpretation (CAGI) data provider at University of Verona (Italy) measured the melting temperature (Tm) and the percentage of unfolding (%unfold) of a set of CaM variants (CaM challenge dataset). Thermodynamic measurements for the equilibrium unfolding of CaM were obtained by monitoring far-UV Circular Dichroism as a function of temperature. These measurements were used to determine the Tm and the percentage of protein remaining unfolded at the highest temperature. The CaM challenge dataset, comprising a total of 15 single amino acid substitutions, was used to evaluate the effectiveness of computational methods in predicting the Tm and unfolding percentages associated with the variants, and categorizing them as destabilizing or not. For the sixth edition of CAGI, nine independent research groups from four continents (Asia, Australia, Europe, and North America) submitted over 52 sets of predictions, derived from various approaches. In this manuscript, we summarize the results of our assessment to highlight the potential limitations of current algorithms and provide insights into the future development of more accurate prediction tools. By evaluating the thermodynamic stability of CaM variants, this study aims to enhance our understanding of the relationship between amino acid substitutions and protein stability, ultimately contributing to more accurate predictions of the effects of genetic variants.
Affiliation(s)
- Paola Turina
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Giuditta Dal Cortivo
- Department of Neurosciences, Biomedicine, and Movement Sciences, Section of Biological Chemistry, University of Verona, 37134, Verona, Italy
| | | | - Emil Alexov
- Department of Physics and Astronomy, Clemson University, Clemson, SC, 29634, USA
| | - David B Ascher
- Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC, 3004, Australia
- School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, University of Queensland, St Lucia, QLD, 4072, Australia
| | - Giulia Babbi
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Constantina Bakolitsa
- Department of Plant and Microbial Biology and Center for Computational Biology, University of California, Berkeley, CA, USA
| | - Rita Casadio
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Piero Fariselli
- Department of Medical Sciences, University of Torino, Turin, Italy
| | - Lukas Folkman
- Institute for Integrated and Intelligent Systems, Griffith University, Southport, QLD, Australia
| | - Akash Kamandula
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, 02115, USA
| | - Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Dong Li
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, USA
| | - Pier Luigi Martelli
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | | | - Douglas E V Pires
- School of Computing and Information Systems, The University of Melbourne, Melbourne, VIC, 3053, Australia
| | - Stephanie Portelli
- Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC, 3004, Australia
- School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, University of Queensland, St Lucia, QLD, 4072, Australia
| | - Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Carlos H M Rodrigues
- Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, VIC, 3004, Australia
| | - Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Castrense Savojardo
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Martin Schwersensky
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA
| | - Alexey V Strokach
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
| | - Yuanfei Sun
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA
| | | | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, 02115, USA
| | - Steven E Brenner
- Department of Plant and Microbial Biology and Center for Computational Biology, University of California, Berkeley, CA, USA
- Biophysics Graduate Group, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, CA, 94720, USA
| | - Daniele Dell'Orco
- Department of Neurosciences, Biomedicine, and Movement Sciences, Section of Biological Chemistry, University of Verona, 37134, Verona, Italy.
| | - Emidio Capriotti
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy.
- Computational Genomics Platform, IRCCS University Hospital of Bologna, 40138, Bologna, Italy.
7
Zhang J, Kinch L, Katsonis P, Lichtarge O, Jagota M, Song YS, Sun Y, Shen Y, Kuru N, Dereli O, Adebali O, Alladin MA, Pal D, Capriotti E, Turina MP, Savojardo C, Martelli PL, Babbi G, Casadio R, Pucci F, Rooman M, Cia G, Tsishyn M, Strokach A, Hu Z, van Loggerenberg W, Roth FP, Radivojac P, Brenner SE, Cong Q, Grishin NV. Assessing predictions on fitness effects of missense variants in HMBS in CAGI6. Hum Genet 2025; 144:173-189. [PMID: 39110250 PMCID: PMC12085147 DOI: 10.1007/s00439-024-02680-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2023] [Accepted: 05/17/2024] [Indexed: 02/21/2025]
Abstract
This paper presents an evaluation of predictions submitted for the "HMBS" challenge, a component of the sixth round of the Critical Assessment of Genome Interpretation held in 2021. The challenge required participants to predict the effects of missense variants of the human HMBS gene on yeast growth. The HMBS enzyme, critical for the biosynthesis of heme in eukaryotic cells, is highly conserved among eukaryotes. Despite the application of a variety of algorithms and methods, the performance of predictors was relatively similar, with Kendall's tau correlation coefficients between predictions and experimental scores around 0.3 for a majority of submissions. Notably, the median correlation (≥ 0.34) observed among these predictors, especially the top predictions from different groups, was greater than the correlation observed between their predictions and the actual experimental results. Most predictors were moderately successful in distinguishing between deleterious and benign variants, as evidenced by an area under the receiver operating characteristic (ROC) curve (AUC) of approximately 0.7. Compared with the two most recent rounds of CAGI competitions, we noticed more predictors outperformed the baseline predictor, which is solely based on the amino acid frequencies. Nevertheless, the overall accuracy of predictions is still far short of the positive control, which is derived from experimental scores, indicating the necessity for considerable improvements in the field. The most inaccurately predicted variants in this round were associated with the insertion loop, which is absent in many orthologs, suggesting the predictors still heavily rely on the information from multiple sequence alignments.
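The two evaluation measures named in this abstract, Kendall's tau between predicted and experimental scores and the ROC AUC for separating two variant classes, can be sketched in plain Python as below. This is an illustrative sketch only, not the assessors' code: it uses the tau-a variant (ties count as neither concordant nor discordant), the pairwise definition of AUC, and invented scores with a hypothetical 0.5 cutoff for calling a variant benign.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) pairs divided by all pairs."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random negative."""
    pos = [s for s, is_pos in zip(scores, labels) if is_pos]
    neg = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    # Tied scores contribute half a win.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Invented example values (not data from the HMBS challenge).
predicted = [0.9, 0.4, 0.7, 0.2, 0.6]
experimental = [1.0, 0.3, 0.5, 0.4, 0.8]
is_benign = [score >= 0.5 for score in experimental]  # hypothetical cutoff

tau = kendall_tau_a(predicted, experimental)
auc = roc_auc(predicted, is_benign)
print(f"tau = {tau:.2f}, AUC = {auc:.2f}")  # prints "tau = 0.60, AUC = 1.00"
```

Real assessments with many tied scores would more likely use tau-b, which corrects the denominator for ties; the abstract does not say which variant was applied.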
Affiliation(s)
- Jing Zhang
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Lisa Kinch
- Howard Hughes Medical Institute, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
- Department of Molecular Biology, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA
| | - Panagiotis Katsonis
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Olivier Lichtarge
- Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, 77030, USA
| | - Milind Jagota
- Computer Science Division, University of California, Berkeley, CA, 94720, USA
| | - Yun S Song
- Computer Science Division, University of California, Berkeley, CA, 94720, USA
- Department of Statistics, University of California, Berkeley, Berkeley, CA, 94720, USA
| | - Yuanfei Sun
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
| | - Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, 77843, USA
| | - Nurdan Kuru
- Faculty of Engineering and Natural Sciences, Sabanci University, Tuzla, Turkey
| | - Onur Dereli
- Faculty of Engineering and Natural Sciences, Sabanci University, Tuzla, Turkey
| | - Ogun Adebali
- Faculty of Engineering and Natural Sciences, Sabanci University, Tuzla, Turkey
| | - Muttaqi Ahmad Alladin
- Department of Computational and Data Sciences, Indian Institute of Science, Bangaluru, 560012, India
| | - Debnath Pal
- Department of Computational and Data Sciences, Indian Institute of Science, Bangaluru, 560012, India
| | - Emidio Capriotti
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Maria Paola Turina
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Castrense Savojardo
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Pier Luigi Martelli
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Giulia Babbi
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Rita Casadio
- Department of Pharmacy and Biotechnology, University of Bologna, Via Selmi 3, 40126, Bologna, Italy
| | - Fabrizio Pucci
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Marianne Rooman
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Gabriel Cia
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Matsvei Tsishyn
- Computational Biology and Bioinformatics, Université Libre de Bruxelles, 50 Roosevelt Ave, 1050, Brussels, Belgium
| | - Alexey Strokach
- Department of Computer Science, University of Toronto, Toronto, ON, M5S 2E4, Canada
| | - Zhiqiang Hu
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, 94720, USA
| | - Warren van Loggerenberg
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15213, USA
- Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, M5G 1X5, Canada
| | - Frederick P Roth
- Department of Computational and Systems Biology, University of Pittsburgh School of Medicine, Pittsburgh, PA, 15213, USA
- Donnelly Centre, University of Toronto, Toronto, ON, M5S 3E1, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Lunenfeld-Tanenbaum Research Institute, Sinai Health, Toronto, ON, M5G 1X5, Canada
| | - Predrag Radivojac
- Khoury College of Computer Sciences, Northeastern University, Boston, MA, 02115, USA
| | - Steven E Brenner
- Department of Plant and Microbial Biology, University of California, Berkeley, CA, 94720, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, 94720, USA
- Biophysics Graduate Group, University of California, Berkeley, Berkeley, CA, 94720, USA
| | - Qian Cong
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Eugene McDermott Center for Human Growth and Development, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Harold C. Simmons Comprehensive Cancer Center, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
| | - Nick V Grishin
- Department of Biophysics, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
- Department of Biochemistry, University of Texas Southwestern Medical Center, Dallas, TX, 75390, USA.
|
8
|
Dong Z, Jin J, Xiao Y, Xiao B, Wang S, Liu X, Zhu E. Subgraph Propagation and Contrastive Calibration for Incomplete Multiview Data Clustering. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS 2025; 36:3218-3230. [PMID: 38236668 DOI: 10.1109/tnnls.2024.3350671] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 03/06/2025]
Abstract
The success of multiview raw data mining relies on the integrity of attributes. However, each view faces various noises and collection failures, which leads to a condition that attributes are only partially available. To make matters worse, the attributes in multiview raw data are composed of multiple forms, which makes it more difficult to explore the structure of the data especially in multiview clustering task. Due to the missing data in some views, the clustering task on incomplete multiview data confronts the following challenges, namely: 1) mining the topology of missing data in multiview is an urgent problem to be solved; 2) most approaches do not calibrate the complemented representations with common information of multiple views; and 3) we discover that the cluster distributions obtained from incomplete views have a cluster distribution unaligned problem (CDUP) in the latent space. To solve the above issues, we propose a deep clustering framework based on subgraph propagation and contrastive calibration (SPCC) for incomplete multiview raw data. First, the global structural graph is reconstructed by propagating the subgraphs generated by the complete data of each view. Then, the missing views are completed and calibrated under the guidance of the global structural graph and contrast learning between views. In the latent space, we assume that different views have a common cluster representation in the same dimension. However, in the unsupervised condition, the fact that the cluster distributions of different views do not correspond affects the information completion process to use information from other views. Finally, the complemented cluster distributions for different views are aligned by contrastive learning (CL), thus solving the CDUP in the latent space. Our method achieves advanced performance on six benchmarks, which validates the effectiveness and superiority of our SPCC.
|
9
|
Hui WH, Chen YL, Chang SW. GraphLOGIC: Lethality prediction of osteogenesis imperfecta on type I collagen by a mechanics-informed graph neural network. Int J Biol Macromol 2025; 291:139001. [PMID: 39706395 DOI: 10.1016/j.ijbiomac.2024.139001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/11/2024] [Revised: 12/16/2024] [Accepted: 12/17/2024] [Indexed: 12/23/2024]
Abstract
Collagen plays a crucial role in human bodies and has a significant presence in connective tissues. As such, the impact of collagen mutations can be devastating. Osteogenesis imperfecta (OI), a rare genetic disease affecting 1 in every 15,000 to 20,000 people, is one such example characterized by brittle bones. Severe cases of OI could lead to prenatal death. Previous studies have provided insights into the impact of mutations on collagen molecules and predictions of lethality. However, these discussions have focused mainly on mutations in the α1 chain, and some mutation types exhibit poor predictive performance. Coverage of α2 mutations is also limited. We propose a method to predict the risk of lethality for OI-inducing mutations, where a novel mechanics-informed graph representation of the collagen fibril is proposed based on full atomistic simulations to encode sequential and structural information. The method demonstrated improved accuracy in predicting the risk of lethality associated with mutations occurring on both α1 and α2 chains. We also found a correlation between the sequences and the predicted OI lethality with the use of a variant of the Grad-CAM technique, where the results agree well with previous studies. Our findings provide insights into the molecular mechanism of collagen on OI lethality.
Affiliation(s)
- Wei-Han Hui
- Department of Civil Engineering, National Taiwan University, Taipei 106, Taiwan
| | - Yen-Lin Chen
- Department of Civil Engineering, National Taiwan University, Taipei 106, Taiwan
| | - Shu-Wei Chang
- Department of Civil Engineering, National Taiwan University, Taipei 106, Taiwan; Department of Biomedical Engineering, National Taiwan University, Taipei 106, Taiwan.
|
10
|
Sun J, Zhu T, Cui Y, Wu B. Structure-based self-supervised learning enables ultrafast protein stability prediction upon mutation. Innovation (N Y) 2025; 6:100750. [PMID: 39872490 PMCID: PMC11763918 DOI: 10.1016/j.xinn.2024.100750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Accepted: 12/02/2024] [Indexed: 01/30/2025] Open
Abstract
Predicting free energy changes (ΔΔG) is essential for enhancing our understanding of protein evolution and plays a pivotal role in protein engineering and pharmaceutical development. While traditional methods offer valuable insights, they are often constrained by computational speed and reliance on biased training datasets. These constraints become particularly evident when aiming for accurate ΔΔG predictions across a diverse array of protein sequences. Herein, we introduce Pythia, a self-supervised graph neural network specifically designed for zero-shot ΔΔG predictions. Our comparative benchmarks demonstrate that Pythia outperforms other self-supervised pretraining models and force field-based approaches while also exhibiting competitive performance with fully supervised models. Notably, Pythia shows strong correlations and achieves a remarkable increase in computational speed of up to 10⁵-fold. We further validated Pythia's performance in predicting the thermostabilizing mutations of limonene epoxide hydrolase, leading to higher experimental success rates. This exceptional efficiency has enabled us to explore 26 million high-quality protein structures, marking a significant advancement in our ability to navigate the protein sequence space and enhance our understanding of the relationships between protein genotype and phenotype. In addition, we established a web server at https://pythia.wulab.xyz to allow users to easily perform such predictions.
Affiliation(s)
- Jinyuan Sun
- AIM Center, College of Life Sciences and Technology, Beijing University of Chemical Technology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Tong Zhu
- AIM Center, College of Life Sciences and Technology, Beijing University of Chemical Technology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Yinglu Cui
- AIM Center, College of Life Sciences and Technology, Beijing University of Chemical Technology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Bian Wu
- AIM Center, College of Life Sciences and Technology, Beijing University of Chemical Technology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
|
11
|
Bochtler M. How the technologies behind self-driving cars, social networks, ChatGPT, and DALL-E2 are changing structural biology. Bioessays 2025; 47:e2400155. [PMID: 39404756 DOI: 10.1002/bies.202400155] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2024] [Revised: 09/08/2024] [Accepted: 09/26/2024] [Indexed: 12/22/2024]
Abstract
The performance of deep Neural Networks (NNs) in the text (ChatGPT) and image (DALL-E2) domains has attracted worldwide attention. Convolutional NNs (CNNs), Large Language Models (LLMs), Denoising Diffusion Probabilistic Models (DDPMs)/Noise Conditional Score Networks (NCSNs), and Graph NNs (GNNs) have impacted computer vision, language editing and translation, automated conversation, image generation, and social network management. Proteins can be viewed as texts written with the alphabet of amino acids, as images, or as graphs of interacting residues. Each of these perspectives suggests the use of tools from a different area of deep learning for protein structural biology. Here, I review how CNNs, LLMs, DDPMs/NCSNs, and GNNs have led to major advances in protein structure prediction, inverse folding, protein design, and small molecule design. This review is primarily intended as a deep learning primer for practicing experimental structural biologists. However, extensive references to the deep learning literature should also make it relevant to readers who have a background in machine learning, physics or statistics, and an interest in protein structural biology.
Affiliation(s)
- Matthias Bochtler
- International Institute of Molecular and Cell Biology in Warsaw, Warsaw, Poland
- Institute of Biochemistry and Biophysics, Warsaw, Poland
|
12
|
Cohn R, Holm EA. Graph convolutional network for predicting abnormal grain growth in Monte Carlo simulations of microstructural evolution. Sci Rep 2024; 14:30259. [PMID: 39632876 PMCID: PMC11618464 DOI: 10.1038/s41598-024-81349-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2024] [Accepted: 11/26/2024] [Indexed: 12/07/2024] Open
Abstract
Recent developments in graph neural networks show promise for predicting the occurrence of abnormal grain growth, which has been a particularly challenging area of research due to its apparent stochastic nature. In this study, we generate a large dataset of Monte Carlo simulations of abnormal grain growth. We train simple graph convolution networks to predict which initial microstructures will exhibit abnormal grain growth, and compare the results to a standard computer vision approach for the same task. The graph neural network outperformed the computer vision method and achieved 73% prediction accuracy and fewer false positives. It also provided some physical insight into feature importance and the relevant length scale required to maximize predictive performance. Analysis of the uncertainty in the Monte Carlo simulations provides additional insights for ongoing work in this area.
Affiliation(s)
- Ryan Cohn
- Department of Materials Science and Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA, USA
| | - Elizabeth A Holm
- Department of Materials Science and Engineering, University of Michigan, 500 S State St, Ann Arbor, MI, USA.
|
13
|
Soleymani F, Paquet E, Viktor HL, Michalowski W. Structure-based protein and small molecule generation using EGNN and diffusion models: A comprehensive review. Comput Struct Biotechnol J 2024; 23:2779-2797. [PMID: 39050782 PMCID: PMC11268121 DOI: 10.1016/j.csbj.2024.06.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/19/2024] [Revised: 06/13/2024] [Accepted: 06/18/2024] [Indexed: 07/27/2024] Open
Abstract
Recent breakthroughs in deep learning have revolutionized protein sequence and structure prediction. These advancements are built on decades of protein design efforts, and are overcoming traditional time and cost limitations. Diffusion models, at the forefront of these innovations, significantly enhance design efficiency by automating knowledge acquisition. In the field of de novo protein design, the goal is to create entirely novel proteins with predetermined structures. Given the arbitrary positions of proteins in 3-D space, graph representations and their properties are widely used in protein generation studies. A critical requirement in protein modelling is maintaining spatial relationships under transformations (rotations, translations, and reflections). This property, known as equivariance, ensures that predicted protein characteristics adapt seamlessly to changes in orientation or position. Equivariant graph neural networks offer a solution to this challenge. By incorporating equivariant graph neural networks to learn the score of the probability density function in diffusion models, one can generate proteins with robust 3-D structural representations. This review examines the latest deep learning advancements, specifically focusing on frameworks that combine diffusion models with equivariant graph neural networks for protein generation.
Affiliation(s)
- Farzan Soleymani
- Telfer School of Management, University of Ottawa, ON, K1N 6N5, Canada
| | - Eric Paquet
- National Research Council, 1200 Montreal Road, Ottawa, ON, K1A 0R6, Canada
- School of Electrical Engineering and Computer Science, University of Ottawa, ON, K1N 6N5, Canada
| | - Herna Lydia Viktor
- School of Electrical Engineering and Computer Science, University of Ottawa, ON, K1N 6N5, Canada
|
14
|
Jiao Z, Liu Y, Wang Z. Application of graph neural network in computational heterogeneous catalysis. J Chem Phys 2024; 161:171001. [PMID: 39484893 DOI: 10.1063/5.0227821] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2024] [Accepted: 10/11/2024] [Indexed: 11/03/2024] Open
Abstract
Heterogeneous catalysis, as a key technology in modern chemical industries, plays a vital role in social progress and economic development. However, its complex reaction process poses challenges to theoretical research. Graph neural networks (GNNs) are gradually becoming a key tool in this field as they can intrinsically learn atomic representation and consider connection relationship, making them naturally applicable to atomic and molecular systems. This article introduces the basic principles, current network architectures, and datasets of GNNs and reviews the application of GNN in heterogeneous catalysis from accelerating the materials screening and exploring the potential energy surface. In the end, we summarize the main challenges and potential application prospects of GNNs in future research endeavors.
Affiliation(s)
- Zihao Jiao
- International Research Center for Renewable Energy, State Key Laboratory of Multiphase Flow in Power Engineering, Xi'an Jiaotong University, Shaanxi 710049, China
- School of Chemical Sciences, University of Auckland, Auckland 1010, New Zealand
| | - Ya Liu
- International Research Center for Renewable Energy, State Key Laboratory of Multiphase Flow in Power Engineering, Xi'an Jiaotong University, Shaanxi 710049, China
| | - Ziyun Wang
- School of Chemical Sciences, University of Auckland, Auckland 1010, New Zealand
|
15
|
Liu J, Guo Z, You H, Zhang C, Lai L. All-Atom Protein Sequence Design Based on Geometric Deep Learning. Angew Chem Int Ed Engl 2024:e202411461. [PMID: 39295564 DOI: 10.1002/anie.202411461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2024] [Revised: 09/09/2024] [Accepted: 09/18/2024] [Indexed: 09/21/2024]
Abstract
Designing sequences for specific protein backbones is a key step in creating new functional proteins. Here, we introduce GeoSeqBuilder, a deep learning framework that integrates protein sequence generation with side chain conformation prediction to produce the complete all-atom structures for designed sequences. GeoSeqBuilder uses spatial geometric features from protein backbones and explicitly includes three-body interactions of neighboring residues. GeoSeqBuilder achieves native residue type recovery rate of 51.6 %, comparable to ProteinMPNN and other leading methods, while accurately predicting side chain conformations. We first used GeoSeqBuilder to design sequences for thioredoxin and a hallucinated three-helical bundle protein. All the 15 tested sequences expressed as soluble monomeric proteins with high thermal stability, and the 2 high-resolution crystal structures solved closely match the designed models. The generated protein sequences exhibit low similarity (minimum 23 %) to the original sequences, with significantly altered hydrophobic cores. We further redesigned the hydrophobic core of glutathione peroxidase 4, and 3 of the 5 designs showed improved enzyme activity. Although further testing is needed, the high experimental success rate in our testing demonstrates that GeoSeqBuilder is a powerful tool for designing novel sequences for predefined protein structures with atomic details. GeoSeqBuilder is available at https://github.com/PKUliujl/GeoSeqBuilder.
Affiliation(s)
- Jiale Liu
- Center for Life Sciences Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
| | - Zheng Guo
- Center for Life Sciences Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
| | - Hantian You
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, China
| | - Changsheng Zhang
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, China
| | - Luhua Lai
- Center for Life Sciences Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, 100871, China
- Center for Quantitative Biology Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Chengdu Academy for Advanced Interdisciplinary Biotechnologies, Peking University, Chengdu, 510100, Sichuan, China
|
16
|
Satalkar V, Degaga GD, Li W, Pang YT, McShan AC, Gumbart JC, Mitchell JC, Torres MP. Generative β-hairpin design using a residue-based physicochemical property landscape. Biophys J 2024; 123:2790-2806. [PMID: 38297834 PMCID: PMC11393682 DOI: 10.1016/j.bpj.2024.01.029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/23/2023] [Revised: 12/20/2023] [Accepted: 01/25/2024] [Indexed: 02/02/2024] Open
Abstract
De novo peptide design is a new frontier that has broad application potential in the biological and biomedical fields. Most existing models for de novo peptide design are largely based on sequence homology that can be restricted based on evolutionarily derived protein sequences and lack the physicochemical context essential in protein folding. Generative machine learning for de novo peptide design is a promising way to synthesize theoretical data that are based on, but unique from, the observable universe. In this study, we created and tested a custom peptide generative adversarial network intended to design peptide sequences that can fold into the β-hairpin secondary structure. This deep neural network model is designed to establish a preliminary foundation of the generative approach based on physicochemical and conformational properties of 20 canonical amino acids, for example, hydrophobicity and residue volume, using extant structure-specific sequence data from the PDB. The beta generative adversarial network model robustly distinguishes secondary structures of β hairpin from α helix and intrinsically disordered peptides with an accuracy of up to 96% and generates artificial β-hairpin peptide sequences with minimum sequence identities around 31% and 50% when compared against the current NCBI PDB and nonredundant databases, respectively. These results highlight the potential of generative models specifically anchored by physicochemical and conformational property features of amino acids to expand the sequence-to-structure landscape of proteins beyond evolutionary limits.
Affiliation(s)
- Vardhan Satalkar
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia
| | - Gemechis D Degaga
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee
| | - Wei Li
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia
| | - Yui Tik Pang
- School of Physics, Georgia Institute of Technology, Atlanta, Georgia
| | - Andrew C McShan
- School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
| | - James C Gumbart
- School of Physics, Georgia Institute of Technology, Atlanta, Georgia; School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia
| | - Julie C Mitchell
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee.
| | - Matthew P Torres
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia; School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, Georgia.
|
17
|
Ghafarollahi A, Buehler MJ. ProtAgents: protein discovery via large language model multi-agent collaborations combining physics and machine learning. DIGITAL DISCOVERY 2024; 3:1389-1409. [PMID: 38993729 PMCID: PMC11235180 DOI: 10.1039/d4dd00013g] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2024] [Accepted: 05/13/2024] [Indexed: 07/13/2024]
Abstract
Designing de novo proteins beyond those found in nature holds significant promise for advancements in both scientific and engineering applications. Current methodologies for protein design often rely on AI-based models, such as surrogate models that address end-to-end problems by linking protein structure to material properties or vice versa. However, these models frequently focus on specific material objectives or structural properties, limiting their flexibility when incorporating out-of-domain knowledge into the design process or comprehensive data analysis is required. In this study, we introduce ProtAgents, a platform for de novo protein design based on Large Language Models (LLMs), where multiple AI agents with distinct capabilities collaboratively address complex tasks within a dynamic environment. The versatility in agent development allows for expertise in diverse domains, including knowledge retrieval, protein structure analysis, physics-based simulations, and results analysis. The dynamic collaboration between agents, empowered by LLMs, provides a versatile approach to tackling protein design and analysis problems, as demonstrated through diverse examples in this study. The problems of interest encompass designing new proteins, analyzing protein structures and obtaining new first-principles data - natural vibrational frequencies - via physics simulations. The concerted effort of the system allows for powerful automated and synergistic design of de novo proteins with targeted mechanical properties. The flexibility in designing the agents, on one hand, and their capacity in autonomous collaboration through the dynamic LLM-based multi-agent environment on the other hand, unleashes great potentials of LLMs in addressing multi-objective materials problems and opens up new avenues for autonomous materials discovery and design.
Affiliation(s)
- Alireza Ghafarollahi
- Laboratory for Atomistic and Molecular Mechanics (LAMM), Massachusetts Institute of Technology 77 Massachusetts Ave. Cambridge MA 02139 USA
| | - Markus J Buehler
- Laboratory for Atomistic and Molecular Mechanics (LAMM), Massachusetts Institute of Technology 77 Massachusetts Ave. Cambridge MA 02139 USA
- Center for Computational Science and Engineering, Schwarzman College of Computing, Massachusetts Institute of Technology 77 Massachusetts Ave. Cambridge MA 02139 USA
|
18
|
Chu SKS, Narang K, Siegel JB. Protein stability prediction by fine-tuning a protein language model on a mega-scale dataset. PLoS Comput Biol 2024; 20:e1012248. [PMID: 39038042 PMCID: PMC11293664 DOI: 10.1371/journal.pcbi.1012248] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2023] [Revised: 08/01/2024] [Accepted: 06/13/2024] [Indexed: 07/24/2024] Open
Abstract
Protein stability plays a crucial role in a variety of applications, such as food processing, therapeutics, and the identification of pathogenic mutations. Engineering campaigns commonly seek to improve protein stability, and there is a strong interest in streamlining these processes to enable rapid optimization of highly stabilized proteins with fewer iterations. In this work, we explore utilizing a mega-scale dataset to develop a protein language model optimized for stability prediction. ESMtherm is trained on the folding stability of 528k natural and de novo sequences derived from 461 protein domains and can accommodate deletions, insertions, and multiple-point mutations. We show that a protein language model can be fine-tuned to predict folding stability. ESMtherm performs reasonably on small protein domains and generalizes to sequences distal from the training set. Lastly, we discuss our model's limitations compared to other state-of-the-art methods in generalizing to larger protein scaffolds. Our results highlight the need for large-scale stability measurements on a diverse dataset that mirrors the distribution of sequence lengths commonly observed in nature.
Affiliation(s)
- Simon K. S. Chu
- Biophysics Graduate Program, University of California Davis, Davis, California, United States of America
| | - Kush Narang
- College of Biological Sciences, University of California Davis, Davis, California, United States of America
| | - Justin B. Siegel
- Genome Center, University of California Davis, Davis, California, United States of America
- Department of Chemistry, University of California Davis, Davis, California, United States of America
- Department of Biochemistry and Molecular Medicine, University of California Davis, Davis, California, United States of America
|
19
|
Gurusinghe SNS, Wu Y, DeGrado W, Shifman JM. ProBASS - a language model with sequence and structural features for predicting the effect of mutations on binding affinity. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.06.21.600041. [PMID: 38979193 PMCID: PMC11230163 DOI: 10.1101/2024.06.21.600041] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/10/2024]
Abstract
Protein-protein interactions (PPIs) govern virtually all cellular processes. Even a single mutation within PPI can significantly influence overall protein functionality and potentially lead to various types of diseases. To date, numerous approaches have emerged for predicting the change in free energy of binding (ΔΔGbind) resulting from mutations, yet the majority of these methods lack precision. In recent years, protein language models (PLMs) have been developed and shown powerful predictive capabilities by leveraging both sequence and structural data from protein-protein complexes. Yet, PLMs have not been optimized specifically for predicting ΔΔGbind. We developed an approach to predict effects of mutations on PPI binding affinity based on two most advanced protein language models ESM2 and ESM-IF1 that incorporate PPI sequence and structural features, respectively. We used the two models to generate embeddings for each PPI mutant and subsequently fine-tuned our model by training on a large dataset of experimental ΔΔGbind values. Our model, ProBASS (Protein Binding Affinity from Structure and Sequence) achieved a correlation with experimental ΔΔGbind values of 0.83 ± 0.05 for single mutations and 0.69 ± 0.04 for double mutations when model training and testing was done on the same PDB. Moreover, ProBASS exhibited very high correlation (0.81 ± 0.02) between prediction and experiment when training and testing was performed on a dataset containing 2325 single mutations in 132 PPIs. ProBASS surpasses the state-of-the-art methods in correlation with experimental data and could be further trained as more experimental data becomes available. Our results demonstrate that the integration of extensive datasets containing ΔΔGbind values across multiple PPIs to refine the pre-trained PLMs represents a successful approach for achieving a precise and broadly applicable model for ΔΔGbind prediction, greatly facilitating future protein engineering and design studies.
Affiliation(s)
- Sagara N S Gurusinghe
  - Department of Biological Chemistry, The Alexander Silberman Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
- Yibing Wu
  - Department of Pharmaceutical Chemistry, School of Pharmacy, University of California San Francisco, CA, USA
- William DeGrado
  - Department of Pharmaceutical Chemistry, School of Pharmacy, University of California San Francisco, CA, USA
- Julia M Shifman
  - Department of Biological Chemistry, The Alexander Silberman Institute of Life Sciences, The Hebrew University of Jerusalem, Jerusalem, Israel
20
Lv S, Dong J, Wang C, Wang X, Bao Z. RB-GAT: A Text Classification Model Based on RoBERTa-BiGRU with Graph ATtention Network. SENSORS (BASEL, SWITZERLAND) 2024; 24:3365. [PMID: 38894157 PMCID: PMC11175149 DOI: 10.3390/s24113365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2024] [Revised: 05/17/2024] [Accepted: 05/21/2024] [Indexed: 06/21/2024]
Abstract
With the development of deep learning, several graph neural network (GNN)-based approaches have been utilized for text classification. However, GNNs encounter challenges when capturing contextual text information within a document sequence. To address this, a novel text classification model, RB-GAT, is proposed by combining RoBERTa-BiGRU embedding and a multi-head Graph ATtention Network (GAT). First, the pre-trained RoBERTa model is exploited to learn word and text embeddings in different contexts. Second, the Bidirectional Gated Recurrent Unit (BiGRU) is employed to capture long-term dependencies and bidirectional sentence information from the text context. Next, the multi-head graph attention network is applied to analyze this information, which serves as a node feature for the document. Finally, the classification results are generated through a Softmax layer. Experimental results on five benchmark datasets demonstrate that our method can achieve an accuracy of 71.48%, 98.45%, 80.32%, 90.84%, and 95.67% on Ohsumed, R8, MR, 20NG and R52, respectively, which is superior to the existing nine text classification approaches.
Affiliation(s)
- Shaoqing Lv
  - School of Communication and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
  - Shaanxi Key Laboratory of Information Communication Network and Security, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
- Jungang Dong
  - School of Communication and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
- Chichi Wang
  - School of Communication and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
- Xuanhong Wang
  - School of Communication and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
  - Shaanxi Key Laboratory of Information Communication Network and Security, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
- Zhiqiang Bao
  - School of Communication and Information Engineering, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
  - Shaanxi Key Laboratory of Information Communication Network and Security, Xi’an University of Posts and Telecommunications, Xi’an 710121, China
21
Tang X, Dai H, Knight E, Wu F, Li Y, Li T, Gerstein M. A survey of generative AI for de novo drug design: new frontiers in molecule and protein generation. Brief Bioinform 2024; 25:bbae338. [PMID: 39007594 PMCID: PMC11247410 DOI: 10.1093/bib/bbae338] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2024] [Revised: 05/21/2024] [Accepted: 06/27/2024] [Indexed: 07/16/2024]
Abstract
Artificial intelligence (AI)-driven methods can vastly improve the historically costly drug design process, with various generative models already in widespread use. Generative models for de novo drug design, in particular, focus on the creation of novel biological compounds entirely from scratch, representing a promising future direction. Rapid development in the field, combined with the inherent complexity of the drug design process, creates a difficult landscape for new researchers to enter. In this survey, we organize de novo drug design into two overarching themes: small molecule and protein generation. Within each theme, we identify a variety of subtasks and applications, highlighting important datasets, benchmarks, and model architectures and comparing the performance of top models. We take a broad approach to AI-driven drug design, allowing for both micro-level comparisons of various methods within each subtask and macro-level observations across different fields. We discuss parallel challenges and approaches between the two applications and highlight future directions for AI-driven de novo drug design as a whole. An organized repository of all covered sources is available at https://github.com/gersteinlab/GenAI4Drug.
Affiliation(s)
- Xiangru Tang
  - Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Howard Dai
  - Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Elizabeth Knight
  - School of Medicine, Yale University, New Haven, CT 06520, United States
- Fang Wu
  - Computer Science Department, Stanford University, CA 94305, United States
- Yunyang Li
  - Department of Computer Science, Yale University, New Haven, CT 06520, United States
- Tianxiao Li
  - Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States
- Mark Gerstein
  - Department of Computer Science, Yale University, New Haven, CT 06520, United States
  - Program in Computational Biology & Bioinformatics, Yale University, New Haven, CT 06520, United States
  - Department of Statistics & Data Science, Yale University, New Haven, CT 06520, United States
  - Department of Biomedical Informatics & Data Science, Yale University, New Haven, CT 06520, United States
  - Department of Molecular Biophysics & Biochemistry, Yale University, New Haven, CT 06520, United States
22
Song Y, Wang F, Chen L, Zhang W. Engineering Fatty Acid Biosynthesis in Microalgae: Recent Progress and Perspectives. Mar Drugs 2024; 22:216. [PMID: 38786607 PMCID: PMC11122798 DOI: 10.3390/md22050216] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Revised: 05/06/2024] [Accepted: 05/07/2024] [Indexed: 05/25/2024]
Abstract
Microalgal lipids hold significant potential for the production of biodiesel and dietary supplements. To enhance their cost-effectiveness and commercial competitiveness, it is imperative to improve microalgal lipid productivity. Metabolic engineering that targets the key enzymes of the fatty acid synthesis pathway, along with transcription factor engineering, are effective strategies for improving lipid productivity in microalgae. This review provides a summary of the advancements made in the past 5 years in engineering the fatty acid biosynthetic pathway in eukaryotic microalgae. Furthermore, this review offers insights into transcriptional regulatory mechanisms and transcription factor engineering aimed at enhancing lipid production in eukaryotic microalgae. Finally, the review discusses the challenges and future perspectives associated with utilizing microalgae for the efficient production of lipids.
Affiliation(s)
- Yanhui Song
  - Laboratory of Synthetic Microbiology, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300350, China
  - Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin 300350, China
- Fangzhong Wang
  - Laboratory of Synthetic Microbiology, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300350, China
  - Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin 300350, China
  - Center for Biosafety Research and Strategy, Tianjin University, Tianjin 300072, China
- Lei Chen
  - Laboratory of Synthetic Microbiology, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300350, China
  - Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin 300350, China
- Weiwen Zhang
  - Laboratory of Synthetic Microbiology, School of Chemical Engineering and Technology, Tianjin University, Tianjin 300350, China
  - Key Laboratory of Systems Bioengineering (Ministry of Education), Tianjin University, Tianjin 300350, China
  - Center for Biosafety Research and Strategy, Tianjin University, Tianjin 300072, China
23
Song C, Zhang L. Intelligent Design of Antithrombotic Peptide Targeting Collagen. LANGMUIR : THE ACS JOURNAL OF SURFACES AND COLLOIDS 2024; 40:9661-9668. [PMID: 38664943 DOI: 10.1021/acs.langmuir.4c00543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]
Abstract
Binding of blood components to collagen was proved to be a key step in thrombus formation. Intelligent Design of Protein Matcher (IDProMat), a neural network model, was then developed based on the principle of seq2seq to design an antithrombotic peptide targeting collagen. The encoding and decoding of peptide sequence data and the interaction patterns of peptide chains at the interface were studied, and then, IDProMat was applied to the design of peptides to cover collagen. The 99.3% decrease in seq2seq loss and 58.3% decrease in MLP loss demonstrated that IDProMat learned the interaction patterns between residues at the binding interface. An efficient peptide, LRWNSYY, was then designed using this model. Validations on its binding on collagen and its inhibition of platelet adhesion were obtained using docking, MD simulations, and experimental approaches.
Affiliation(s)
- Changwei Song
  - Department of Biochemical Engineering and Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (MOE), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300350, People's Republic of China
- Lin Zhang
  - Department of Biochemical Engineering and Frontiers Science Center for Synthetic Biology and Key Laboratory of Systems Bioengineering (MOE), School of Chemical Engineering and Technology, Tianjin University, Tianjin 300350, People's Republic of China
24
Kim Y, Wang K, Lock RI, Nash TR, Fleischer S, Wang BZ, Fine BM, Vunjak-Novakovic G. BeatProfiler: Multimodal In Vitro Analysis of Cardiac Function Enables Machine Learning Classification of Diseases and Drugs. IEEE OPEN JOURNAL OF ENGINEERING IN MEDICINE AND BIOLOGY 2024; 5:238-249. [PMID: 38606403 PMCID: PMC11008807 DOI: 10.1109/ojemb.2024.3377461] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2023] [Revised: 02/13/2024] [Accepted: 03/10/2024] [Indexed: 04/13/2024]
Abstract
Goal: Contractile response and calcium handling are central to understanding cardiac function and physiology, yet existing methods of analysis to quantify these metrics are often time-consuming, prone to mistakes, or require specialized equipment/license. We developed BeatProfiler, a suite of cardiac analysis tools designed to quantify contractile function, calcium handling, and force generation for multiple in vitro cardiac models and apply downstream machine learning methods for deep phenotyping and classification. Methods: We first validate BeatProfiler's accuracy, robustness, and speed by benchmarking against existing tools with a fixed dataset. We further confirm its ability to robustly characterize disease and dose-dependent drug response. We then demonstrate that the data acquired by our automatic acquisition pipeline can be further harnessed for machine learning (ML) analysis to phenotype a disease model of restrictive cardiomyopathy and profile cardioactive drug functional response. To accurately classify between these biological signals, we apply feature-based ML and deep learning models (temporal convolutional-bidirectional long short-term memory model or TCN-BiLSTM). Results: Benchmarking against existing tools revealed that BeatProfiler detected and analyzed contraction and calcium signals better than existing tools through improved sensitivity in low signal data, reduction in false positives, and analysis speed increase by 7 to 50-fold. Of signals accurately detected by published methods (PMs), BeatProfiler's extracted features showed high correlations to PMs, confirming that it is reliable and consistent with PMs. The features extracted by BeatProfiler classified restrictive cardiomyopathy cardiomyocytes from isogenic healthy controls with 98% accuracy and identified relax90 as a top distinguishing feature in congruence with previous findings. We also show that our TCN-BiLSTM model was able to classify drug-free control and 4 cardiac drugs with different mechanisms of action at 96% accuracy. We further apply Grad-CAM on our convolution-based models to identify signature regions of perturbations by these drugs in calcium signals. Conclusions: We anticipate that the capabilities of BeatProfiler will help advance in vitro studies in cardiac biology through rapid phenotyping, revealing mechanisms underlying cardiac health and disease, and enabling objective classification of cardiac disease and responses to drugs.
Affiliation(s)
- Youngbin Kim
  - Department of Biomedical Engineering, Columbia University, New York, NY 10032, USA
- Kunlun Wang
  - Department of Biomedical Engineering, Columbia University, New York, NY 10032, USA
- Roberta I. Lock
  - Department of Biomedical Engineering, Columbia University, New York, NY 10032, USA
- Trevor R. Nash
  - Department of Biomedical Engineering, Columbia University, New York, NY 10032, USA
- Sharon Fleischer
  - Department of Biomedical Engineering, Columbia University, New York, NY 10032, USA
- Bryan Z. Wang
  - Department of Biomedical Engineering, Columbia University, New York, NY 10032, USA
- Barry M. Fine
  - Department of Medicine, Division of Cardiology, Columbia University Medical Center, New York, NY 10032, USA
- Gordana Vunjak-Novakovic
  - Department of Biomedical Engineering, Columbia University, New York, NY 10032, USA
  - Department of Medicine, Division of Cardiology, Columbia University Medical Center, New York, NY 10032, USA
25
Mu J, Li Z, Zhang B, Zhang Q, Iqbal J, Wadood A, Wei T, Feng Y, Chen HF. Graphormer supervised de novo protein design method and function validation. Brief Bioinform 2024; 25:bbae135. [PMID: 38557677 PMCID: PMC10982952 DOI: 10.1093/bib/bbae135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/07/2023] [Revised: 01/31/2024] [Accepted: 03/12/2024] [Indexed: 04/04/2024]
Abstract
Protein design is central to nearly all protein engineering problems, as it can enable the creation of proteins with new biological functions, such as improving the catalytic efficiency of enzymes. One key facet of protein design, fixed-backbone protein sequence design, seeks to design new sequences that will conform to a prescribed protein backbone structure. Nonetheless, existing sequence design methods present limitations, such as low sequence diversity and shortcomings in experimental validation of the designed functional proteins. These inadequacies obstruct the goal of functional protein design. To address these limitations, we initially developed the Graphormer-based Protein Design (GPD) model. This model utilizes the Transformer on a graph-based representation of three-dimensional protein structures and applies Gaussian noise and random sequence masks to node features, thereby enhancing sequence recovery and diversity. The performance of the GPD model was significantly better than that of the state-of-the-art ProteinMPNN model on multiple independent tests, especially for sequence diversity. We employed GPD to design CalB hydrolase and generated nine artificially designed CalB proteins. The results show a 1.7-fold increase in catalytic activity compared to that of the wild-type CalB and strong substrate selectivity on p-nitrophenyl acetate with different carbon chain lengths (C2-C16). Thus, the GPD method could be used for the de novo design of industrial enzymes and protein drugs. The code was released at https://github.com/decodermu/GPD.
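The abstract mentions perturbing node features with Gaussian noise and random masks. The sketch below shows one way such a perturbation could look. It is not the GPD implementation: the function name `perturb_node_features`, the noise scale, the masking rate, and the choice to zero out masked nodes are all assumptions made for illustration.

```python
import random


def perturb_node_features(features, noise_std=0.1, mask_prob=0.15, seed=None):
    """Return (perturbed_features, mask_flags) for a list of per-node feature vectors.

    Each node is either masked (its vector replaced by zeros) with probability
    mask_prob, or kept and jittered with zero-mean Gaussian noise.
    All defaults are hypothetical, not values from the paper.
    """
    rng = random.Random(seed)
    perturbed = []
    mask_flags = []
    for row in features:
        if rng.random() < mask_prob:
            # Masked node: hide its features entirely.
            perturbed.append([0.0] * len(row))
            mask_flags.append(True)
        else:
            # Unmasked node: add independent Gaussian noise to each feature.
            perturbed.append([x + rng.gauss(0.0, noise_std) for x in row])
            mask_flags.append(False)
    return perturbed, mask_flags


# Three residues (nodes), each with a 4-dimensional made-up feature vector.
node_features = [
    [1.0, 0.0, 0.5, 0.2],
    [0.0, 1.0, 0.1, 0.9],
    [0.3, 0.3, 0.3, 0.3],
]
noisy, masked = perturb_node_features(node_features, seed=0)
print(masked)
```

Perturbations of this kind are commonly used as regularizers during training. The original features are left untouched so that each training step can draw a fresh perturbation.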
Affiliation(s)
- Junxi Mu
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
  - Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, No.5 Yiheyuan Road, Beijing, 100871, China
- Zhengxin Li
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
- Bo Zhang
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
- Qi Zhang
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
- Jamshed Iqbal
  - Centre for Advanced Drug Research, COMSATS University Islamabad, Abbottabad Campus, Abbottabad, 22060, Pakistan
- Abdul Wadood
  - Department of Biochemistry, Abdul Wali Khan University Mardan, Mardan, 23200, Pakistan
- Ting Wei
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
- Yan Feng
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
- Hai-Feng Chen
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
26
Jänes J, Beltrao P. Deep learning for protein structure prediction and design-progress and applications. Mol Syst Biol 2024; 20:162-169. [PMID: 38291232 PMCID: PMC10912668 DOI: 10.1038/s44320-024-00016-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2023] [Revised: 12/21/2023] [Accepted: 01/11/2024] [Indexed: 02/01/2024]
Abstract
Proteins are the key molecular machines that orchestrate all biological processes of the cell. Most proteins fold into three-dimensional shapes that are critical for their function. Studying the 3D shape of proteins can inform us of the mechanisms that underlie biological processes in living cells and can have practical applications in the study of disease mutations or the discovery of novel drug treatments. Here, we review the progress made in sequence-based prediction of protein structures with a focus on applications that go beyond the prediction of single monomer structures. This includes the application of deep learning methods for the prediction of structures of protein complexes, different conformations, the evolution of protein structures and the application of these methods to protein design. These developments create new opportunities for research that will have impact across many areas of biomedical research.
Affiliation(s)
- Jürgen Jänes
  - Institute of Molecular Systems Biology, ETH Zürich, 8093, Zürich, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Pedro Beltrao
  - Institute of Molecular Systems Biology, ETH Zürich, 8093, Zürich, Switzerland
  - Swiss Institute of Bioinformatics, Lausanne, Switzerland
27
Chu AE, Lu T, Huang PS. Sparks of function by de novo protein design. Nat Biotechnol 2024; 42:203-215. [PMID: 38361073 PMCID: PMC11366440 DOI: 10.1038/s41587-024-02133-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/05/2023] [Accepted: 01/09/2024] [Indexed: 02/17/2024]
Abstract
Information in proteins flows from sequence to structure to function, with each step causally driven by the preceding one. Protein design is founded on inverting this process: specify a desired function, design a structure executing this function, and find a sequence that folds into this structure. This 'central dogma' underlies nearly all de novo protein-design efforts. Our ability to accomplish these tasks depends on our understanding of protein folding and function and our ability to capture this understanding in computational methods. In recent years, deep learning-derived approaches for efficient and accurate structure modeling and enrichment of successful designs have enabled progression beyond the design of protein structures and towards the design of functional proteins. We examine these advances in the broader context of classical de novo protein design and consider implications for future challenges to come, including fundamental capabilities such as sequence and structure co-design and conformational control considering flexibility, and functional objectives such as antibody and enzyme design.
Affiliation(s)
- Alexander E Chu
  - Biophysics Program, Stanford University, Palo Alto, CA, USA
  - Department of Bioengineering, Stanford University, Palo Alto, CA, USA
  - Google DeepMind, London, UK
- Tianyu Lu
  - Department of Bioengineering, Stanford University, Palo Alto, CA, USA
- Po-Ssu Huang
  - Biophysics Program, Stanford University, Palo Alto, CA, USA
  - Department of Bioengineering, Stanford University, Palo Alto, CA, USA
28
Yu J, Mu J, Wei T, Chen HF. Multi-indicator comparative evaluation for deep learning-based protein sequence design methods. Bioinformatics 2024; 40:btae037. [PMID: 38261649 PMCID: PMC10868333 DOI: 10.1093/bioinformatics/btae037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/24/2023] [Revised: 12/20/2023] [Accepted: 01/18/2024] [Indexed: 01/25/2024]
Abstract
MOTIVATION Proteins found in nature represent only a fraction of the vast space of possible proteins. Protein design presents an opportunity to explore and expand this protein landscape. Within protein design, protein sequence design plays a crucial role, and numerous successful methods have been developed. Notably, deep learning-based protein sequence design methods have experienced significant advancements in recent years. However, a comprehensive and systematic comparison and evaluation of these methods have been lacking, with indicators provided by different methods often inconsistent or lacking effectiveness. RESULTS To address this gap, we have designed a diverse set of indicators that cover several important aspects, including sequence recovery, diversity, root-mean-square deviation of protein structure, secondary structure, and the distribution of polar and nonpolar amino acids. In our evaluation, we have employed an improved weighted inferiority-superiority distance method to comprehensively assess the performance of eight widely used deep learning-based protein sequence design methods. Our evaluation not only provides rankings of these methods but also offers optimization suggestions by analyzing the strengths and weaknesses of each method. Furthermore, we have developed a method to select the best temperature parameter and proposed solutions for the common issue of designing sequences with consecutive repetitive amino acids, which is often encountered in protein design methods. These findings can greatly assist users in selecting suitable protein sequence design methods. Overall, our work contributes to the field of protein sequence design by providing a comprehensive evaluation system and optimization suggestions for different methods.
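Two of the indicators named above, sequence recovery and diversity, are easy to illustrate. The sketch below treats recovery as the fraction of positions where a designed sequence matches the native one, and diversity as one minus the mean pairwise identity among designs. Both definitions are assumptions chosen for illustration, since the abstract does not give the exact formulas, and the sequences are made up.

```python
from itertools import combinations


def identity(seq_a, seq_b):
    """Fraction of aligned positions at which two equal-length sequences agree."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def sequence_recovery(designed, native):
    """Recovery of one design: identity to the native sequence."""
    return identity(designed, native)


def diversity(designs):
    """One minus the mean pairwise identity over all pairs of designs."""
    pairs = list(combinations(designs, 2))
    if not pairs:
        raise ValueError("need at least two designs")
    return 1.0 - sum(identity(a, b) for a, b in pairs) / len(pairs)


# Made-up 8-residue example: one native sequence and three designs.
native = "MKTAYIAK"
designs = ["MKTAYLAK", "MRTAYIAK", "MKSAYIGK"]

for d in designs:
    print(d, sequence_recovery(d, native))
print("diversity:", diversity(designs))
```

The two numbers pull in opposite directions. A method that always returns the native sequence has perfect recovery and zero diversity, which is one reason a multi-indicator evaluation is useful.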
Affiliation(s)
- Jinyu Yu
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai 200240, China
- Junxi Mu
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai 200240, China
- Ting Wei
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai 200240, China
- Hai-Feng Chen
  - State Key Laboratory of Microbial Metabolism, Joint International Research Laboratory of Metabolic & Developmental Sciences, Department of Bioinformatics and Biostatistics, National Experimental Teaching Center for Life Sciences and Biotechnology, School of Life Sciences and Biotechnology, Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai 200240, China
29
Jones RD. Information Transmission in G Protein-Coupled Receptors. Int J Mol Sci 2024; 25:1621. [PMID: 38338905 PMCID: PMC10855935 DOI: 10.3390/ijms25031621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2023] [Revised: 01/19/2024] [Accepted: 01/22/2024] [Indexed: 02/12/2024]
Abstract
G protein-coupled receptors (GPCRs) are the largest class of receptors in the human genome and constitute about 30% of all drug targets. In this article, intended for a non-mathematical audience, both experimental observations and new theoretical results are compared in the context of information transmission across the cell membrane. The amount of information actually currently used or projected to be used in clinical settings is a small fraction of the information transmission capacity of the GPCR. This indicates that the number of yet undiscovered drug targets within GPCRs is much larger than what is currently known. Theoretical studies with some experimental validation indicate that localized heat deposition and dissipation are key to the identification of sites and mechanisms for drug action.
Affiliation(s)
- Roger D Jones
  - European Centre for Living Technology, University of Venice, 30123 Venice, Italy
30
Bravi B. Development and use of machine learning algorithms in vaccine target selection. NPJ Vaccines 2024; 9:15. [PMID: 38242890 PMCID: PMC10798987 DOI: 10.1038/s41541-023-00795-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/04/2023] [Accepted: 12/07/2023] [Indexed: 01/21/2024]
Abstract
Computer-aided discovery of vaccine targets has become a cornerstone of rational vaccine design. In this article, I discuss how Machine Learning (ML) can inform and guide key computational steps in rational vaccine design concerned with the identification of B and T cell epitopes and correlates of protection. I provide examples of ML models, as well as types of data and predictions for which they are built. I argue that interpretable ML has the potential to improve the identification of immunogens also as a tool for scientific discovery, by helping elucidate the molecular processes underlying vaccine-induced immune responses. I outline the limitations and challenges in terms of data availability and method development that need to be addressed to bridge the gap between advances in ML predictions and their translational application to vaccine design.
Affiliation(s)
- Barbara Bravi
  - Department of Mathematics, Imperial College London, London, SW7 2AZ, UK
31
Krokidis MG, Dimitrakopoulos GN, Vrahatis AG, Exarchos TP, Vlamos P. Challenges and limitations in computational prediction of protein misfolding in neurodegenerative diseases. Front Comput Neurosci 2024; 17:1323182. [PMID: 38250244 PMCID: PMC10796696 DOI: 10.3389/fncom.2023.1323182] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2023] [Accepted: 12/19/2023] [Indexed: 01/23/2024]
Affiliation(s)
- Panagiotis Vlamos
  - Bioinformatics and Human Electrophysiology Laboratory, Department of Informatics, Ionian University, Corfu, Greece
32
Xu B, Chen Y, Xue W. Computational Protein Design - Where it goes? Curr Med Chem 2024; 31:2841-2854. [PMID: 37272467 DOI: 10.2174/0929867330666230602143700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/19/2022] [Revised: 02/18/2023] [Accepted: 03/15/2023] [Indexed: 06/06/2023]
Abstract
Proteins have been playing a critical role in the regulation of diverse biological processes related to human life. With the increasing demand, functional proteins are sparse in this immense sequence space. Therefore, protein design has become an important task in various fields, including medicine, food, energy, materials, etc. Directed evolution has recently led to significant achievements. Molecular modification of proteins through directed evolution technology has significantly advanced the fields of enzyme engineering, metabolic engineering, medicine, and beyond. However, it is impossible to identify desirable sequences from a large number of synthetic sequences alone. As a result, computational methods, including data-driven machine learning and physics-based molecular modeling, have been introduced to protein engineering to produce more functional proteins. This review focuses on recent advances in computational protein design, highlighting the applicability of different approaches as well as their limitations.
Affiliation(s)
- Binbin Xu
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
| | - Yingjun Chen
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
| | - Weiwei Xue
- Chongqing Key Laboratory of Natural Product Synthesis and Drug Research, School of Pharmaceutical Sciences, Chongqing University, Chongqing 401331, China
33
|
Wang J, Chen C, Yao G, Ding J, Wang L, Jiang H. Intelligent Protein Design and Molecular Characterization Techniques: A Comprehensive Review. Molecules 2023; 28:7865. [PMID: 38067593 PMCID: PMC10707872 DOI: 10.3390/molecules28237865] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/21/2023] [Revised: 11/13/2023] [Accepted: 11/23/2023] [Indexed: 12/18/2023] Open
Abstract
In recent years, the widespread application of artificial intelligence algorithms in protein structure, function prediction, and de novo protein design has significantly accelerated the process of intelligent protein design and led to many noteworthy achievements. This advancement in protein intelligent design holds great potential to accelerate the development of new drugs, enhance the efficiency of biocatalysts, and even create entirely new biomaterials. Protein characterization is the key to the performance of intelligent protein design. However, there is no consensus on the most suitable characterization method for intelligent protein design tasks. This review describes the methods, characteristics, and representative applications of traditional descriptors, sequence-based and structure-based protein characterization. It discusses their advantages, disadvantages, and scope of application. It is hoped that this could help researchers to better understand the limitations and application scenarios of these methods, and provide valuable references for choosing appropriate protein characterization techniques for related research in the field, so as to better carry out protein research.
Affiliation(s)
- Junjie Ding
- State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
| | - Liangliang Wang
- State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
| | - Hui Jiang
- State Key Laboratory of NBC Protection for Civilian, Beijing 102205, China; (J.W.); (C.C.); (G.Y.)
34
|
Lategan FA, Schreiber C, Patterton HG. SeqPredNN: a neural network that generates protein sequences that fold into specified tertiary structures. BMC Bioinformatics 2023; 24:373. [PMID: 37789284 PMCID: PMC10546711 DOI: 10.1186/s12859-023-05498-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2022] [Accepted: 09/25/2023] [Indexed: 10/05/2023] Open
Abstract
BACKGROUND The relationship between the sequence of a protein, its structure, and the resulting connection between its structure and function, is a foundational principle in biological science. Only recently has the computational prediction of protein structure based only on protein sequence been addressed effectively by AlphaFold, a neural network approach that can predict the majority of protein structures with X-ray crystallographic accuracy. A question that is now of acute relevance is the "inverse protein folding problem": predicting the sequence of a protein that folds into a specified structure. This will be of immense value in protein engineering and biotechnology, and will allow the design and expression of recombinant proteins that can, for instance, fold into specified structures as a scaffold for the attachment of recombinant antigens, or enzymes with modified or novel catalytic activities. Here we describe the development of SeqPredNN, a feed-forward neural network trained with X-ray crystallographic structures from the RCSB Protein Data Bank to predict the identity of amino acids in a protein structure using only the relative positions, orientations, and backbone dihedral angles of nearby residues. RESULTS We predict the sequence of a protein expected to fold into a specified structure and assess the accuracy of the prediction using both AlphaFold and RoseTTAFold to computationally generate the fold of the derived sequence. We show that the sequences predicted by SeqPredNN fold into a structure with a median TM-score of 0.638 when compared to the crystal structure according to AlphaFold predictions, yet these sequences are unique and only 28.4% identical to the sequence of the crystallized protein. CONCLUSIONS We propose that SeqPredNN will be a valuable tool to generate proteins of defined structure for the design of novel biomaterials, pharmaceuticals, catalysts, and reporter systems. 
The low sequence identity of its predictions compared to the native sequence could prove useful for developing proteins with modified physical properties, such as water solubility and thermal stability. The speed and ease of use of SeqPredNN offers a significant advantage over physics-based protein design methods.
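The 28.4% figure quoted in this abstract is a sequence identity. As a minimal sketch of how such a percentage could be computed, the block below compares two sequences position by position. It assumes the designed and native sequences have equal length and need no gaps (plausible when both occupy the same backbone, but the abstract does not state the exact procedure). The function name `percent_identity` and the example sequences are hypothetical, not taken from SeqPredNN.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree.

    Assumption: the sequences are already aligned residue-for-residue
    (no gaps), so position i in seq_a corresponds to position i in seq_b.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same length")
    if not seq_a:
        raise ValueError("sequences must be non-empty")
    # Count positions where the one-letter amino acid codes match.
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical four-residue example: three of four positions match.
identity = percent_identity("ACDE", "ACDF")
```

For sequences of different lengths, a pairwise alignment step would be needed first, and the identity would then depend on the alignment and on the denominator chosen.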
Affiliation(s)
- F Adriaan Lategan
- Center for Bioinformatics and Computational Biology, Stellenbosch University, Stellenbosch, 7600, South Africa
| | - Caroline Schreiber
- Center for Bioinformatics and Computational Biology, Stellenbosch University, Stellenbosch, 7600, South Africa
| | - Hugh G Patterton
- Center for Bioinformatics and Computational Biology, Stellenbosch University, Stellenbosch, 7600, South Africa.
35
|
Wu F, Wu L, Radev D, Xu J, Li SZ. Integration of pre-trained protein language models into geometric deep learning networks. Commun Biol 2023; 6:876. [PMID: 37626165 PMCID: PMC10457366 DOI: 10.1038/s42003-023-05133-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 03/23/2023] [Accepted: 07/11/2023] [Indexed: 08/27/2023] Open
Abstract
Geometric deep learning has recently achieved great success in non-Euclidean domains, and learning on 3D structures of large biomolecules is emerging as a distinct research area. However, its efficacy is largely constrained due to the limited quantity of structural data. Meanwhile, protein language models trained on substantial 1D sequences have shown burgeoning capabilities with scale in a broad range of applications. Several preceding studies consider combining these different protein modalities to promote the representation power of geometric neural networks but fail to present a comprehensive understanding of their benefits. In this work, we integrate the knowledge learned by well-trained protein language models into several state-of-the-art geometric networks and evaluate a variety of protein representation learning benchmarks, including protein-protein interface prediction, model quality assessment, protein-protein rigid-body docking, and binding affinity prediction. Our findings show an overall improvement of 20% over baselines. Strong evidence indicates that the incorporation of protein language models' knowledge enhances geometric networks' capacity by a significant margin and can be generalized to complex tasks.
Affiliation(s)
- Fang Wu
- AI Research and Innovation Laboratory, Westlake University, 310030, Hangzhou, China
| | - Lirong Wu
- AI Research and Innovation Laboratory, Westlake University, 310030, Hangzhou, China
| | - Dragomir Radev
- Department of Computer Science, Yale University, New Haven, CT, 06511, USA
| | - Jinbo Xu
- Institute of AI Industry Research, Tsinghua University, Haidian Street, 100084, Beijing, China
- Toyota Technological Institute at Chicago, Chicago, IL, 60637, USA
| | - Stan Z Li
- AI Research and Innovation Laboratory, Westlake University, 310030, Hangzhou, China.
36
|
Ichikawa DM, Abdin O, Alerasool N, Kogenaru M, Mueller AL, Wen H, Giganti DO, Goldberg GW, Adams S, Spencer JM, Razavi R, Nim S, Zheng H, Gionco C, Clark FT, Strokach A, Hughes TR, Lionnet T, Taipale M, Kim PM, Noyes MB. A universal deep-learning model for zinc finger design enables transcription factor reprogramming. Nat Biotechnol 2023; 41:1117-1129. [PMID: 36702896 PMCID: PMC10421740 DOI: 10.1038/s41587-022-01624-4] [Citation(s) in RCA: 36] [Impact Index Per Article: 18.0] [Received: 11/05/2021] [Accepted: 11/17/2022] [Indexed: 01/27/2023]
Abstract
Cys2His2 zinc finger (ZF) domains engineered to bind specific target sequences in the genome provide an effective strategy for programmable regulation of gene expression, with many potential therapeutic applications. However, the structurally intricate engagement of ZF domains with DNA has made their design challenging. Here we describe the screening of 49 billion protein-DNA interactions and the development of a deep-learning model, ZFDesign, that solves ZF design for any genomic target. ZFDesign is a modern machine learning method that models global and target-specific differences induced by a range of library environments and specifically takes into account compatibility of neighboring fingers using a novel hierarchical transformer architecture. We demonstrate the versatility of designed ZFs as nucleases as well as activators and repressors by seamless reprogramming of human transcription factors. These factors could be used to upregulate an allele of haploinsufficiency, downregulate a gain-of-function mutation or test the consequence of regulation of a single gene as opposed to the many genes that a transcription factor would normally influence.
Affiliation(s)
- David M Ichikawa
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
- Department of Biochemistry and Molecular Pharmacology, NYU Grossman School of Medicine, New York, NY, USA
| | - Osama Abdin
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
| | - Nader Alerasool
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Manjunatha Kogenaru
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - April L Mueller
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Han Wen
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - David O Giganti
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Gregory W Goldberg
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Samantha Adams
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Jeffrey M Spencer
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Rozita Razavi
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Satra Nim
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Hong Zheng
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Courtney Gionco
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Finnegan T Clark
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Alexey Strokach
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
| | - Timothy R Hughes
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Timothee Lionnet
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA
| | - Mikko Taipale
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
| | - Philip M Kim
- Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada.
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada.
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada.
| | - Marcus B Noyes
- Institute for Systems Genetics, NYU Grossman School of Medicine, New York, NY, USA.
- Department of Biochemistry and Molecular Pharmacology, NYU Grossman School of Medicine, New York, NY, USA.
37
|
Jin W, Brannan KW, Kapeli K, Park SS, Tan HQ, Gosztyla ML, Mujumdar M, Ahdout J, Henroid B, Rothamel K, Xiang JS, Wong L, Yeo GW. HydRA: Deep-learning models for predicting RNA-binding capacity from protein interaction association context and protein sequence. Mol Cell 2023; 83:2595-2611.e11. [PMID: 37421941 PMCID: PMC11098078 DOI: 10.1016/j.molcel.2023.06.019] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/03/2023] [Revised: 03/20/2023] [Accepted: 06/13/2023] [Indexed: 07/10/2023]
Abstract
RNA-binding proteins (RBPs) control RNA metabolism to orchestrate gene expression and, when dysfunctional, underlie human diseases. Proteome-wide discovery efforts predict thousands of RBP candidates, many of which lack canonical RNA-binding domains (RBDs). Here, we present a hybrid ensemble RBP classifier (HydRA), which leverages information from both intermolecular protein interactions and internal protein sequence patterns to predict RNA-binding capacity with unparalleled specificity and sensitivity using support vector machines (SVMs), convolutional neural networks (CNNs), and Transformer-based protein language models. Occlusion mapping by HydRA robustly detects known RBDs and predicts hundreds of uncharacterized RNA-binding associated domains. Enhanced CLIP (eCLIP) for HydRA-predicted RBP candidates reveals transcriptome-wide RNA targets and confirms RNA-binding activity for HydRA-predicted RNA-binding associated domains. HydRA accelerates construction of a comprehensive RBP catalog and expands the diversity of RNA-binding associated domains.
Affiliation(s)
- Wenhao Jin
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
| | - Kristopher W Brannan
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
| | - Katannya Kapeli
- Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Samuel S Park
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
| | - Hui Qing Tan
- Department of Physiology, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
| | - Maya L Gosztyla
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
| | - Mayuresh Mujumdar
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
| | - Joshua Ahdout
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
| | - Bryce Henroid
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
| | - Katherine Rothamel
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
| | - Joy S Xiang
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA
| | - Limsoon Wong
- Department of Computer Science, National University of Singapore, Singapore, Singapore
| | - Gene W Yeo
- Department of Cellular and Molecular Medicine, University of California, San Diego, La Jolla, CA, USA; Institute for Genomic Medicine and UCSD Stem Cell Program, University of California, San Diego, La Jolla, CA, USA; Stem Cell Program, University of California, San Diego, La Jolla, CA, USA.
38
|
Yan J, Li S, Zhang Y, Hao A, Zhao Q. ZetaDesign: an end-to-end deep learning method for protein sequence design and side-chain packing. Brief Bioinform 2023; 24:bbad257. [PMID: 37429578 DOI: 10.1093/bib/bbad257] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 03/28/2023] [Revised: 06/05/2023] [Accepted: 06/21/2023] [Indexed: 07/12/2023] Open
Abstract
Computational protein design has been demonstrated to be the most powerful tool in the last few years among protein designing and repacking tasks. In practice, these two tasks are strongly related but often treated separately. Besides, state-of-the-art deep-learning-based methods cannot provide interpretability from an energy perspective, affecting the accuracy of the design. Here we propose a new systematic approach, including both a posterior probability part and a joint probability part, to solve the two essential questions once and for all. This approach takes the physicochemical property of amino acids into consideration and uses the joint probability model to ensure the convergence between structure and amino acid type. Our results demonstrated that this method could generate feasible, high-confidence sequences with low-energy side conformations. The designed sequences can fold into target structures with high confidence and maintain relatively stable biochemical properties. The side chain conformation has a significantly lower energy landscape without delegating to a rotamer library or performing the expensive conformational searches. Overall, we propose an end-to-end method that combines the advantages of both deep learning and energy-based methods. The design results of this model demonstrate high efficiency and precision, as well as a low energy state and good interpretability.
Affiliation(s)
- Junyu Yan
- State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
| | - Shuai Li
- State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
| | - Ying Zhang
- The Key Laboratory of Cell Proliferation and Regulation Biology, Ministry of Education, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Aimin Hao
- State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
| | - Qinping Zhao
- State Key Laboratory of Virtual Reality Technology and Systems, Beihang University, Beijing, China
39
|
McFee M, Kim PM. GDockScore: a graph-based protein-protein docking scoring function. BIOINFORMATICS ADVANCES 2023; 3:vbad072. [PMID: 37359726 PMCID: PMC10290236 DOI: 10.1093/bioadv/vbad072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/14/2023] [Revised: 05/30/2023] [Accepted: 06/10/2023] [Indexed: 06/28/2023]
Abstract
Summary Protein complexes play vital roles in a variety of biological processes, such as mediating biochemical reactions, the immune response and cell signalling, with 3D structure specifying function. Computational docking methods provide a means to determine the interface between two complexed polypeptide chains without using time-consuming experimental techniques. The docking process requires the optimal solution to be selected with a scoring function. Here, we propose a novel graph-based deep learning model that utilizes mathematical graph representations of proteins to learn a scoring function (GDockScore). GDockScore was pre-trained on docking outputs generated with the Protein Data Bank biounits and the RosettaDock protocol, and then fine-tuned on HADDOCK decoys generated on the ZDOCK Protein Docking Benchmark. GDockScore performs similarly to the Rosetta scoring function on docking decoys generated using the RosettaDock protocol. Furthermore, state-of-the-art is achieved on the CAPRI score set, a challenging dataset for developing docking scoring functions. Availability and implementation The model implementation is available at https://gitlab.com/mcfeemat/gdockscore. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Matthew McFee
- Department of Molecular Genetics, The University of Toronto, Toronto, ON M5S 1A8, Canada
- Donnelly Centre for Cellular and Biomolecular Research, The University of Toronto, Toronto, ON M5S 3E1, Canada
40
|
Zhang H, Li X, Li Z, Huang D, Zhang L. Estimation of Particle Location in Granular Materials Based on Graph Neural Networks. MICROMACHINES 2023; 14:714. [PMID: 37420946 DOI: 10.3390/mi14040714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2023] [Revised: 03/20/2023] [Accepted: 03/21/2023] [Indexed: 07/09/2023]
Abstract
Particle locations determine the whole structure of a granular system, which is crucial to understanding various anomalous behaviors in glasses and amorphous solids. How to accurately determine the coordinates of each particle in such materials within a short time has always been a challenge. In this paper, we use an improved graph convolutional neural network to estimate the particle locations in two-dimensional photoelastic granular materials purely from the knowledge of the distances for each particle, which can be estimated in advance via a distance estimation algorithm. The robustness and effectiveness of our model are verified by testing other granular systems with different disorder degrees, as well as systems with different configurations. In this study, we attempt to provide a new route to the structural information of granular systems irrelevant to dimensionality, compositions, or other material properties.
Affiliation(s)
- Hang Zhang
- School of Automation, Central South University, Changsha 410083, China
| | - Xingqiao Li
- School of Automation, Central South University, Changsha 410083, China
| | - Zirui Li
- School of Automation, Central South University, Changsha 410083, China
| | - Duan Huang
- School of Computer Science and Engineering, Central South University, Changsha 410083, China
| | - Ling Zhang
- School of Automation, Central South University, Changsha 410083, China
41
|
Omar SI, Keasar C, Ben-Sasson AJ, Haber E. Protein Design Using Physics Informed Neural Networks. Biomolecules 2023; 13:457. [PMID: 36979392 PMCID: PMC10046838 DOI: 10.3390/biom13030457] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/18/2023] [Revised: 02/16/2023] [Accepted: 02/27/2023] [Indexed: 03/06/2023] Open
Abstract
The inverse protein folding problem, also known as protein sequence design, seeks to predict an amino acid sequence that folds into a specific structure and performs a specific function. Recent advancements in machine learning techniques have been successful in generating functional sequences, outperforming previous energy function-based methods. However, these machine learning methods are limited in their interoperability and robustness, especially when designing proteins that must function under non-ambient conditions, such as high temperature, extreme pH, or in various ionic solvents. To address this issue, we propose a new Physics-Informed Neural Networks (PINNs)-based protein sequence design approach. Our approach combines all-atom molecular dynamics simulations, a PINNs MD surrogate model, and a relaxation of binary programming to solve the protein design task while optimizing both energy and the structural stability of proteins. We demonstrate the effectiveness of our design framework in designing proteins that can function under non-ambient conditions.
Affiliation(s)
- Chen Keasar
- Department of Computer Science, Ben Gurion University of the Negev, Be’er Sheva 84105, Israel
| | - Ariel J. Ben-Sasson
- Independent Researcher, Haifa 3436301, Israel
- Correspondence: (A.J.B.-S.); (E.H.)
| | - Eldad Haber
- Department of Earth Ocean and Atmospheric Sciences, University of British Columbia, Vancouver, BC V6T 1Z4, Canada
- Correspondence: (A.J.B.-S.); (E.H.)
42
|
Yuan Y, Xin K, Liu J, Zhao P, Lu MP, Yan Y, Hu Y, Huo H, Li Z, Fang T. A GNN-based model for capturing spatio-temporal changes in locomotion behaviors of aging C. elegans. Comput Biol Med 2023; 155:106694. [PMID: 36812812 DOI: 10.1016/j.compbiomed.2023.106694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/01/2022] [Revised: 01/27/2023] [Accepted: 02/14/2023] [Indexed: 02/17/2023]
Abstract
Investigating the locomotion of aging C. elegans is an important way to understand the basic mechanisms behind age-related changes in organisms. However, the locomotion of aging C. elegans is often quantified using insufficient physical variables, which makes it challenging to capture essential dynamics. To study changes in the locomotion pattern of aging C. elegans, we developed a novel data-driven model based on graph neural networks, in which the C. elegans body is modeled as a long chain with interactions within and between adjacent segments, and their interactions are described by high-dimensional variables. Using this model, we discovered that each segment of the C. elegans body generally tends to maintain its locomotion, i.e., tries to keep the bending angle unchanged, and expects to change the locomotion of the adjacent segments. The ability to maintain its locomotion strengthens with age. In addition, subtle distinctions in the changes in the locomotion pattern of C. elegans at various aging stages were observed. Our model is anticipated to provide a data-driven method for quantifying the changes in the locomotion pattern of aging C. elegans and for mining the underlying causes of these changes.
Affiliation(s)
- Ye Yuan
- Institute of Machine Intelligence, University of Shanghai for Science and Technology, Shanghai, 200093, China; Department of Automation, Shanghai Jiao Tong University, Shanghai, 200240, China; Key Laboratory of System Control and Information Processing, Ministry of Education, China
| | - Kuankuan Xin
- Queensland Brain Institute, The University of Queensland, Brisbane, QLD, 4072, Australia
| | - Jian Liu
- Department of Automation, Shanghai Jiao Tong University, Shanghai, 200240, China; Key Laboratory of System Control and Information Processing, Ministry of Education, China
| | - Peng Zhao
- Department of Automation, Shanghai Jiao Tong University, Shanghai, 200240, China; Key Laboratory of System Control and Information Processing, Ministry of Education, China
| | - Man Pok Lu
- Queensland Brain Institute, The University of Queensland, Brisbane, QLD, 4072, Australia
| | - Yuner Yan
- Queensland Brain Institute, The University of Queensland, Brisbane, QLD, 4072, Australia
| | - Yuchen Hu
- Queensland Brain Institute, The University of Queensland, Brisbane, QLD, 4072, Australia
| | - Hong Huo
- Department of Automation, Shanghai Jiao Tong University, Shanghai, 200240, China; Key Laboratory of System Control and Information Processing, Ministry of Education, China.
| | - Zhaoyu Li
- Queensland Brain Institute, The University of Queensland, Brisbane, QLD, 4072, Australia.
| | - Tao Fang
- Department of Automation, Shanghai Jiao Tong University, Shanghai, 200240, China; Key Laboratory of System Control and Information Processing, Ministry of Education, China.
43
|
Li AJ, Lu M, Desta I, Sundar V, Grigoryan G, Keating AE. Neural network-derived Potts models for structure-based protein design using backbone atomic coordinates and tertiary motifs. Protein Sci 2023; 32:e4554. [PMID: 36564857 PMCID: PMC9854172 DOI: 10.1002/pro.4554] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/18/2022] [Revised: 11/15/2022] [Accepted: 12/20/2022] [Indexed: 12/25/2022]
Abstract
Designing novel proteins to perform desired functions, such as binding or catalysis, is a major goal in synthetic biology. A variety of computational approaches can aid in this task. An energy-based framework rooted in the sequence-structure statistics of tertiary motifs (TERMs) can be used for sequence design on predefined backbones. Neural network models that use backbone coordinate-derived features provide another way to design new proteins. In this work, we combine the two methods to make neural structure-based models more suitable for protein design. Specifically, we supplement backbone-coordinate features with TERM-derived data, as inputs, and we generate energy functions as outputs. We present two architectures that generate Potts models over the sequence space: TERMinator, which uses both TERM-based and coordinate-based information, and COORDinator, which uses only coordinate-based information. Using these two models, we demonstrate that TERMs can be utilized to improve native sequence recovery performance of neural models. Furthermore, we demonstrate that sequences designed by TERMinator are predicted to fold to their target structures by AlphaFold. Finally, we show that both TERMinator and COORDinator learn notions of energetics, and these methods can be fine-tuned on experimental data to improve predictions. Our results suggest that using TERM-based and coordinate-based features together may be beneficial for protein design and that structure-based neural models that produce Potts energy tables have utility for flexible applications in protein science.
Affiliation(s)
- Alex J. Li
- Department of Chemistry, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Mindren Lu
- Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Israel Desta
- Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Vikram Sundar
- Computational and Systems Biology Program, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Gevorg Grigoryan
- Department of Computer Science, Dartmouth College, Hanover, New Hampshire, USA
- Amy E. Keating
- Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Department of Biology, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Koch Institute for Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
|
44
|
Castorina LV, Petrenas R, Subr K, Wood CW. PDBench: evaluating computational methods for protein-sequence design. Bioinformatics 2023; 39:btad027. [PMID: 36637198 PMCID: PMC9869650 DOI: 10.1093/bioinformatics/btad027] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 08/30/2022] [Revised: 11/14/2022] [Accepted: 01/12/2023] [Indexed: 01/14/2023] Open
Abstract
SUMMARY Ever increasing amounts of protein structure data, combined with advances in machine learning, have led to the rapid proliferation of methods available for protein-sequence design. In order to utilize a design method effectively, it is important to understand the nuances of its performance and how it varies by design target. Here, we present PDBench, a set of proteins and a number of standard tests for assessing the performance of sequence-design methods. PDBench aims to maximize the structural diversity of the benchmark, compared with previous benchmarking sets, in order to provide useful biological insight into the behaviour of sequence-design methods, which is essential for evaluating their performance and practical utility. We believe that these tools are useful for guiding the development of novel sequence design algorithms and will enable users to choose a method that best suits their design target. AVAILABILITY AND IMPLEMENTATION https://github.com/wells-wood-research/PDBench. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Leonardo V Castorina
- School of Informatics, University of Edinburgh, 10 Crichton Street, Newington, Edinburgh EH8 9AB, UK
- Rokas Petrenas
- School of Biological Sciences, University of Edinburgh, Roger Land Building, Edinburgh EH9 3FF, UK
- Kartic Subr
- School of Informatics, University of Edinburgh, 10 Crichton Street, Newington, Edinburgh EH8 9AB, UK
- Christopher W Wood
- School of Biological Sciences, University of Edinburgh, Roger Land Building, Edinburgh EH9 3FF, UK
|
45
|
Nallasamy V, Seshiah M. Energy Profile Bayes and Thompson Optimized Convolutional Neural Network protein structure prediction. Neural Comput Appl 2023; 35:1983-2006. [PMID: 36245797 PMCID: PMC9542649 DOI: 10.1007/s00521-022-07868-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/06/2021] [Accepted: 09/21/2022] [Indexed: 01/12/2023]
Abstract
In living organisms, proteins are considered as the executants of biological functions. Owing to its pivotal role played in protein folding patterns, comprehension of protein structure is a challenging issue. Moreover, owing to numerous protein sequence exploration in protein data banks and complication of protein structures, experimental methods are found to be inadequate for protein structural class prediction. Hence, it is very much advantageous to design a reliable computational method to predict protein structural classes from protein sequences. In the recent few years there has been an elevated interest in using deep learning to assist protein structure prediction as protein structure prediction models can be utilized to screen a large number of novel sequences. In this regard, we propose a model employing Energy Profile for atom pairs in conjunction with the Legion-Class Bayes function called Energy Profile Legion-Class Bayes Protein Structure Identification model. Followed by this, we use a Thompson Optimized convolutional neural network to extract features between amino acids and then the Thompson Optimized SoftMax function is employed to extract associations between protein sequences for predicting secondary protein structure. The proposed Energy Profile Bayes and Thompson Optimized Convolutional Neural Network (EPB-OCNN) method tested distinct unique protein data and was compared to the state-of-the-art methods, the Template-Based Modeling, Protein Design using Deep Graph Neural Networks, a deep learning-based S-glutathionylation sites prediction tool called a Computational Framework, the Deep Learning and a distance-based protein structure prediction using deep learning. The results obtained when applied with the Biopython tool with respect to protein structure prediction time, protein structure prediction accuracy, specificity, recall, F-measure, and precision, respectively, are measured. 
The proposed EPB-OCNN method outperformed the state-of-the-art methods, thereby corroborating the objective.
Affiliation(s)
- Varanavasi Nallasamy
- Cognizant Technology Solutions Pvt. Ltd, CHIL SEZ IT Park, Keeranatham, Saravanam Patti, Coimbatore, Tamil Nadu 641035, India
- Malarvizhi Seshiah
- Department of Computer Science, Thiruvalluvar Government Arts College, Rasipuram, Namakkal, Tamil Nadu, India
|
46
|
Durairaj J, de Ridder D, van Dijk AD. Beyond sequence: Structure-based machine learning. Comput Struct Biotechnol J 2022; 21:630-643. [PMID: 36659927 PMCID: PMC9826903 DOI: 10.1016/j.csbj.2022.12.039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Revised: 12/21/2022] [Accepted: 12/21/2022] [Indexed: 12/31/2022] Open
Abstract
Recent breakthroughs in protein structure prediction demarcate the start of a new era in structural bioinformatics. Combined with various advances in experimental structure determination and the uninterrupted pace at which new structures are published, this promises an age in which protein structure information is as prevalent and ubiquitous as sequence. Machine learning in protein bioinformatics has been dominated by sequence-based methods, but this is now changing to make use of the deluge of rich structural information as input. Machine learning methods making use of structures are scattered across literature and cover a number of different applications and scopes; while some try to address questions and tasks within a single protein family, others aim to capture characteristics across all available proteins. In this review, we look at the variety of structure-based machine learning approaches, how structures can be used as input, and typical applications of these approaches in protein biology. We also discuss current challenges and opportunities in this all-important and increasingly popular field.
Affiliation(s)
- Janani Durairaj
- Biozentrum, University of Basel, Basel, Switzerland
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
- Dick de Ridder
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
- Aalt D.J. van Dijk
- Bioinformatics Group, Department of Plant Sciences, Wageningen University and Research, Wageningen, the Netherlands
|
47
|
Liu J, Zhang C, Lai L. GeoPacker: A novel deep learning framework for protein side-chain modeling. Protein Sci 2022; 31:e4484. [PMID: 36309961 PMCID: PMC9667900 DOI: 10.1002/pro.4484] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 08/11/2022] [Revised: 10/23/2022] [Accepted: 10/26/2022] [Indexed: 12/13/2022]
Abstract
Atomic interactions play essential roles in protein folding, structure stabilization, and function performance. Recent advances in deep learning-based methods have achieved impressive success not only in protein structure prediction, but also in protein sequence design. However, highly efficient and accurate protein side-chain prediction methods that can give detailed atomic interactions are still lacking. In the present study, we developed a deep learning based method, GeoPacker, that uses geometric deep learning coupled ResNet for protein side-chain modeling. GeoPacker explicitly represents atomic interactions with rotational and translational invariance for information extraction of relative locations. GeoPacker outperformed the state-of-the-art energy function-based methods in side-chain structure prediction accuracy and runs about 10 and 700 times faster than the deep learning-based method DLPacker and OPUS-rota4 with comparable prediction accuracy, respectively. The performance of GeoPacker does not depend on the secondary structures that the residues belong to. GeoPacker gives highly accurate predictions for buried residues in the protein core as well as protein-protein interface, making it a useful tool for protein structure modeling, protein, and interaction design.
Affiliation(s)
- Jiale Liu
- Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Changsheng Zhang
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, China
- Luhua Lai
- Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- BNLMS, College of Chemistry and Molecular Engineering, Peking University, Beijing, China
- Center for Quantitative Biology, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
|
48
|
Ferruz N, Heinzinger M, Akdel M, Goncearenco A, Naef L, Dallago C. From sequence to function through structure: Deep learning for protein design. Comput Struct Biotechnol J 2022; 21:238-250. [PMID: 36544476 PMCID: PMC9755234 DOI: 10.1016/j.csbj.2022.11.014] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Received: 08/31/2022] [Revised: 11/05/2022] [Accepted: 11/05/2022] [Indexed: 11/20/2022] Open
Abstract
The process of designing biomolecules, in particular proteins, is witnessing a rapid change in available tooling and approaches, moving from design through physicochemical force fields, to producing plausible, complex sequences fast via end-to-end differentiable statistical models. To achieve conditional and controllable protein design, researchers at the interface of artificial intelligence and biology leverage advances in natural language processing (NLP) and computer vision techniques, coupled with advances in computing hardware to learn patterns from growing biological databases, curated annotations thereof, or both. Once learned, these patterns can be used to provide novel insights into mechanistic biology and the design of biomolecules. However, navigating and understanding the practical applications for the many recent protein design tools is complex. To facilitate this, we 1) document recent advances in deep learning (DL) assisted protein design from the last three years, 2) present a practical pipeline that allows to go from de novo-generated sequences to their predicted properties and web-powered visualization within minutes, and 3) leverage it to suggest a generated protein sequence which might be used to engineer a biosynthetic gene cluster to produce a molecular glue-like compound. Lastly, we discuss challenges and highlight opportunities for the protein design field.
Key Words
- ADMM, Alternating Direction Method of Multipliers
- CNN, Convolutional Neural Network
- DL, Deep learning
- Deep learning
- Drug discovery
- FNN, fully-connected neural network
- GAN, Generative Adversarial Network
- GCN, Graph Convolutional Network
- GNN, Graph Neural Network
- GO, Gene Ontology
- GVP, Geometric Vector Perceptron
- LSTM, Long-Short Term Memory
- MLP, Multilayer Perceptron
- MSA, Multiple Sequence Alignment
- NLP, Natural Language Processing
- NSR, Natural Sequence Recovery
- Protein design
- Protein language models
- Protein prediction
- VAE, Variational Autoencoder
- pLM, protein Language Model
Affiliation(s)
- Noelia Ferruz
- Institute of Informatics and Applications, University of Girona, Girona, Spain
- Department of Biochemistry, University of Bayreuth, Bayreuth, Germany
- Michael Heinzinger
- Department of Informatics, Bioinformatics & Computational Biology, Technische Universität München, 85748 Garching, Germany
- Mehmet Akdel
- VantAI, 151 W 42nd Street, New York, NY 10036, United States
- Luca Naef
- VantAI, 151 W 42nd Street, New York, NY 10036, United States
- Christian Dallago
- Department of Informatics, Bioinformatics & Computational Biology, Technische Universität München, 85748 Garching, Germany
- VantAI, 151 W 42nd Street, New York, NY 10036, United States
- NVIDIA DE GmbH, Einsteinstraße 172, 81677 München, Germany
|
49
|
Liu H, Chen Q. Computational protein design with data‐driven approaches: Recent developments and perspectives. WIRES COMPUTATIONAL MOLECULAR SCIENCE 2022. [DOI: 10.1002/wcms.1646] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Affiliation(s)
- Haiyan Liu
- MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
- Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China
- School of Data Science, University of Science and Technology of China, Hefei, Anhui, China
- Quan Chen
- MOE Key Laboratory for Membraneless Organelles and Cellular Dynamics, School of Life Sciences, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, Anhui, China
- Biomedical Sciences and Health Laboratory of Anhui Province, University of Science and Technology of China, Hefei, Anhui, China
|
50
|
Gill ML. The rise of the machines in chemistry. MAGNETIC RESONANCE IN CHEMISTRY : MRC 2022; 60:1044-1051. [PMID: 35976263 DOI: 10.1002/mrc.5304] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/27/2021] [Revised: 08/07/2022] [Accepted: 08/09/2022] [Indexed: 06/15/2023]
Abstract
The use of artificial intelligence and, more specifically, deep learning methods in chemistry is becoming increasingly common. Applications in informatics fields, such as cheminformatics and proteomics, structural biology, and spectroscopy, including NMR, are on the rise. Recent developments in model architectures, such as graph convolutional neural networks and transformers, have been enabled by advancements in computational hardware and software. However, model architectures with more predictive power often require larger amounts of training data, which can be challenging to acquire, but this requirement can be mitigated through techniques like pretraining and fine-tuning. In spite of these successes, challenges remain, such as normalization and scaling of data, availability of experimentally acquired data, and model explainability.
|