1
Zheng Y, Young ND, Wang T, Chang BCH, Song J, Gasser RB. Systems biology of Haemonchus contortus - Advancing biotechnology for parasitic nematode control. Biotechnol Adv 2025; 81:108567. [PMID: 40127743 DOI: 10.1016/j.biotechadv.2025.108567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2025] [Revised: 03/19/2025] [Accepted: 03/21/2025] [Indexed: 03/26/2025]
Abstract
Parasitic nematodes represent a substantial global burden, impacting animal health, agriculture and economies worldwide. Of these worms, Haemonchus contortus - a blood-feeding nematode of ruminants - is a major pathogen and a model for molecular and applied parasitology research. This review synthesises some key advances in understanding the molecular biology, genetic diversity and host-parasite interactions of H. contortus, highlighting its value for comparative studies with the free-living nematode Caenorhabditis elegans. Key themes include recent developments in genomic, transcriptomic and proteomic technologies and resources, which are illuminating critical molecular pathways, including the ubiquitination pathway, protease/protease inhibitor systems and the secretome of H. contortus. Some of these insights are providing a foundation for identifying essential genes and exploring their potential as targets for novel anthelmintics or vaccines, particularly in the face of widespread anthelmintic resistance. Advanced bioinformatic tools, such as machine learning (ML) algorithms and artificial intelligence (AI)-driven protein structure prediction, are enhancing annotation capabilities, facilitating and accelerating analyses of gene functions, and biological pathways and processes. This review also discusses the integration of these tools with cutting-edge single-cell sequencing and spatial transcriptomics to dissect host-parasite interactions at the cellular level. The discussion emphasises the importance of curated databases, improved culture systems and functional genomics platforms to translate molecular discoveries into practical outcomes, such as novel interventions. New research findings and resources not only advance research on H. contortus and related nematodes but may also pave the way for innovative solutions to the global challenges with anthelmintic resistance.
Affiliation(s)
- Yuanting Zheng
  - Department of Veterinary Biosciences, Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, Victoria 3010, Australia
- Neil D Young
  - Department of Veterinary Biosciences, Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, Victoria 3010, Australia
- Tao Wang
  - Department of Veterinary Biosciences, Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, Victoria 3010, Australia
- Bill C H Chang
  - Department of Veterinary Biosciences, Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, Victoria 3010, Australia
- Jiangning Song
  - Faculty of IT, Department of Data Science and AI, Monash University, Victoria, Australia; Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Victoria, Australia; Monash Data Futures Institute, Monash University, Victoria, Australia
- Robin B Gasser
  - Department of Veterinary Biosciences, Melbourne Veterinary School, Faculty of Science, The University of Melbourne, Parkville, Victoria 3010, Australia
2
Tan Y, Zhou B, Zheng L, Fan G, Hong L. Semantical and geometrical protein encoding toward enhanced bioactivity and thermostability. eLife 2025; 13:RP98033. [PMID: 40314227 PMCID: PMC12048155 DOI: 10.7554/elife.98033] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/03/2025] Open
Abstract
Protein engineering is a pivotal aspect of synthetic biology, involving the modification of amino acids within existing protein sequences to achieve novel or enhanced functionalities and physical properties. Accurate prediction of protein variant effects requires a thorough understanding of protein sequence, structure, and function. Deep learning methods have demonstrated remarkable performance in guiding protein modification for improved functionality. However, existing approaches predominantly rely on protein sequences, which face challenges in efficiently encoding the geometric aspects of amino acids' local environment and often fall short in capturing crucial details related to protein folding stability, internal molecular interactions, and bio-functions. Furthermore, a fundamental evaluation of existing methods for predicting protein thermostability is lacking, although thermostability is a key physical property that is frequently investigated in practice. To address these challenges, this article introduces a novel pre-training framework that integrates sequential and geometric encoders for protein primary and tertiary structures. This framework guides mutation directions toward desired traits by simulating natural selection on wild-type proteins and evaluates variant effects based on their fitness to perform specific functions. We assess the proposed approach using three benchmarks comprising over 300 deep mutational scanning assays. The prediction results showcase exceptional performance across extensive experiments compared to other zero-shot learning methods, all while maintaining a minimal cost in terms of trainable parameters. This study not only proposes an effective framework for more accurate and comprehensive predictions to facilitate efficient protein engineering, but also enhances the in silico assessment system for future deep learning models to better align with empirical requirements.
The PyTorch implementation is available at https://github.com/ai4protein/ProtSSN.
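Zero-shot variant scoring of the kind described in this abstract is often implemented as a log-odds ratio between the mutant and wild-type residues under a model's per-position amino-acid distribution. The sketch below assumes that scoring rule, which the abstract does not spell out; `score_variant`, `log_probs`, and the toy probabilities are hypothetical and are not taken from the ProtSSN code.

```python
import math


def score_variant(log_probs, wild_type, mutations):
    """Sum log p(mutant) - log p(wild type) over the mutated positions.

    log_probs: one dict per position, mapping amino acid -> log-probability
               (assumed to come from some pre-trained model).
    wild_type: wild-type sequence as a string.
    mutations: list of (position, mutant_residue) pairs, 0-based positions.
    """
    score = 0.0
    for pos, mutant in mutations:
        native = wild_type[pos]
        score += log_probs[pos][mutant] - log_probs[pos][native]
    return score


# Toy two-residue protein with made-up model probabilities.
wild_type = "MK"
log_probs = [
    {"M": math.log(0.9), "L": math.log(0.1)},
    {"K": math.log(0.4), "R": math.log(0.6)},
]

favourable = score_variant(log_probs, wild_type, [(1, "R")])    # positive score
unfavourable = score_variant(log_probs, wild_type, [(0, "L")])  # negative score
```

A positive score means the model prefers the mutant residue to the wild type at that position; scores for multi-site variants are simply added, which assumes the sites are independent.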
Affiliation(s)
- Yang Tan
  - Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, China
- Bingxin Zhou
  - Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Jiao Tong University, Institute of Natural Sciences, Shanghai, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai Jiao Tong University, Shanghai, China
- Lirong Zheng
  - Shanghai Jiao Tong University, Institute of Natural Sciences, Shanghai, China
- Guisheng Fan
  - School of Information Science and Engineering, East China University of Science and Technology, Shanghai, China
- Liang Hong
  - Shanghai-Chongqing Institute of Artificial Intelligence, Shanghai Jiao Tong University, Chongqing, China
  - Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China
  - Shanghai Artificial Intelligence Laboratory, Shanghai, China
  - Shanghai Jiao Tong University, Institute of Natural Sciences, Shanghai, China
  - Shanghai National Center for Applied Mathematics (SJTU Center), Shanghai Jiao Tong University, Shanghai, China
3
Kawabata T, Kinoshita K. Assessing Structural Classification Using AlphaFold2 Models Through ECOD-Based Comparative Analysis. Proteins 2025. [PMID: 40251890 DOI: 10.1002/prot.26828] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2024] [Revised: 03/27/2025] [Accepted: 03/30/2025] [Indexed: 04/21/2025]
Abstract
Identifying homologous proteins is a fundamental task in structural bioinformatics. While AlphaFold2 has revolutionized protein structure prediction, the extent to which structure comparison of its models can reliably detect homologs remains unclear. In this study, we evaluate the feasibility of homology detection using AlphaFold2-predicted structures through structural comparisons. We considered the classification of the ECOD database for experimental structures as the correct standard and obtained their corresponding predicted models from AlphaFoldDB. To ensure blind assessment, we divided the structures into test and train sets according to their release date. Predicted and experimental 3D structures in the test and train sets were compared using 3D structure comparisons (MATRAS, Dali, and Foldseek) and sequence comparisons (BLAST and HHsearch). The results were evaluated based on the homology annotations in the ECOD database. For top-1 accuracy, the performance of structural comparisons was comparable to that of HHsearch. However, when considering metrics that included all structural pairs, including more remote homology, structural comparisons outperformed HHsearch. No significant differences were observed between comparisons of experimental versus experimental, predicted versus experimental, and predicted versus predicted structures with pLDDT (prediction confidence) values greater than 60. We also demonstrate that predicted models of proteins whose structures were determined by NMR had lower pLDDT values and contained fewer coils than their experimental counterparts. These findings highlight the potential of AlphaFold2 models in structural classification and suggest that 3D structural searches should be conducted not only against the PDB but also against AlphaFoldDB to identify more potential homologs.
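The top-1 evaluation with a pLDDT cut-off mentioned in this abstract can be illustrated with a small sketch. It assumes each query comes with a list of scored hits (as a tool such as Foldseek or Dali would produce) and that every structure carries a single homology-group label standing in for an ECOD classification. The function name, the data layout, and the toy values are hypothetical, not the authors' pipeline.

```python
def top1_accuracy(hits, labels, plddt=None, min_plddt=60.0):
    """Fraction of queries whose best-scoring hit shares their homology label.

    hits:   dict query_id -> list of (target_id, score); higher score = more similar.
    labels: dict structure_id -> homology group label.
    plddt:  optional dict query_id -> mean pLDDT; queries with a value not
            greater than min_plddt are skipped. Queries with no hits are skipped too.
    """
    correct = 0
    total = 0
    for query, candidates in hits.items():
        if plddt is not None and plddt.get(query, 0.0) <= min_plddt:
            continue
        if not candidates:
            continue
        best_target, _ = max(candidates, key=lambda pair: pair[1])
        total += 1
        if labels[best_target] == labels[query]:
            correct += 1
    return correct / total if total else 0.0


# Toy data: three queries searched against three targets.
hits = {
    "q1": [("t1", 12.0), ("t2", 8.5)],
    "q2": [("t2", 5.0), ("t3", 9.1)],
    "q3": [("t1", 3.0)],
}
labels = {"q1": "A", "q2": "B", "q3": "B", "t1": "A", "t2": "B", "t3": "C"}
plddt = {"q1": 92.0, "q2": 75.0, "q3": 41.0}

all_queries = top1_accuracy(hits, labels)           # q1 correct, q2 and q3 wrong
confident_only = top1_accuracy(hits, labels, plddt)  # q3 dropped by the cut-off
```

Metrics that consider all structural pairs, which the abstract reports separately, would need a ranking-based measure rather than this single best-hit check.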
Affiliation(s)
- Takeshi Kawabata
  - Graduate School of Information Sciences, Tohoku University, Sendai, Japan
- Kengo Kinoshita
  - Graduate School of Information Sciences, Tohoku University, Sendai, Japan
4
Tawfeeq C, Wang J, Khaniya U, Madej T, Song J, Abrol R, Youkharibache P. IgStrand: A universal residue numbering scheme for the immunoglobulin-fold (Ig-fold) to study Ig-proteomes and Ig-interactomes. PLoS Comput Biol 2025; 21:e1012813. [PMID: 40228037 DOI: 10.1371/journal.pcbi.1012813] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 09/12/2024] [Accepted: 01/20/2025] [Indexed: 04/16/2025] Open
Abstract
The Immunoglobulin fold (Ig-fold) is found in proteins from all domains of life and represents the most populous fold in the human genome, with current estimates ranging from 2 to 3% of protein coding regions. That proportion is much higher in the surfaceome where Ig and Ig-like domains orchestrate cell-cell recognition, adhesion and signaling. The ability of Ig-domains to reliably fold and self-assemble through highly specific interfaces represents a remarkable property of these domains, making them key elements of molecular interaction systems: the immune system, the nervous system, the vascular system and the muscular system. We define a universal residue numbering scheme, common to all domains sharing the Ig-fold in order to study the wide spectrum of Ig-domain variants constituting the Ig-proteome and Ig-Ig interactomes at the heart of these systems. The "IgStrand numbering scheme" enables the identification of Ig structural proteomes and interactomes in and between any species, and comparative structural, functional, and evolutionary analyses. We review how Ig-domains are classified today as topological and structural variants and highlight the "Ig-fold irreducible structural signature" shared by all of them. The IgStrand numbering scheme lays the foundation for the systematic annotation of structural proteomes by detecting and accurately labeling Ig-, Ig-like and Ig-extended domains in proteins, which are poorly annotated in current databases and opens the door to accurate machine learning. Importantly, it sheds light on the robust Ig protein folding algorithm used by nature to form beta sandwich supersecondary structures. The numbering scheme powers an algorithm implemented in the interactive structural analysis software iCn3D to systematically recognize Ig-domains, annotate them and perform detailed analyses comparing any domain sharing the Ig-fold in sequence, topology and structure, regardless of their diverse topologies or origin. 
The scheme provides a robust fold detection and labeling mechanism that reveals unsuspected structural homologies among protein structures beyond currently identified Ig- and Ig-like domain variants. Indeed, multiple folds classified independently contain a common structural signature, in particular jelly-rolls. Examples of folds that harbor an "Ig-extended" architecture are given. Applications in protein engineering around the Ig-architecture are straightforward based on the universal numbering.
Affiliation(s)
- Caesar Tawfeeq
  - Department of Chemistry and Biochemistry, California State University Northridge, Northridge, California, United States of America
- Jiyao Wang
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Umesh Khaniya
  - Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America
- Thomas Madej
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- James Song
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, United States of America
- Ravinder Abrol
  - Department of Chemistry and Biochemistry, California State University Northridge, Northridge, California, United States of America
- Philippe Youkharibache
  - Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, Maryland, United States of America
5
Meng L, Wei L, Wu R. MVGNN-PPIS: A novel multi-view graph neural network for protein-protein interaction sites prediction based on Alphafold3-predicted structures and transfer learning. Int J Biol Macromol 2025; 300:140096. [PMID: 39848362 DOI: 10.1016/j.ijbiomac.2025.140096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2024] [Revised: 01/04/2025] [Accepted: 01/17/2025] [Indexed: 01/25/2025]
Abstract
Protein-protein interactions (PPI) are crucial for understanding numerous biological processes and pathogenic mechanisms. Identifying interaction sites is essential for biomedical research and targeted drug development. Compared to experimental methods, accurate computational approaches for protein-protein interaction sites (PPIS) prediction can save significant time and costs. In this study, we propose a novel model named MVGNN-PPIS. To the best of our knowledge, it is the first model to combine structures predicted by AlphaFold3 with transfer learning techniques for predicting PPIS. This approach addresses the limitations of traditional methods that depend on native protein structures and multiple sequence alignments (MSA). Additionally, we introduce a multi-view graph framework based on two types of graph structures: the k-nearest neighbor graph and the adjacency matrix. By alternately employing a Graph Transformer and Graph Convolutional Networks (GCN) to aggregate node information, this framework effectively captures both local and global dependencies of each residue in the predicted structures, thereby significantly enhancing the model's sensitivity to binding sites. This framework further integrates direction, distance and angular information between the 3D coordinates of side-chain atom centroids to construct a relative coordinate system, generating enhanced edge features that ensure the model's equivariance to molecular translations and rotations in space. During training, the Focal Loss function is employed to effectively address the class imbalance in the dataset. Experimental results demonstrate that MVGNN-PPIS outperforms the current state-of-the-art methods across multiple PPIS benchmark datasets. To further validate the model's generalization capability, we extended MVGNN-PPIS to the domain of predicting protein-nucleic acid interaction sites, where it also achieved superior performance.
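The focal loss mentioned in this abstract down-weights easy examples so that the rare interface residues contribute more to training. Below is a minimal pure-Python sketch of the standard binary focal loss. The `gamma` and `alpha` defaults are the commonly used values from the original focal loss formulation and are an assumption here, since the abstract does not report the settings used for MVGNN-PPIS.

```python
import math


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    probs:   predicted probability of the positive class for each residue.
    targets: 1 for an interaction-site residue, 0 otherwise.
    gamma:   focusing parameter; gamma = 0 gives alpha-weighted cross-entropy.
    alpha:   weight of the positive class (negatives get 1 - alpha).
    """
    total = 0.0
    for p, y in zip(probs, targets):
        p_t = p if y == 1 else 1.0 - p          # probability of the true class
        alpha_t = alpha if y == 1 else 1.0 - alpha
        p_t = min(max(p_t, eps), 1.0)           # guard against log(0)
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# A confidently correct prediction contributes far less than a confident mistake.
easy = focal_loss([0.9], [1])
hard = focal_loss([0.1], [1])
```

In a real training loop this would be computed on tensors by the deep learning framework; the loop form is only meant to make each term of the formula visible.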
Affiliation(s)
- Lu Meng
  - College of Information Science and Engineering, Northeastern University, China
- Lishuai Wei
  - College of Information Science and Engineering, Northeastern University, China
- Rina Wu
  - College of Information Science and Engineering, Northeastern University, China
6
Norton T, Bhattacharya D. Sifting through the noise: A survey of diffusion probabilistic models and their applications to biomolecules. J Mol Biol 2025; 437:168818. [PMID: 39389290 PMCID: PMC11885034 DOI: 10.1016/j.jmb.2024.168818] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2024] [Revised: 09/20/2024] [Accepted: 10/03/2024] [Indexed: 10/12/2024]
Abstract
Diffusion probabilistic models have made their way into a number of high-profile applications since their inception. In particular, there has been a wave of research into using diffusion models in the prediction and design of biomolecular structures and sequences. Their growing ubiquity makes it imperative for researchers in these fields to understand them. This paper serves as a general overview of the theory behind these models and the current state of research. We first introduce diffusion models and discuss common motifs used when applying them to biomolecules. We then present the significant outcomes achieved through the application of these models in generative and predictive tasks. This survey aims to provide readers with a comprehensive understanding of the increasingly critical role of diffusion models.
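As a concrete illustration of the forward (noising) half of a diffusion probabilistic model, the sketch below draws a noised sample in closed form, x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps with eps drawn from a standard normal. The linear beta schedule and its endpoints follow the common DDPM convention and are assumptions; the survey covers many other parameterisations, and the flat coordinate list is a hypothetical stand-in for a biomolecular structure.

```python
import math
import random


def alpha_bar(t, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_s) for s = 1..t under a linear schedule."""
    product = 1.0
    for s in range(1, t + 1):
        beta = beta_start + (beta_end - beta_start) * (s - 1) / (num_steps - 1)
        product *= 1.0 - beta
    return product


def noise_sample(x0, t, rng):
    """Sample x_t given clean data x0 (a flat list of floats) at step t."""
    a = alpha_bar(t)
    signal = math.sqrt(a)          # how much of the clean data survives
    noise = math.sqrt(1.0 - a)     # how much Gaussian noise is mixed in
    return [signal * x + noise * rng.gauss(0.0, 1.0) for x in x0]


rng = random.Random(0)
coords = [1.5, -0.3, 2.2]               # toy flattened coordinates
slightly_noised = noise_sample(coords, 10, rng)
almost_pure_noise = noise_sample(coords, 1000, rng)
```

A generative model is then trained to reverse this process, typically by predicting the added noise from the noised sample and the step index.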
Affiliation(s)
- Trevor Norton
  - Department of Computer Science, Virginia Tech, Blacksburg, VA 24061, United States
7
Zhong J, Zou Z, Qiu J, Wang S. ScFold: a GNN-based model for efficient inverse folding of short-chain proteins via spatial reduction. Brief Bioinform 2025; 26:bbaf156. [PMID: 40205854 PMCID: PMC11982017 DOI: 10.1093/bib/bbaf156] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2024] [Revised: 02/24/2025] [Accepted: 03/19/2025] [Indexed: 04/11/2025] Open
Abstract
In the realm of protein design, the efficient construction of protein sequences that accurately fold into predefined structures has become an important area of research. Although advancements have been made in the study of long-chain proteins, the design of short-chain proteins requires equal consideration. The structural information inherent in short and single chains is typically less comprehensive than that of full-length chains, which can negatively impact their performance. To address this challenge, we introduce ScFold, a novel model that incorporates an innovative node module. This module utilizes spatial dimensionality reduction and positional encoding mechanisms to enhance the extraction of structural features. Experimental results indicate that ScFold achieves a recovery rate of 52.22% on the CATH4.2 dataset, demonstrating notable efficacy for short-chain proteins, with a recovery rate of 41.6%. ScFold further exhibits enhanced recovery rates of 59.32% and 61.59% on the TS50 and TS500 datasets, respectively, demonstrating its effectiveness across diverse protein types. We also performed protein length stratification on the TS500 and CATH4.2 datasets and tested ScFold on length-specific sub-datasets. The results confirm the model's superiority in handling short-chain proteins. Finally, we selected several protein sequence groups from the CATH4.2 dataset for structural visualization analysis and provided comparisons between the model-generated sequences and the target sequences.
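The recovery rate reported in this abstract is the usual inverse-folding metric: the fraction of positions at which the designed sequence reproduces the native residue. A minimal sketch follows; the function name and the two toy sequences are hypothetical, and how ScFold aggregates the metric over a dataset is not stated in the abstract.

```python
def sequence_recovery(designed, native):
    """Fraction of aligned positions where the designed residue equals the native one."""
    if len(designed) != len(native):
        raise ValueError("sequences must have the same length")
    if not native:
        return 0.0
    matches = sum(d == n for d, n in zip(designed, native))
    return matches / len(native)


# Six of the eight positions agree in this toy pair.
recovery = sequence_recovery("MKTAYIAK", "MKSAYLAK")
```

Length stratification of the kind described above would amount to grouping proteins by chain length before averaging this value within each group.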
Affiliation(s)
- Jiancheng Zhong
  - College of Information Science and Engineering, Hunan Normal University, 36 Lushan Road, Yuelu District, Changsha 410081, Hunan, China
- Zhiwei Zou
  - College of Information Science and Engineering, Hunan Normal University, 36 Lushan Road, Yuelu District, Changsha 410081, Hunan, China
- Jie Qiu
  - College of Information Science and Engineering, Hunan Normal University, 36 Lushan Road, Yuelu District, Changsha 410081, Hunan, China
- Shaokai Wang
  - Department of Mathematics, Hong Kong University of Science and Technology, Clear Water Bay, Hong Kong SAR, China
8
Träger LK, Degen M, Pereira J, Durairaj J, Teixeira R, Hiller S, Huguenin-Dezot N. Structural basis for cooperative ssDNA binding by bacteriophage protein filament P12. Nucleic Acids Res 2025; 53:gkaf132. [PMID: 40052821 PMCID: PMC11886824 DOI: 10.1093/nar/gkaf132] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2024] [Revised: 02/05/2025] [Accepted: 02/11/2025] [Indexed: 03/10/2025] Open
Abstract
Protein-primed DNA replication is a unique mechanism, bioorthogonal to other known DNA replication modes. It relies on specialised single-stranded DNA (ssDNA)-binding proteins (SSBs) to stabilise ssDNA intermediates by unknown mechanisms. Here, we present the structural and biochemical characterisation of P12, an SSB from bacteriophage PRD1. High-resolution cryo-electron microscopy reveals that P12 forms a unique, cooperative filament along ssDNA. Each protomer binds the phosphate backbone of 6 nucleotides in a sequence-independent manner, protecting ssDNA from nuclease degradation. Filament formation is driven by an intrinsically disordered C-terminal tail, facilitating cooperative binding. We identify residues essential for ssDNA interaction and link the ssDNA-binding ability of P12 to toxicity in host cells. Bioinformatic analyses place the P12 fold as a distinct branch within the OB-like fold family. This work offers new insights into protein-primed DNA replication and lays a foundation for biotechnological applications.
Affiliation(s)
- Lena K Träger
  - Department of Biosystems Science and Engineering, ETH Zurich, Schanzenstrasse 44, 4056 Basel, Switzerland
- Morris Degen
  - Biozentrum, University of Basel, Spitalstrasse 41, 4056 Basel, Switzerland
  - Swiss Nanoscience Institute, University of Basel, Klingelbergstrasse 82, 4056 Basel, Switzerland
- Joana Pereira
  - Biozentrum, University of Basel, Spitalstrasse 41, 4056 Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Basel, Elisabethenstrasse 43, 4051 Basel, Switzerland
- Janani Durairaj
  - Biozentrum, University of Basel, Spitalstrasse 41, 4056 Basel, Switzerland
  - SIB Swiss Institute of Bioinformatics, University of Basel, Elisabethenstrasse 43, 4051 Basel, Switzerland
- Sebastian Hiller
  - Biozentrum, University of Basel, Spitalstrasse 41, 4056 Basel, Switzerland
- Nicolas Huguenin-Dezot
  - Department of Biosystems Science and Engineering, ETH Zurich, Schanzenstrasse 44, 4056 Basel, Switzerland
9
Ambreen S, Umar M, Noor A, Jain H, Ali R. Advanced AI and ML frameworks for transforming drug discovery and optimization: With innovative insights in polypharmacology, drug repurposing, combination therapy and nanomedicine. Eur J Med Chem 2025; 284:117164. [PMID: 39721292 DOI: 10.1016/j.ejmech.2024.117164] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2024] [Revised: 11/24/2024] [Accepted: 11/27/2024] [Indexed: 12/28/2024]
Abstract
Artificial Intelligence (AI) and Machine Learning (ML) are transforming drug discovery by overcoming traditional challenges such as high costs, long timelines, and frequent failures. AI-driven approaches streamline key phases, including target identification, lead optimization, de novo drug design, and drug repurposing. Frameworks such as deep neural networks (DNNs), convolutional neural networks (CNNs), and deep reinforcement learning (DRL) models have shown promise in identifying drug targets, optimizing delivery systems, and accelerating drug repurposing. Generative adversarial networks (GANs) and variational autoencoders (VAEs) aid de novo drug design by creating novel drug-like compounds with desired properties. Case studies, such as DDR1 kinase inhibitors designed using generative models and CDK20 inhibitors developed via structure-based methods, highlight AI's ability to produce highly specific therapeutics. Models like SNF-CVAE and DeepDR further advance drug repurposing by uncovering new therapeutic applications for existing drugs. Advanced ML algorithms enhance precision in predicting drug efficacy, toxicity, and ADME-Tox properties, reducing development costs and improving drug-target interactions. AI also supports polypharmacology by optimizing multi-target drug interactions and enhances combination therapy through predictions of drug synergies and antagonisms. In nanomedicine, AI models like CURATE.AI and the Hartung algorithm optimize personalized treatments by predicting toxicological risks and real-time dosing adjustments with high accuracy. Despite this potential, challenges such as data quality, model interpretability, and ethical concerns must be addressed. High-quality datasets, transparent models, and unbiased algorithms are essential for reliable AI applications. As AI continues to evolve, it is poised to revolutionize drug discovery and personalized medicine, advancing therapeutic development and patient care.
Affiliation(s)
- Subiya Ambreen
  - Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
- Mohammad Umar
  - Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
- Aaisha Noor
  - Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
- Himangini Jain
  - Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
- Ruhi Ali
  - Department of Pharmaceutical Chemistry, Delhi Institute of Pharmaceutical Sciences and Research (DIPSAR), DPSRU, Pushp Vihar, New Delhi, 110017, India
10
Lu T, Liu M, Chen Y, Kim J, Huang PS. Assessing Generative Model Coverage of Protein Structures with SHAPES. bioRxiv 2025:2025.01.09.632260. [PMID: 39868321 PMCID: PMC11761634 DOI: 10.1101/2025.01.09.632260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2025]
Abstract
Recent advances in generative modeling enable efficient sampling of protein structures, but their tendency to optimize for designability imposes a bias toward idealized structures at the expense of loops and other complex structural motifs critical for function. We introduce SHAPES (Structural and Hierarchical Assessment of Proteins with Embedding Similarity) to evaluate five state-of-the-art generative models of protein structures. Using structural embeddings across multiple structural hierarchies, ranging from local geometries to global protein architectures, we reveal substantial undersampling of the observed protein structure space by these models. We use Fréchet Protein Distance (FPD) to quantify distributional coverage. Different models are distinct in their coverage behavior across different sampling noise scales and temperatures; the frequency of TERtiary Motifs (TERMs) further supports the observations. More robust sequence design and structure prediction methods are likely crucial in guiding the development of models with improved coverage of the designable protein space.
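A Fréchet-type distance such as the FPD named in this abstract compares two sets of embeddings by fitting a Gaussian to each and measuring the distance between the fits. The general formula needs a matrix square root of the covariance product; the sketch below deliberately uses a simplified variant that assumes diagonal covariances, so it runs with the standard library only. This simplification, the function names, and the toy embeddings are assumptions for illustration and not the SHAPES implementation.

```python
import math


def mean_and_var(vectors):
    """Per-dimension mean and population variance of a list of equal-length vectors."""
    n = len(vectors)
    dim = len(vectors[0])
    mean = [sum(v[i] for v in vectors) / n for i in range(dim)]
    var = [sum((v[i] - mean[i]) ** 2 for v in vectors) / n for i in range(dim)]
    return mean, var


def frechet_distance_diag(set_a, set_b):
    """Frechet distance between Gaussian fits, assuming diagonal covariances.

    With diagonal covariances the trace term reduces to
    sum_i (sqrt(var_a[i]) - sqrt(var_b[i]))**2.
    """
    mu_a, var_a = mean_and_var(set_a)
    mu_b, var_b = mean_and_var(set_b)
    mean_term = sum((x - y) ** 2 for x, y in zip(mu_a, mu_b))
    cov_term = sum((math.sqrt(x) - math.sqrt(y)) ** 2 for x, y in zip(var_a, var_b))
    return mean_term + cov_term


# Toy 2-D embeddings: the "generated" set is the "reference" set shifted by 3 along x.
reference = [[0.0, 0.0], [2.0, 0.0]]
generated = [[3.0, 0.0], [5.0, 0.0]]
distance = frechet_distance_diag(reference, generated)
```

A value of zero means the two Gaussian fits coincide; larger values indicate that the generated structures cover a different region of embedding space than the reference set.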
Affiliation(s)
- Tianyu Lu
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
  - Equal contribution
- Melissa Liu
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
  - Equal contribution
- Yilin Chen
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
- Jinho Kim
  - Department of Physics, Stanford University, Stanford, CA, USA
- Po-Ssu Huang
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
11
Paul SK, Saddam M, Tabassum N, Hasan M. Molecular dynamics simulation of wild and mutant proteasome subunit beta type 8 (PSMB8) protein: Implications for restoration of inflammation in experimental autoimmune encephalomyelitis pathogenesis. Heliyon 2025; 11:e41166. [PMID: 39802026 PMCID: PMC11719297 DOI: 10.1016/j.heliyon.2024.e41166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/18/2024] [Revised: 12/03/2024] [Accepted: 12/11/2024] [Indexed: 01/16/2025] Open
Abstract
Multiple sclerosis (MS) is a chronic autoimmune disease of the brain and spinal cord. Its inflammatory progression is characterized by hallmark inflammatory plaques. The histological and clinical characteristics of MS are shared by experimental autoimmune encephalomyelitis (EAE). Genetic and environmental factors contribute to the development of MS. In EAE-MS disease, the level of proteasome subunit beta type-8 (PSMB8), encoded by the PSMB8 gene, is increased and regulates the inflammatory response. In humans, Nakajo-Nishimura syndrome is caused by a mutation in the PSMB8 gene, which encodes a subunit of the immunoproteasome. Therefore, special attention to the wild-type and mutant (G210V) PSMB8 proteins is imperative. In this study, we performed a 100 ns molecular dynamics (MD) simulation for wild-type PSMB8 and the G210V mutant. Then, we analyzed the fundamental and essential simulation results using a separate Google Colab system. The energy analysis confirms the structural deviation due to the point mutation. The fundamental trajectory analyses (RMSD, RMSF, and Rg) indicate that the G210V mutant protein is more flexible and less stable than the wild type. We observed the conformational changes due to mutation by analyzing the RMSD average linkage hierarchical clustering, total SASA, and SASA autocorrelation. Principal component analysis shows that the protein's overall motion and the atoms' precise locations differ between the wild type and the mutant. Our study provides valuable insights into the dynamics and structure of this protein, which can aid in further understanding its biological functions and potential implications for disease.
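The three basic trajectory descriptors named in this abstract (RMSD, RMSF, and Rg) can be written out in a few lines. The sketch below assumes frames that are already superimposed on the reference and treats all atoms as having equal mass; a real analysis would align each frame first and would normally use a trajectory library. The function names and toy coordinates are hypothetical.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two conformations (no superposition)."""
    sq = sum((a - b) ** 2 for pa, pb in zip(coords_a, coords_b) for a, b in zip(pa, pb))
    return math.sqrt(sq / len(coords_a))


def radius_of_gyration(coords):
    """Rg about the geometric centre, with equal weight for every atom."""
    n = len(coords)
    centre = [sum(p[k] for p in coords) / n for k in range(3)]
    sq = sum((p[k] - centre[k]) ** 2 for p in coords for k in range(3))
    return math.sqrt(sq / n)


def rmsf(frames):
    """Per-atom root-mean-square fluctuation about the time-averaged position."""
    n_frames = len(frames)
    result = []
    for i in range(len(frames[0])):
        mean = [sum(f[i][k] for f in frames) / n_frames for k in range(3)]
        msd = sum((f[i][k] - mean[k]) ** 2 for f in frames for k in range(3)) / n_frames
        result.append(math.sqrt(msd))
    return result


# Toy two-atom system: the second frame is the first shifted by 1 along x.
frame0 = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
frame1 = [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0)]

deviation = rmsd(frame0, frame1)
compactness = radius_of_gyration(frame0)
fluctuation = rmsf([frame0, frame1])
```

In a comparison like the one described above, a higher RMSD and RMSF for one protein over the trajectory would be read as greater flexibility, while Rg tracks how compact the fold stays.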
Affiliation(s)
- Shamrat Kumar Paul
- Department of Biochemistry and Molecular Biology, Life Science Faculty, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, 8100, Bangladesh
| | - Md Saddam
- Department of Biochemistry and Molecular Biology, Life Science Faculty, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, 8100, Bangladesh
| | - Nisat Tabassum
- Department of Biotechnology and Genetic Engineering, Life Science Faculty, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, 8100, Bangladesh
| | - Mahbub Hasan
- Department of Biochemistry and Molecular Biology, Life Science Faculty, Bangabandhu Sheikh Mujibur Rahman Science and Technology University, Gopalganj, 8100, Bangladesh
| |
|
12
|
Sun J, Zhu T, Cui Y, Wu B. Structure-based self-supervised learning enables ultrafast protein stability prediction upon mutation. Innovation (N Y) 2025; 6:100750. [PMID: 39872490 PMCID: PMC11763918 DOI: 10.1016/j.xinn.2024.100750] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2024] [Accepted: 12/02/2024] [Indexed: 01/30/2025]
Abstract
Predicting free energy changes (ΔΔG) is essential for enhancing our understanding of protein evolution and plays a pivotal role in protein engineering and pharmaceutical development. While traditional methods offer valuable insights, they are often constrained by computational speed and reliance on biased training datasets. These constraints become particularly evident when aiming for accurate ΔΔG predictions across a diverse array of protein sequences. Herein, we introduce Pythia, a self-supervised graph neural network specifically designed for zero-shot ΔΔG predictions. Our comparative benchmarks demonstrate that Pythia outperforms other self-supervised pretraining models and force field-based approaches while also exhibiting competitive performance with fully supervised models. Notably, Pythia shows strong correlations and achieves a remarkable increase in computational speed of up to 10⁵-fold. We further validated Pythia's performance in predicting the thermostabilizing mutations of limonene epoxide hydrolase, leading to higher experimental success rates. This exceptional efficiency has enabled us to explore 26 million high-quality protein structures, marking a significant advancement in our ability to navigate the protein sequence space and enhance our understanding of the relationships between protein genotype and phenotype. In addition, we established a web server at https://pythia.wulab.xyz to allow users to easily perform such predictions.
Affiliation(s)
- Jinyuan Sun
- AIM Center, College of Life Sciences and Technology, Beijing University of Chemical Technology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Tong Zhu
- AIM Center, College of Life Sciences and Technology, Beijing University of Chemical Technology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Yinglu Cui
- AIM Center, College of Life Sciences and Technology, Beijing University of Chemical Technology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| | - Bian Wu
- AIM Center, College of Life Sciences and Technology, Beijing University of Chemical Technology, Institute of Microbiology, Chinese Academy of Sciences, Beijing, China
| |
|
13
|
Waman V, Bordin N, Lau A, Kandathil S, Wells J, Miller D, Velankar S, Jones D, Sillitoe I, Orengo C. CATH v4.4: major expansion of CATH by experimental and predicted structural data. Nucleic Acids Res 2025; 53:D348-D355. [PMID: 39565206 PMCID: PMC11701635 DOI: 10.1093/nar/gkae1087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2024] [Revised: 10/18/2024] [Accepted: 10/24/2024] [Indexed: 11/21/2024]
Abstract
CATH (https://www.cathdb.info) is a structural classification database that assigns domains to the structures in the Protein Data Bank (PDB) and AlphaFold Protein Structure Database (AFDB) and adds layers of biological information, including homology and functional annotation. This article covers developments in the CATH classification since 2021. We report the significant expansion of structural information (180-fold) for CATH superfamilies through classification of PDB domains and predicted domain structures from the Encyclopedia of Domains (TED) resource. TED provides information on predicted domains in AFDB. CATH v4.4 represents an expansion of ∼64 844 experimentally determined domain structures from PDB. We also present a mapping of ∼90 million predicted domains from TED to CATH superfamilies. New PDB and TED data increase the number of superfamilies from 5841 to 6573, folds from 1349 to 2078 and architectures from 41 to 77. TED data comprise predicted structures, so these new folds and architectures remain hypothetical until experimentally confirmed. CATH also classifies domains into functional families (FunFams) within a superfamily. We have updated sequences in FunFams by scanning FunFam-HMMs against UniProt release 2024_02, giving a 276% increase in FunFams coverage. The mapping of TED structural domains has resulted in a 4-fold increase in FunFams with structural information.
Affiliation(s)
- Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Andy Lau
- Department of Computer Science, University College London, London WC1E 6BT, UK
- InstaDeep Ltd, 5 Merchant Square, London W2 1AY, UK
| | - Shaun Kandathil
- Department of Computer Science, University College London, London WC1E 6BT, UK
| | - Jude Wells
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Centre for Artificial Intelligence, University College London, London WC1V 6BH, UK
| | - David Miller
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Centre for Artificial Intelligence, University College London, London WC1V 6BH, UK
| | - Sameer Velankar
- Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
| | - David T Jones
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Department of Computer Science, University College London, London WC1E 6BT, UK
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| |
|
14
|
Kim J, Woo J, Park JY, Kim KJ, Kim D. Deep learning for NAD/NADP cofactor prediction and engineering using transformer attention analysis in enzymes. Metab Eng 2025; 87:86-94. [PMID: 39571721 DOI: 10.1016/j.ymben.2024.11.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Revised: 09/25/2024] [Accepted: 11/17/2024] [Indexed: 12/13/2024]
Abstract
Understanding and manipulating the cofactor preferences of NAD(P)-dependent oxidoreductases, the most widely distributed enzyme group in nature, is increasingly crucial in bioengineering. However, large-scale identification of cofactor preferences and the design of mutants to switch cofactor specificity remain complex tasks. Here, we introduce DISCODE (Deep learning-based Iterative pipeline to analyze Specificity of COfactors and to Design Enzyme), a novel transformer-based deep learning model to predict NAD(P) cofactor preferences. For model training, a total of 7,132 NAD(P)-dependent enzyme sequences were collected. Leveraging whole-length sequence information, DISCODE classifies the cofactor preferences of NAD(P)-dependent oxidoreductase protein sequences without structural or taxonomic limitation. The model achieved an accuracy of 97.4% and an F1 score of 97.3%. A notable feature of DISCODE is the interpretability of its transformer layers. Analysis of attention layers in the model enables identification of several residues that showed significantly higher attention weights. These residues were well aligned with structurally important residues that closely interact with NAD(P), facilitating the identification of key residues for determining cofactor specificities. These key residues showed high consistency with verified cofactor-switching mutants. Integrated into an enzyme design pipeline, DISCODE, coupled with attention analysis, enables a fully automated approach to redesign cofactor specificity.
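For readers unfamiliar with the two figures of merit quoted in this abstract, the sketch below shows how accuracy and F1 score are conventionally computed for a two-class problem such as NAD versus NADP preference. It is an illustration only: the function name and the choice of "NADP" as the positive class are assumptions, and the abstract does not say whether the reported F1 is per-class or averaged.

```python
def accuracy_and_f1(y_true, y_pred, positive="NADP"):
    """Return (accuracy, F1) for binary labels.
    F1 is computed for the class named by `positive` (an assumption here)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return accuracy, 0.0
    return accuracy, 2 * precision * recall / (precision + recall)

# Toy labels, invented purely for illustration.
truth = ["NADP", "NADP", "NAD", "NAD"]
guess = ["NADP", "NAD", "NAD", "NAD"]
acc, f1 = accuracy_and_f1(truth, guess)
```

Because F1 combines precision and recall, it can differ noticeably from accuracy on imbalanced data; the near-identical values reported in the abstract suggest the two cofactor classes were predicted about equally well.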
Affiliation(s)
- Jaehyung Kim
- School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
| | - Jihoon Woo
- School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
| | - Joon Young Park
- School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea
| | - Kyung-Jin Kim
- School of Life Sciences, BK21 FOUR KNU Creative BioResearch Group, KNU Institute of Microbiology, Kyungpook National University, Daegu, 41566, Republic of Korea
| | - Donghyuk Kim
- School of Energy and Chemical Engineering, Ulsan National Institute of Science and Technology (UNIST), Ulsan, 44919, Republic of Korea.
| |
|
15
|
Kister AE. Beta Sandwich-Like Folds: Sequences, Contacts, Classification of Invariant Substructures and Beta Sandwich Protein Grammar. Methods Mol Biol 2025; 2870:51-62. [PMID: 39543030 DOI: 10.1007/978-1-0716-4213-9_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2024]
Abstract
This chapter addresses the following fundamental question: Do sequences of protein domains with sandwich architecture have common sequence characteristics even though they belong to different superfamilies and folds? The analysis was carried out in two stages: (1) determination of domain substructures shared by all sandwich proteins and (2) detection of common sequence characteristics within the substructures. Analysis of supersecondary structures in domains of proteins revealed two types of four-strand substructures that are common to sandwich proteins. At least one of these common substructures was found in proteins of 42 sandwich-like folds (per structural classification in the CATH database). A comparison of sequence fragments and residue-residue contacts constituting common substructures revealed specific distributions of hydrophobic residues in these chains. The shared sequences and structural characteristics can be conceptualized as the "grammatical rules of beta protein linguistics." Understanding the structural and sequence commonalities of sandwich proteins may prove useful for rational protein design.
|
16
|
Górna MW, Merski M. Discovery and Analysis of Repeat and Low-Complexity Architectures in Proteins and Their Conserved Evolutionary Relationships Using Self-Homology Dot Plots. Methods Mol Biol 2025; 2870:95-116. [PMID: 39543033 DOI: 10.1007/978-1-0716-4213-9_7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2024]
Abstract
Proteins that contain sequence repetitions and low complexity regions can be analyzed using self-homology dot plot analysis. Dot plots can readily identify protein sequence repeats; the number of repeats and their length and location within the protein sequence are readily identifiable from the dot plots without the need to pre-define any of these attributes, making this method largely model-independent. We discuss the criteria for statistical identification of protein repeats and recommend simple ways of identifying protein repeats. While higher levels of sequence conservation within the repeats do make them easier to formally identify, this method can identify protein repeats with fairly low levels of conservation, as well as notably non-tandem repetitions with sizeable sections of complex, non-repeat sequence separating the individual repeat instances. Furthermore, even simple visual examination of these dot plots can discover conserved patterns within families of closely related proteins, and the level of this conservation can be readily quantified using a Jaccard index. Exhaustive pairwise comparisons can be assembled using hierarchical clustering methods to get a picture of the conserved repeat architectures within families of repeat proteins.
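The two ideas in this abstract, a self-comparison dot plot and a Jaccard index between plots, can be sketched in a few lines. This is a deliberately simplified illustration and not the chapter's method: it marks a dot only where two fixed-length windows are identical, whereas a real self-homology analysis would score similarity with a substitution matrix and a statistical threshold. The function names and the window length are assumptions.

```python
import numpy as np

def self_dot_plot(seq, window=3):
    """Boolean matrix with True at (i, j) when the window starting at i
    is identical to the window starting at j (exact-match simplification)."""
    n = len(seq) - window + 1
    words = [seq[i:i + window] for i in range(n)]
    plot = np.zeros((n, n), dtype=bool)
    for i in range(n):
        for j in range(n):
            plot[i, j] = words[i] == words[j]
    return plot

def repeat_hits(plot):
    """Off-diagonal matches (i < j); the main diagonal is trivially True."""
    rows, cols = np.nonzero(np.triu(plot, k=1))
    return set(zip(rows.tolist(), cols.tolist()))

def jaccard(a, b):
    """Jaccard index |A ∩ B| / |A ∪ B|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Toy sequences, invented for illustration.
hits_a = repeat_hits(self_dot_plot("ABCABC"))
hits_b = repeat_hits(self_dot_plot("ABCXABC"))
similarity = jaccard(hits_a, hits_b)
```

In the toy case the repeat sits at a different offset in each sequence, so the two hit sets do not overlap; comparing related proteins in practice would require aligning the plots to a common coordinate frame first.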
Affiliation(s)
- Maria W Górna
- Structural Biology Group, Biological and Chemical Research Centre, Faculty of Chemistry, University of Warsaw, Warsaw, Poland
| | - Matthew Merski
- i3S - Instituto de Investigação e Inovação em Saúde, Universidade do Porto, Porto, Portugal
| |
|
17
|
Perin C, Cretin G, Gelly JC. Hierarchical Analysis of Protein Structures: From Secondary Structures to Protein Units and Domains. Methods Mol Biol 2025; 2870:357-370. [PMID: 39543044 DOI: 10.1007/978-1-0716-4213-9_18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2024]
Abstract
The three-dimensional structure of proteins is traditionally organized into hierarchical levels, specifically secondary structures and domains. However, different studies suggest the existence of intermediate levels, such as Protein Units (PUs), which provide a refined understanding of protein architecture. PUs, characterized by their compactness and independence, serve as an intermediate organizational level, bridging the gap between secondary structures and domains. This new view not only enhances our comprehension of protein structure, folding, and evolutionary mechanisms but also provides a robust methodology for identifying and categorizing protein domains. Based on the concept of PUs, alternative structural partitioning solutions can be proposed that address the structural ambiguity of proteins, leading to more meaningful domain identification.
Affiliation(s)
- Charlotte Perin
- TBI, Université de Toulouse, CNRS, INRAE, INSA, Toulouse, France
| | | | | |
|
18
|
Wang J, Abrol R, Youkharibache P. Ig or Not Ig? That Is the Question: The Nucleating Supersecondary Structure of the Ig-Fold and the Extended Ig Universe. Methods Mol Biol 2025; 2870:371-396. [PMID: 39543045 DOI: 10.1007/978-1-0716-4213-9_19] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2024]
Abstract
Observing the omnipresence of the Ig-fold in all domains of life, one may wonder why this fold among all is such a wunderkind of evolution. Culminating in vertebrates, it enables a myriad of functions at the heart of the immune, nervous, vascular, and muscular systems. We suggest the Ig-fold resilience lies in the robust folding of a core supersecondary structure (SSS) that can accommodate a myriad of topological variations. In this chapter, we focus on the core supersecondary structure common to all topostructural variants of the Ig-fold and will see that this pattern can also be found in other β-sandwich folds. It represents a highly resilient central SSS that accommodates a very high plasticity observed among β-sandwiches. We have recently developed a universal numbering system to identify and annotate Ig-domains, Ig-like domains, and what we now call Ig-extended domains, i.e., β-sandwiches that contain and extend the Ig-fold topology (to be published). A universal numbering scheme, common to all topological and structural variants of any domain sharing the Ig-fold, allows a direct comparison of any Ig, Ig-like, and Ig-extended domain in sequence, topology, and structure. This can therefore help understand the robust patterns in Ig-folding and interactions with other Ig or non-Ig proteins, as well as help trace evolutionary patterns of immunoglobulin domains. The universal numbering scheme, called IgStrand, is now at the heart of an algorithm that can label secondary structure elements of the Ig-fold for any topological variant. It is implemented in the open-source web-based iCn3D program from NCBI (Wang, Youkharibache, Zhang, Lanczycki, Geer, Madej, Phan, Ward, Lu, Marchler, Bioinformatics 36:131-135, 2020). Interestingly, that algorithm captures SSS homologies across a very large spectrum of β-sandwiches, and one can envision classifying numerous such sandwiches as "Ig-extended" domains and their variable topological arrangements. 
In this chapter, we go through examples of Ig, Ig-like, and Ig-extended domains as in a journey through cells: in the cell nucleus, in the cytoplasm, or on extracellular regions of cell surface receptors, and in viruses.
Affiliation(s)
- Jiyao Wang
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Ravinder Abrol
- Department of Chemistry and Biochemistry, California State University, Northridge, CA, USA
| | - Philippe Youkharibache
- Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
| |
|
19
|
Le Berre M, Tubiana T, Reuterswärd Waldner P, Lazar N, Li de la Sierra-Gallay I, Santos JM, Llinás M, Nessler S. Structural characterization of the ACDC domain from ApiAP2 proteins, a potential molecular target against apicomplexan parasites. Acta Crystallogr D Struct Biol 2025; 81:38-48. [PMID: 39820027 PMCID: PMC11740583 DOI: 10.1107/s2059798324012518] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2024] [Accepted: 12/28/2024] [Indexed: 01/19/2025]
Abstract
The apicomplexan AP2 (ApiAP2) proteins are the best characterized family of DNA-binding proteins in Plasmodium spp. malaria parasites. Apart from the AP2 DNA-binding domain, there is little sequence similarity between ApiAP2 proteins. However, a conserved AP2-coincident domain mostly at the C-terminus (ACDC domain) is observed in a subset of the ApiAP2 proteins. The structure and function of this domain remain unknown. We report two crystal structures of ACDC domains derived from distinct Plasmodium ApiAP2 proteins, revealing a conserved, unique, noncanonical, four-helix bundle architecture. We used these structures to perform in silico docking calculations against a library of known antimalarial compounds and identified potential small-molecule ligands that bind in a highly conserved hydrophobic pocket that is present in all apicomplexan ACDC domains. These ligands provide a new molecular basis for the future design of ACDC inhibitors.
Affiliation(s)
- Marine Le Berre
- Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CEA, CNRS, 91198 Gif-sur-Yvette, France
| | - Thibault Tubiana
- Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CEA, CNRS, 91198 Gif-sur-Yvette, France
| | - Philippa Reuterswärd Waldner
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, State College, PA 16802, USA
- Huck Center for Malaria Research, The Pennsylvania State University, State College, PA 16802, USA
| | - Noureddine Lazar
- Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CEA, CNRS, 91198 Gif-sur-Yvette, France
| | - Ines Li de la Sierra-Gallay
- Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CEA, CNRS, 91198 Gif-sur-Yvette, France
| | - Joana M. Santos
- Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CEA, CNRS, 91198 Gif-sur-Yvette, France
| | - Manuel Llinás
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, State College, PA 16802, USA
- Huck Center for Malaria Research, The Pennsylvania State University, State College, PA 16802, USA
- Department of Chemistry, The Pennsylvania State University, State College, PA 16802, USA
| | - Sylvie Nessler
- Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, CEA, CNRS, 91198 Gif-sur-Yvette, France
| |
|
20
|
Caetano-Anollés G, Mughal F, Aziz MF, Caetano-Anollés K. Tracing the birth and intrinsic disorder of loops and domains in protein evolution. Biophys Rev 2024; 16:723-735. [PMID: 39830125 PMCID: PMC11735766 DOI: 10.1007/s12551-024-01251-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 10/11/2024] [Accepted: 10/29/2024] [Indexed: 01/22/2025]
Abstract
Protein loops and structural domains are building blocks of molecular structure. They hold evolutionary memory and are largely responsible for the many functions and processes that drive the living world. Here, we briefly review two decades of phylogenomic data-driven research focusing on the emergence and evolution of these elemental architects of protein structure. Phylogenetic trees of domains reconstructed from the proteomes of organisms belonging to all three superkingdoms and viruses were used to build chronological timelines describing the origin of each domain and its embedded loops at different levels of structural abstraction. These timelines consistently recovered six distinct evolutionary phases and a most parsimonious evolutionary progression of cellular life. The timelines also traced the birth of domain structures from loops, which allowed their growth to be modeled ab initio with AlphaFold2. Accretion decreased the disorder of the growing molecules, suggesting that disorder is molecular size-dependent. A phylogenomic survey of disorder revealed that loops and domains evolved differently. Loops were highly disordered, disorder increased early in evolution, and ordered and moderately disordered structures were derived. Gradual replacement of loops with α-helix and β-strand bracing structures over time paved the way for the dominance of more disordered loop types. In contrast, ancient domains were ordered, with disorder evolving as a benefit acquired later in evolution. These evolutionary patterns explain inverse correlations between disorder and sequence length of loops and domains. Our findings provide a deep evolutionary view of the link between structure, disorder, flexibility, and function.
Affiliation(s)
- Gustavo Caetano-Anollés
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences and Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
| | - Fizza Mughal
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences and Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
| | - M. Fayez Aziz
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences and Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
| | - Kelsey Caetano-Anollés
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences and Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801 USA
- Callout Biotech, Albuquerque, NM 87112 USA
| |
|
21
|
Douglas J, Cui H, Perona JJ, Vargas‐Rodriguez O, Tyynismaa H, Carreño CA, Ling J, Ribas de Pouplana L, Yang X, Ibba M, Becker H, Fischer F, Sissler M, Carter CW, Wills PR. AARS Online: A collaborative database on the structure, function, and evolution of the aminoacyl-tRNA synthetases. IUBMB Life 2024; 76:1091-1105. [PMID: 39247978 PMCID: PMC11580382 DOI: 10.1002/iub.2911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/14/2024] [Accepted: 08/07/2024] [Indexed: 09/10/2024]
Abstract
The aminoacyl-tRNA synthetases (aaRS) are a large group of enzymes that implement the genetic code in all known biological systems. They attach amino acids to their cognate tRNAs, moonlight in various translational and non-translational activities beyond aminoacylation, and are linked to many genetic disorders. The aaRS have a subtle ontology characterized by structural and functional idiosyncrasies that vary from organism to organism, and protein to protein. Across the tree of life, the 22 coded amino acids are handled by 16 evolutionary families of Class I aaRS and 21 families of Class II aaRS. We introduce AARS Online, an interactive Wikipedia-like tool curated by an international consortium of field experts. This platform systematizes existing knowledge about the aaRS by showcasing a taxonomically diverse selection of aaRS sequences and structures. Through its graphical user interface, AARS Online facilitates a seamless exploration between protein sequence and structure, providing a friendly introduction to the material for non-experts and a useful resource for experts. Curated multiple sequence alignments can be extracted for downstream analyses. Accessible at www.aars.online, AARS Online is a free resource to delve into the world of the aaRS.
Affiliation(s)
- Jordan Douglas
- Department of Physics, University of Auckland, New Zealand
- Centre for Computational Evolution, University of Auckland, New Zealand
| | - Haissi Cui
- Department of Chemistry, University of Toronto, Canada
| | - John J. Perona
- Department of Chemistry, Portland State University, Portland, Oregon, USA
| | - Oscar Vargas‐Rodriguez
- Department of Molecular Biology and Biophysics, University of Connecticut, Storrs, Connecticut, USA
| | - Henna Tyynismaa
- Stem Cells and Metabolism Research Program, Faculty of Medicine, University of Helsinki, Finland
| | | | - Jiqiang Ling
- Department of Cell Biology and Molecular Genetics, University of Maryland, College Park, Maryland, USA
| | - Lluís Ribas de Pouplana
- Institute for Research in Biomedicine, The Barcelona Institute of Science and Technology, Barcelona, Catalonia, Spain
- Catalan Institution for Research and Advanced Studies, Barcelona, Catalonia, Spain
| | - Xiang‐Lei Yang
- Department of Molecular Medicine, The Scripps Research Institute, La Jolla, California, USA
| | - Michael Ibba
- Biological Sciences, Chapman University, Orange, California, USA
| | - Hubert Becker
- Génétique Moléculaire, Génomique Microbiologique, University of Strasbourg, France
| | - Frédéric Fischer
- Génétique Moléculaire, Génomique Microbiologique, University of Strasbourg, France
| | - Marie Sissler
- Génétique Moléculaire, Génomique Microbiologique, University of Strasbourg, France
| | - Charles W. Carter
- Department of Biochemistry and Biophysics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
| | - Peter R. Wills
- Department of Physics, University of Auckland, New Zealand
- Centre for Computational Evolution, University of Auckland, New Zealand
| |
|
22
|
Mi Y, Marcu SB, Tabirca S, Yallapragada VV. PS-GO parametric protein search engine. Comput Struct Biotechnol J 2024; 23:1499-1509. [PMID: 38633387 PMCID: PMC11021831 DOI: 10.1016/j.csbj.2024.04.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2024] [Revised: 04/01/2024] [Accepted: 04/01/2024] [Indexed: 04/19/2024]
Abstract
With the explosive growth of protein-related data, we are confronted with a critical scientific inquiry: How can we effectively retrieve, compare, and profoundly comprehend these protein structures to maximize the utilization of such data resources? PS-GO, a parametric protein search engine, has been specifically designed and developed to maximize the utilization of the rapidly growing volume of protein-related data. This innovative tool addresses the critical need for effective retrieval, comparison, and deep understanding of protein structures. By integrating computational biology, bioinformatics, and data science, PS-GO is capable of managing large-scale data and accurately predicting and comparing protein structures and functions. The engine is built upon the concept of parametric protein design, a computer-aided method that adjusts and optimizes protein structures and sequences to achieve desired biological functions and structural stability. PS-GO utilizes key parameters such as amino acid sequence, side chain angle, and solvent accessibility, which have a significant influence on protein structure and function. Additionally, PS-GO leverages computable parameters, derived computationally, which are crucial for understanding and predicting protein behavior. The development of PS-GO underscores the potential of parametric protein design in a variety of applications, including enhancing enzyme activity, improving antibody affinity, and designing novel functional proteins. This advancement not only provides a robust theoretical foundation for the field of protein engineering and biotechnology but also offers practical guidelines for future progress in this domain.
Affiliation(s)
- Yanlin Mi
- School of Computer Science and Information Technology, University College Cork, Cork, Ireland
- SFI Centre for Research Training in Artificial Intelligence, University College Cork, Cork, Ireland
| | - Stefan-Bogdan Marcu
- School of Computer Science and Information Technology, University College Cork, Cork, Ireland
| | - Sabin Tabirca
- School of Computer Science and Information Technology, University College Cork, Cork, Ireland
- Faculty of Mathematics and Informatics, Transylvania University of Brasov, Brasov, Romania
| | - Venkata V.B. Yallapragada
- Centre for Advanced Photonics and Process Analytics, Munster Technological University, Cork, Ireland
| |
|
23
|
Mirarchi A, Giorgino T, De Fabritiis G. mdCATH: A Large-Scale MD Dataset for Data-Driven Computational Biophysics. Sci Data 2024; 11:1299. [PMID: 39609442 PMCID: PMC11604666 DOI: 10.1038/s41597-024-04140-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2024] [Accepted: 11/15/2024] [Indexed: 11/30/2024]
Abstract
Recent advancements in protein structure determination are revolutionizing our understanding of proteins. Still, a significant gap remains in the availability of comprehensive datasets that focus on the dynamics of proteins, which are crucial for understanding protein function, folding, and interactions. To address this critical gap, we introduce mdCATH, a dataset generated through an extensive set of all-atom molecular dynamics simulations of a diverse and representative collection of protein domains. This dataset comprises all-atom systems for 5,398 domains, modeled with a state-of-the-art classical force field, and simulated in five replicates each at five temperatures from 320 K to 450 K. The mdCATH dataset records coordinates and forces every 1 ns, for over 62 ms of accumulated simulation time, effectively capturing the dynamics of the various classes of domains and providing a unique resource for proteome-wide statistical analyses of protein unfolding thermodynamics and kinetics. We outline the dataset structure and showcase its potential through four easily reproducible case studies, highlighting its capabilities in advancing protein science.
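The dataset dimensions quoted in this abstract imply a trajectory count and an average trajectory length that are easy to work out. The short calculation below is a back-of-the-envelope sketch: it treats "over 62 ms" as exactly 62 ms, so the resulting mean length is a lower bound, and the variable names are illustrative.

```python
# Figures taken from the abstract.
n_domains = 5398
n_replicates = 5
n_temperatures = 5
total_time_ms = 62  # stated as "over 62 ms", so this is a lower bound

# One trajectory per (domain, temperature, replicate) combination.
n_trajectories = n_domains * n_temperatures * n_replicates

# 1 ms = 1,000,000 ns
total_time_ns = total_time_ms * 1_000_000
mean_length_ns = total_time_ns / n_trajectories

print(f"{n_trajectories} trajectories, mean length >= {mean_length_ns:.0f} ns")
```

With one frame recorded every 1 ns, this corresponds to roughly 134,950 trajectories averaging about 459 frames each, assuming every combination of domain, temperature and replicate was simulated.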
Affiliation(s)
- Antonio Mirarchi
- Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), Carrer Dr. Aiguader 88, Barcelona, 08003, Spain
| | - Toni Giorgino
- Biophysics Institute, National Research Council (CNR-IBF), Via Celoria 26, Milan, 20133, Italy.
| | - Gianni De Fabritiis
- Computational Science Laboratory, Universitat Pompeu Fabra, Barcelona Biomedical Research Park (PRBB), Carrer Dr. Aiguader 88, Barcelona, 08003, Spain.
- Institució Catalana de Recerca i Estudis Avançats (ICREA), Passeig Lluis Companys 23, Barcelona, 08010, Spain.
- Acellera Labs, Doctor Trueta 183, Barcelona, 08005, Spain.
| |
|
24
|
Haque N, Wagenknecht JB, Ratnasinghe BD, Zimmermann MT. Systematic analysis of the relationship between fold-dependent flexibility and artificial intelligence protein structure prediction. PLoS One 2024; 19:e0313308. [PMID: 39591473 PMCID: PMC11594405 DOI: 10.1371/journal.pone.0313308] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2024] [Accepted: 10/23/2024] [Indexed: 11/28/2024]
Abstract
Artificial Intelligence (AI)-based deep learning methods for predicting protein structures are reshaping knowledge development and scientific discovery. Recent large-scale application of AI models for protein structure prediction has changed perceptions about complicated biological problems and empowered a new generation of structure-based hypothesis testing. It is well-recognized that proteins have a modular organization according to archetypal folds. However, it is yet to be determined if predicted structures are tuned to one conformation of flexible proteins or if they represent average conformations, and whether the answer is protein fold-dependent. Therefore, in this study, we analyzed 2878 proteins with at least ten distinct experimental structures available, from which we can estimate protein topological rigidity versus heterogeneity from experimental measurements. We found that AlphaFold v2 (AF2) predictions consistently return one specific form to high accuracy, with 99.68% of distinct folds (n = 623 out of 628) having an experimental structure within 2.5 Å RMSD from a predicted structure. Yet, 27.70% and 10.82% of folds (174 and 68 out of 628 folds) have at least one experimental structure over 2.5 Å and 5 Å RMSD, respectively, from their AI-predicted structure. This information is important for how researchers apply and interpret the output of AF2 and similar tools. Additionally, it enabled us to score fold types according to how homogeneous versus heterogeneous their conformations are. Importantly, folds with high heterogeneity are enriched among proteins which regulate vital biological processes including immune cell differentiation, immune activation, and metabolism. This result demonstrates that a large amount of protein fold flexibility has already been experimentally measured, is vital for critical cellular processes, and is currently unaccounted for in structure prediction databases. Therefore, the structure-prediction revolution begets the protein dynamics revolution!
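The RMSD thresholds quoted in this abstract compare a predicted structure with each experimental structure. A minimal sketch of the RMSD calculation follows, assuming the two coordinate sets are already superposed and matched atom by atom; the abstract does not describe the authors' alignment procedure, and the function name `rmsd` and the toy coordinates are illustrative only.

```python
import math

def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equal-length lists of (x, y, z)
    tuples. Assumes the structures are already superposed (no fitting is done)."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

# Toy coordinates in angstroms (made up for illustration, not real structures).
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
experimental = [(0.0, 0.0, 3.0), (1.0, 0.0, 3.0)]

value = rmsd(predicted, experimental)
print(value)          # 3.0
print(value <= 2.5)   # False: this pair would fall outside the 2.5 angstrom cutoff
```

In practice an optimal superposition (for example the Kabsch algorithm) would be applied before this step; that part is omitted here.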
Affiliation(s)
- Neshatul Haque
- Computational Structural Genomics Unit, Linda T. and John A. Mellowes Center for Genomic Sciences and Precision Medicine, Medical College of Wisconsin, Milwaukee, WI, United States of America
| | - Jessica B. Wagenknecht
- Computational Structural Genomics Unit, Linda T. and John A. Mellowes Center for Genomic Sciences and Precision Medicine, Medical College of Wisconsin, Milwaukee, WI, United States of America
| | - Brian D. Ratnasinghe
- Computational Structural Genomics Unit, Linda T. and John A. Mellowes Center for Genomic Sciences and Precision Medicine, Medical College of Wisconsin, Milwaukee, WI, United States of America
| | - Michael T. Zimmermann
- Computational Structural Genomics Unit, Linda T. and John A. Mellowes Center for Genomic Sciences and Precision Medicine, Medical College of Wisconsin, Milwaukee, WI, United States of America
- Clinical and Translational Sciences Institute, Medical College of Wisconsin, Milwaukee, WI, United States of America
- Department of Biochemistry, Medical College of Wisconsin, Milwaukee, WI, United States of America
| |
|
25
|
Lau AM, Bordin N, Kandathil SM, Sillitoe I, Waman VP, Wells J, Orengo CA, Jones DT. Exploring structural diversity across the protein universe with The Encyclopedia of Domains. Science 2024; 386:eadq4946. [PMID: 39480926 DOI: 10.1126/science.adq4946] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/21/2024] [Accepted: 08/30/2024] [Indexed: 11/02/2024]
Abstract
The AlphaFold Protein Structure Database (AFDB) contains more than 214 million predicted protein structures composed of domains, which are independently folding units found in multiple structural and functional contexts. Identifying domains can enable many functional and evolutionary analyses but has remained challenging because of the sheer scale of the data. Using deep learning methods, we have detected and classified every domain in the AFDB, producing The Encyclopedia of Domains. We detected nearly 365 million domains, over 100 million more than can be found by sequence methods, covering more than 1 million taxa. Reassuringly, 77% of the nonredundant domains are similar to known superfamilies, greatly expanding representation of their domain space. We uncovered more than 10,000 new structural interactions between superfamilies and thousands of new folds across the fold space continuum.
Affiliation(s)
- Andy M Lau
- Department of Computer Science, University College London, London WC1E 6BT, UK
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Shaun M Kandathil
- Department of Computer Science, University College London, London WC1E 6BT, UK
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Vaishali P Waman
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - Jude Wells
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Centre for Artificial Intelligence, University College London, London WC1V 6BH, UK
| | - Christine A Orengo
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| | - David T Jones
- Department of Computer Science, University College London, London WC1E 6BT, UK
- Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
| |
|
26
|
Iovino BG, Tang H, Ye Y. Protein domain embeddings for fast and accurate similarity search. Genome Res 2024; 34:1434-1444. [PMID: 39237301 PMCID: PMC11529836 DOI: 10.1101/gr.279127.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/15/2024] [Accepted: 09/03/2024] [Indexed: 09/07/2024]
Abstract
Recently developed protein language models have enabled a variety of applications with the protein contextual embeddings they produce. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived via averaging the embeddings of individual residues, or applying matrix transformation techniques such as the discrete cosine transformation (DCT) to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches of similar proteins; however, limitations have been found; for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for proteins with single domains but not multidomain proteins. Here, we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the DCT to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut in short) in quadratic time to the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We show such domain-level contextual vectors (termed as DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but with undefined extended regions between shared domains, and those that only share local similarities. In addition, tests on a database search benchmark show that the DCTdomain is able to detect distant homologs by leveraging the structural information in the contextual embeddings.
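The idea of turning variable-length residue embeddings into a fixed-size vector with the DCT can be sketched as follows. This is a simplified illustration, not the DCTdomain implementation: it applies a one-dimensional, unnormalized DCT-II along the residue axis of each embedding dimension and keeps the first few low-frequency coefficients. The function names, the `n_coeffs` parameter and the toy embeddings are assumptions.

```python
import math

def dct2(signal):
    """Unnormalized DCT-II of a 1-D sequence of numbers."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]

def dct_fingerprint(embeddings, n_coeffs=3):
    """Map a domain's per-residue embeddings (L vectors of dimension D) to a
    vector of length D * n_coeffs, independent of the domain length L."""
    if len(embeddings) < n_coeffs:
        raise ValueError("domain is shorter than the number of coefficients kept")
    dim = len(embeddings[0])
    fingerprint = []
    for d in range(dim):
        column = [residue[d] for residue in embeddings]  # one feature along the chain
        fingerprint.extend(dct2(column)[:n_coeffs])      # keep low frequencies only
    return fingerprint

# Two toy "domains" of different lengths with 4-dimensional embeddings (made up).
short_domain = [[0.1 * (i + d) for d in range(4)] for i in range(5)]
long_domain = [[0.05 * (i - d) for d in range(4)] for i in range(9)]

print(len(dct_fingerprint(short_domain)), len(dct_fingerprint(long_domain)))  # 12 12
```

Because both fingerprints have the same length, they can be compared directly (for example with a vector distance), which is what makes fast similarity search possible.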
Affiliation(s)
- Benjamin Giovanni Iovino
- Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA
| | - Haixu Tang
- Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA
| | - Yuzhen Ye
- Luddy School of Informatics, Computing and Engineering, Indiana University, Bloomington, Indiana 47408, USA
| |
|
27
|
Viswanathan R, Carroll M, Roffe A, Fajardo JE, Fiser A. Computational prediction of multiple antigen epitopes. Bioinformatics 2024; 40:btae556. [PMID: 39271143 PMCID: PMC11453099 DOI: 10.1093/bioinformatics/btae556] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2024] [Revised: 08/08/2024] [Accepted: 09/11/2024] [Indexed: 09/15/2024]
Abstract
MOTIVATION: Identifying antigen epitopes is essential in medical applications, such as immunodiagnostic reagent discovery, vaccine design, and drug development. Computational approaches can complement low-throughput, time-consuming, and costly experimental determination of epitopes. Currently available prediction methods, however, have moderate success predicting epitopes, which limits their applicability. Epitope prediction is further complicated by the fact that multiple epitopes may be located on the same antigen and complete experimental data is often unavailable. RESULTS: Here, we introduce the antigen epitope prediction program ISPIPab that combines information from two feature-based methods and a docking-based method. We demonstrate that ISPIPab outperforms each of its individual classifiers as well as other state-of-the-art methods, including those designed specifically for epitope prediction. By combining the prediction algorithm with hierarchical clustering, we show that we can effectively capture epitopes that align with available experimental data while also revealing additional novel targets for future experimental investigations.
Affiliation(s)
- Rajalakshmi Viswanathan
- Department of Chemistry and Biochemistry, Yeshiva College, New York, NY 10033, United States
| | - Moshe Carroll
- Department of Chemistry and Biochemistry, Yeshiva College, New York, NY 10033, United States
| | - Alexandra Roffe
- Department of Chemistry and Biochemistry, Stern College for Women, New York, NY 10016, United States
| | - Jorge E Fajardo
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, United States
| | - Andras Fiser
- Department of Systems and Computational Biology, Albert Einstein College of Medicine, Bronx, NY 10461, United States
| |
|
28
|
Kombo DC, LaMarche MJ, Konkankit CC, Rackovsky S. Application of artificial intelligence and machine learning techniques to the analysis of dynamic protein sequences. Proteins 2024; 92:1234-1241. [PMID: 38808365 PMCID: PMC11511649 DOI: 10.1002/prot.26704] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2024] [Revised: 05/07/2024] [Accepted: 05/13/2024] [Indexed: 05/30/2024]
Abstract
We apply methods of Artificial Intelligence and Machine Learning to protein dynamic bioinformatics. We rewrite the sequences of a large protein data set, containing both folded and intrinsically disordered molecules, using a representation developed previously, which encodes the intrinsic dynamic properties of the naturally occurring amino acids. We Fourier analyze the resulting sequences. It is demonstrated that classification models built using several different supervised learning methods are able to successfully distinguish folded from intrinsically disordered proteins from sequence alone. It is further shown that the most important sequence property for this discrimination is the sequence mobility, which is the sequence averaged value of the residue-specific average alpha carbon B factor. This is in agreement with previous work, in which we have demonstrated the central role played by the sequence mobility in protein dynamic bioinformatics and biophysics. This finding opens a path to the application of dynamic bioinformatics, in combination with machine learning algorithms, to a range of significant biomedical problems.
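The "sequence mobility" defined in this abstract is a plain average over the chain. A minimal sketch, assuming a lookup table that maps each amino acid to its residue-specific average alpha-carbon B factor; the function name and the numbers in `TOY_TABLE` are placeholders invented for illustration, not measured values from the paper.

```python
def sequence_mobility(sequence, residue_b_factor):
    """Average, over all residues in `sequence`, of the residue-specific
    mean alpha-carbon B factor taken from the lookup table `residue_b_factor`."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    values = [residue_b_factor[aa] for aa in sequence]
    return sum(values) / len(values)

# Placeholder table: made-up numbers, NOT real B factors.
TOY_TABLE = {"A": 1.0, "G": 3.0, "W": 0.5}

print(sequence_mobility("AGGW", TOY_TABLE))  # 1.875, i.e. (1.0 + 3.0 + 3.0 + 0.5) / 4
```

A single scalar like this can then serve as one input feature for a supervised classifier of folded versus disordered proteins.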
Affiliation(s)
- David C. Kombo
- Dept. of Medicinal Chemistry, Integrated Drug Discovery, Sanofi 350 Water St., Cambridge, MA 02141
| | - Matthew J. LaMarche
- Dept. of Medicinal Chemistry, Integrated Drug Discovery, Sanofi 350 Water St., Cambridge, MA 02141
| | - Chilaluck C. Konkankit
- Dept. of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, NY 14853
| | - S. Rackovsky
- Dept. of Chemistry and Chemical Biology, Baker Laboratory, Cornell University, Ithaca, NY 14853
| |
|
29
|
Kennedy L, Sajjad M, Herrera MA, Szieber P, Rybacka N, Zhao Y, Steven C, Alghamdi Z, Zlatkov I, Hagen J, Lauder C, Rudolfova N, Abramiuk M, Bolimowska K, Joynt D, Lucero A, Ortiz GP, Lilienkampf A, Hulme AN, Campopiano DJ. Developing deprotectase biocatalysts for synthesis. Faraday Discuss 2024; 252:174-187. [PMID: 38856717 PMCID: PMC11389852 DOI: 10.1039/d4fd00016a] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Organic synthesis often requires multiple steps where a functional group (FG) is concealed from reaction by a protecting group (PG). Common PGs include N-carbobenzyloxy (Cbz or Z) of amines and tert-butyloxycarbonyl (OtBu) of acids. An essential step is the removal of the PG, but this often requires excess reagents and extensive time, and can give a low % yield. An overarching goal of biocatalysis is to use "green" or "enzymatic" methods to catalyse chemical transformations. One under-utilised approach is the use of "deprotectase" biocatalysts to selectively remove PGs from various organic substrates. The advantage of this methodology is the exquisite selectivity of the biocatalyst to only act on its target, leaving other FGs and PGs untouched. A number of deprotectase biocatalysts have been reported but they are not commonly used in mainstream synthetic routes. This study describes the construction of a cascade to deprotect doubly-protected amino acids. The well-known Bacillus BS2 esterase was used to remove the OtBu PG from various amino acid substrates. The more obscure Sphingomonas Cbz-ase (amidohydrolase) was screened with a range of N-Cbz-modified amino acid substrates. We then combined BS2 and Cbz-ase for a one-pot, two-step deprotection of the model substrate Cbz-L-Phe-OtBu to produce the free L-Phe. We also provide some insight into the residues involved in substrate recognition and catalysis using docked ligands in the crystal structure of BS2. Similarly, a structural model of the Cbz-ase identifies a potential di-metal binding site and reveals conserved active site residues. This new biocatalytic cascade should be further explored for its application in chemical synthesis.
Affiliation(s)
- Lisa Kennedy
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Mariyah Sajjad
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Michael A Herrera
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Peter Szieber
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Natasza Rybacka
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Yinan Zhao
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Craig Steven
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Zainab Alghamdi
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Ivan Zlatkov
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Julie Hagen
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Chloe Lauder
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Natalie Rudolfova
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Magdalena Abramiuk
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Karolina Bolimowska
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Daniel Joynt
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Angelica Lucero
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Gustavo Perez Ortiz
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Annamaria Lilienkampf
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Alison N Hulme
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| | - Dominic J Campopiano
- School of Chemistry, University of Edinburgh, David Brewster Road, Edinburgh, EH9 3FJ, UK.
| |
|
30
|
Zhou B, Zheng L, Wu B, Yi K, Zhong B, Tan Y, Liu Q, Liò P, Hong L. A conditional protein diffusion model generates artificial programmable endonuclease sequences with enhanced activity. Cell Discov 2024; 10:95. [PMID: 39251570 PMCID: PMC11385924 DOI: 10.1038/s41421-024-00728-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2024] [Accepted: 08/13/2024] [Indexed: 09/11/2024]
Abstract
Deep learning-based methods for generating functional proteins address the growing need for novel biocatalysts, allowing for precise tailoring of functionalities to meet specific requirements. This advancement leads to the development of highly efficient and specialized proteins with diverse applications across scientific, technological, and biomedical fields. This study establishes a pipeline for protein sequence generation with a conditional protein diffusion model, namely CPDiffusion, to create diverse sequences of proteins with enhanced functions. CPDiffusion accommodates protein-specific conditions, such as secondary structures and highly conserved amino acids. Without relying on extensive training data, CPDiffusion effectively captures highly conserved residues and sequence features for specific protein families. We applied CPDiffusion to generate artificial sequences of Argonaute (Ago) proteins based on the backbone structures of wild-type (WT) Kurthia massiliensis Ago (KmAgo) and Pyrococcus furiosus Ago (PfAgo), which are complex multi-domain programmable endonucleases. The generated sequences deviate by up to nearly 400 amino acids from their WT templates. Experimental tests demonstrated that the majority of the generated proteins for both KmAgo and PfAgo show unambiguous activity in DNA cleavage, with many of them exhibiting superior activity as compared to the WT. These findings underscore CPDiffusion's remarkable success rate in generating novel sequences for proteins with complex structures and functions in a single step, leading to enhanced activity. This approach facilitates the design of enzymes with multi-domain molecular structures and intricate functions through in silico generation and screening, all accomplished without the need for supervision from labeled data.
Affiliation(s)
- Bingxin Zhou
- Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China
- Shanghai National Center for Applied Mathematics (SJTU center), Shanghai Jiao Tong University, Shanghai, China
| | - Lirong Zheng
- Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China.
- Department of Cell and Developmental Biology & Michigan Neuroscience Institute, University of Michigan Medical School, Ann Arbor, MI, USA.
| | - Banghao Wu
- Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
| | - Kai Yi
- School of Mathematics and Statistics, University of New South Wales, Sydney, NSW, Australia
| | - Bozitao Zhong
- Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China
| | - Yang Tan
- Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China
| | - Qian Liu
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
| | - Pietro Liò
- Department of Computer Science and Technology, University of Cambridge, Cambridge, UK.
| | - Liang Hong
- Institute of Natural Sciences, Shanghai Jiao Tong University, Shanghai, China.
- Shanghai National Center for Applied Mathematics (SJTU center), Shanghai Jiao Tong University, Shanghai, China.
- Zhangjiang Institute for Advanced Study, Shanghai Jiao Tong University, Shanghai, China.
- Shanghai Artificial Intelligence Laboratory, Shanghai, China.
| |
|
31
|
Kutlu Y, Axel G, Kolodny R, Ben-Tal N, Haliloglu T. Reused Protein Segments Linked to Functional Dynamics. Mol Biol Evol 2024; 41:msae184. [PMID: 39226145 PMCID: PMC11412252 DOI: 10.1093/molbev/msae184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/22/2024] [Revised: 08/10/2024] [Accepted: 08/26/2024] [Indexed: 09/05/2024]
Abstract
Protein space is characterized by extensive recurrence, or "reuse," of parts, suggesting that new proteins and domains can evolve by mixing-and-matching of existing segments. From an evolutionary perspective, for a given combination to persist, the protein segments should presumably not only match geometrically but also dynamically communicate with each other to allow concerted motions that are key to function. Evidence from protein space supports the premise that domains indeed combine in this manner; we explore whether a similar phenomenon can be observed at the sub-domain level. To this end, we use Gaussian Network Models (GNMs) to calculate the so-called soft modes, or low-frequency modes of motion for a dataset of 150 protein domains. Modes of motion can be used to decompose a domain into segments of consecutive amino acids that we call "dynamic elements", each of which belongs to one of two parts that move in opposite senses. We find that, in many cases, the dynamic elements, detected based on GNM analysis, correspond to established "themes": Sub-domain-level segments that have been shown to recur in protein space, and which were detected in previous research using sequence similarity alone (i.e. completely independently of the GNM analysis). This statistically significant correlation hints at the importance of dynamics in evolution. Overall, the results are consistent with an evolutionary scenario where proteins have emerged from themes that need to match each other both geometrically and dynamically, e.g. to facilitate allosteric regulation.
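The decomposition into "dynamic elements" described in this abstract can be illustrated with a short sketch: given one mode of motion as a list of per-residue components, consecutive residues whose components share a sign are grouped into one segment. The mode vector below is hypothetical; in a real GNM analysis it would come from the eigen-decomposition of the network's Kirchhoff matrix, which is not shown. How the authors treat residues near zero (hinges) or combine several modes is not stated in the abstract, so treating zero as positive is an assumption.

```python
def dynamic_elements(mode):
    """Split residue indices into runs of consecutive residues whose mode
    components have the same sign. Returns (start, end_inclusive, sign) tuples.
    A component of exactly zero is counted as positive (an arbitrary choice)."""
    segments = []
    start = 0
    for i in range(1, len(mode) + 1):
        # Close the current run at the end of the chain or at a sign change.
        if i == len(mode) or (mode[i] >= 0) != (mode[start] >= 0):
            segments.append((start, i - 1, 1 if mode[start] >= 0 else -1))
            start = i
    return segments

# Hypothetical slow-mode components for a 5-residue toy chain.
toy_mode = [0.3, 0.2, -0.1, -0.4, 0.5]
print(dynamic_elements(toy_mode))  # [(0, 1, 1), (2, 3, -1), (4, 4, 1)]
```

Segments with sign +1 and segments with sign -1 correspond to the two parts of the domain that move in opposite senses in that mode.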
Affiliation(s)
- Yiğit Kutlu
- Department of Chemical Engineering and Polymer Research Center, Bogazici University, Istanbul, Turkey
| | - Gabriel Axel
- School of Neurobiology, Biochemistry & Biophysics, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
| | - Rachel Kolodny
- Department of Computer Science, University of Haifa, Haifa, Israel
| | - Nir Ben-Tal
- School of Neurobiology, Biochemistry & Biophysics, George S. Wise Faculty of Life Sciences, Tel Aviv University, Tel Aviv, Israel
| | - Turkan Haliloglu
- Department of Chemical Engineering and Polymer Research Center, Bogazici University, Istanbul, Turkey
| |
|
32
|
Tanoz I, Timsit Y. Protein Fold Usages in Ribosomes: Another Glance to the Past. Int J Mol Sci 2024; 25:8806. [PMID: 39201491 PMCID: PMC11354259 DOI: 10.3390/ijms25168806] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/19/2024] [Revised: 08/07/2024] [Accepted: 08/08/2024] [Indexed: 09/02/2024]
Abstract
The analysis of protein fold usage, similar to codon usage, offers profound insights into the evolution of biological systems and the origins of modern proteomes. While previous studies have examined fold distribution in modern genomes, our study focuses on the comparative distribution and usage of protein folds in ribosomes across bacteria, archaea, and eukaryotes. We identify the prevalence of certain 'super-ribosome folds,' such as the OB fold in bacteria and the SH3 domain in archaea and eukaryotes. The observed protein fold distribution in the ribosomes announces the future power-law distribution where only a few folds are highly prevalent, and most are rare. Additionally, we highlight the presence of three copies of proto-Rossmann folds in ribosomes across all kingdoms, showing its ancient and fundamental role in ribosomal structure and function. Our study also explores early mechanisms of molecular convergence, where different protein folds bind equivalent ribosomal RNA structures in ribosomes across different kingdoms. This comparative analysis enhances our understanding of ribosomal evolution, particularly the distinct evolutionary paths of the large and small subunits, and underscores the complex interplay between RNA and protein components in the transition from the RNA world to modern cellular life. Transcending the concept of folds also makes it possible to group a large number of ribosomal proteins into five categories of urfolds or metafolds, which could attest to their ancestral character and common origins. This work also demonstrates that the gradual acquisition of extensions by simple but ordered folds constitutes an inexorable evolutionary mechanism. This observation supports the idea that simple but structured ribosomal proteins preceded the development of their disordered extensions.
Affiliation(s)
- Inzhu Tanoz
- Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, 13288 Marseille, France
| | - Youri Timsit
- Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, 13288 Marseille, France
- Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara GOSEE, 3 Rue Michel-Ange, 75016 Paris, France
| |
|
33
|
Carroll M, Rosenbaum E, Viswanathan R. Computational Methods to Predict Conformational B-Cell Epitopes. Biomolecules 2024; 14:983. [PMID: 39199371 PMCID: PMC11352882 DOI: 10.3390/biom14080983] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2024] [Revised: 08/04/2024] [Accepted: 08/08/2024] [Indexed: 09/01/2024]
Abstract
Accurate computational prediction of B-cell epitopes can greatly enhance biomedical research and rapidly advance efforts to develop therapeutics, monoclonal antibodies, vaccines, and immunodiagnostic reagents. Previous research efforts have primarily focused on the development of computational methods to predict linear epitopes rather than conformational epitopes; however, the latter is much more biologically predominant. Several conformational B-cell epitope prediction methods have recently been published, but their predictive performances are weak. Here, we present a review of the latest computational methods and assess their performances on a diverse test set of 29 non-redundant unbound antigen structures. Our results demonstrate that ISPIPab performs better than most methods and compares favorably with other recent antigen-specific methods. Finally, we suggest new strategies and opportunities to improve computational predictions of conformational B-cell epitopes.
Affiliation(s)
| | | | - R. Viswanathan
- Department of Chemistry and Biochemistry, Yeshiva College, Yeshiva University, New York, NY 10033, USA; (M.C.); (E.R.)
| |
|
34
|
Viswanathan R, Carroll M, Roffe A, Fajardo JE, Fiser A. Computational Prediction of Multiple Antigen Epitopes. bioRxiv 2024:2024.08.08.607232. [PMID: 39211281 PMCID: PMC11360938 DOI: 10.1101/2024.08.08.607232] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 09/04/2024]
Abstract
Motivation: Identifying antigen epitopes is essential in medical applications, such as immunodiagnostic reagent discovery, vaccine design, and drug development. Computational approaches can complement low-throughput, time-consuming, and costly experimental determination of epitopes. Currently available prediction methods, however, have moderate success predicting epitopes, which limits their applicability. Epitope prediction is further complicated by the fact that multiple epitopes may be located on the same antigen and complete experimental data is often unavailable. Results: Here, we introduce the antigen epitope prediction program ISPIPab that combines information from two feature-based methods and a docking-based method. We demonstrate that ISPIPab outperforms each of its individual classifiers as well as other state-of-the-art methods, including those designed specifically for epitope prediction. By combining the prediction algorithm with hierarchical clustering, we show that we can effectively capture epitopes that align with available experimental data while also revealing additional novel targets for future experimental investigations. Contact: raji@yu.edu. Supplementary information: Supplementary data are available at Bioinformatics online.
|
35
|
Geist JL, Lee CY, Strom JM, de Jesús Naveja J, Luck K. Generation of a high confidence set of domain-domain interface types to guide protein complex structure predictions by AlphaFold. Bioinformatics 2024; 40:btae482. [PMID: 39171834 PMCID: PMC11361816 DOI: 10.1093/bioinformatics/btae482] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2024] [Revised: 07/10/2024] [Accepted: 08/20/2024] [Indexed: 08/23/2024]
Abstract
MOTIVATION: While the release of AlphaFold (AF) represented a breakthrough for the prediction of protein complex structures, its sensitivity, especially when using full length protein sequences, still remains limited. Modeling success rates might increase if AF predictions were guided by likely interacting protein fragments. This approach requires available sets of highly confident protein-protein interface types. Computational resources, such as 3did, infer interacting globular domain types from observed contacts in protein structures. Assessing the accuracy of these predicted interface types is difficult because we lack hand-curated reference sets of verified domain-domain interface (DDI) types. RESULTS: To improve protein complex modeling of DDIs by AF, we manually inspected 80 randomly selected DDI types from the 3did resource to generate a first reference set of DDI types. Identified cases of DDI type nonapproval (40%) primarily resulted from inaccurate Pfam domain matches, crystal contacts, and synthetic protein constructs. Using logistic regression, we predicted a subset of 2411 out of 5724 considered DDI types in 3did to be of high confidence, which we subsequently applied to 53 000 human-protein interactions to predict DDIs followed by AF modeling. We obtained highly confident AF models for 604 out of 1129 predicted DDIs. Of note, for 47% of them no confident AF structural model could be obtained using full length protein sequences. AVAILABILITY AND IMPLEMENTATION: Code is available at https://github.com/KatjaLuckLab/DDI_manuscript.
Affiliation(s)
- Chop Yan Lee
  - Institute of Molecular Biology (IMB) gGmbH, Mainz 55128, Germany
- José de Jesús Naveja
  - Institute of Molecular Biology (IMB) gGmbH, Mainz 55128, Germany
  - 3rd Medical Department, University Medical Center, Johannes Gutenberg University Mainz, Mainz 55131, Germany
  - University Cancer Center, University Medical Center, Johannes Gutenberg University Mainz, Mainz 55131, Germany
- Katja Luck
  - Institute of Molecular Biology (IMB) gGmbH, Mainz 55128, Germany
|
36
|
Murata H, Toko K, Chikenji G. Protein superfolds are characterised as frustration-free topologies: A case study of pure parallel β-sheet topologies. PLoS Comput Biol 2024; 20:e1012282. [PMID: 39110764 PMCID: PMC11333010 DOI: 10.1371/journal.pcbi.1012282] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/28/2023] [Revised: 08/19/2024] [Accepted: 06/26/2024] [Indexed: 08/21/2024] Open
Abstract
A protein superfold is a type of protein fold that is observed in at least three distinct, non-homologous protein families. Structural classification studies have revealed a limited number of prevalent superfolds alongside several infrequently occurring folds, and in α/β type superfolds, the C-terminal β-strand tends to favor the edge of the β-sheet, while the N-terminal β-strand is often found in the middle. The reasons behind these observations, whether they are due to evolutionary sampling bias or physical interactions, remain unclear. This article offers a physics-based explanation for these observations, specifically for pure parallel β-sheet topologies. Our investigation is grounded in several established structural rules that are based on physical interactions. We have identified "frustration-free topologies", which are topologies that can satisfy all the rules simultaneously. In contrast, topologies that cannot are termed "frustrated topologies." Our findings reveal that frustration-free topologies represent only a fraction of all theoretically possible patterns, that these topologies strongly favor positioning the C-terminal β-strand at the edge of the β-sheet and the N-terminal β-strand in the middle, and that there is significant overlap between frustration-free topologies and superfolds. We also used a lattice protein model to thoroughly investigate sequence-structure relationships. Our results show that frustration-free structures are highly designable, while frustrated structures are poorly designable. These findings suggest that superfolds are highly designable due to their lack of frustration, and the preference for positioning C-terminal β-strands at the edge of the β-sheet is a direct result of frustration-free topologies. These insights not only enhance our understanding of sequence-structure relationships but also have significant implications for de novo protein design.
Affiliation(s)
- Hiroto Murata
  - Department of Applied Physics, Nagoya University, Nagoya, Aichi, Japan
- Kazuma Toko
  - Department of Applied Physics, Nagoya University, Nagoya, Aichi, Japan
- George Chikenji
  - Department of Applied Physics, Nagoya University, Nagoya, Aichi, Japan
|
37
|
Ahdritz G, Bouatta N, Floristean C, Kadyan S, Xia Q, Gerecke W, O'Donnell TJ, Berenberg D, Fisk I, Zanichelli N, Zhang B, Nowaczynski A, Wang B, Stepniewska-Dziubinska MM, Zhang S, Ojewole A, Guney ME, Biderman S, Watkins AM, Ra S, Lorenzo PR, Nivon L, Weitzner B, Ban YEA, Chen S, Zhang M, Li C, Song SL, He Y, Sorger PK, Mostaque E, Zhang Z, Bonneau R, AlQuraishi M. OpenFold: retraining AlphaFold2 yields new insights into its learning mechanisms and capacity for generalization. Nat Methods 2024; 21:1514-1524. [PMID: 38744917 PMCID: PMC11645889 DOI: 10.1038/s41592-024-02272-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Accepted: 04/03/2024] [Indexed: 05/16/2024]
Abstract
AlphaFold2 revolutionized structural biology with the ability to predict protein structures with exceptionally high accuracy. Its implementation, however, lacks the code and data required to train new models. These are necessary to (1) tackle new tasks, like protein-ligand complex structure prediction, (2) investigate the process by which the model learns and (3) assess the model's capacity to generalize to unseen regions of fold space. Here we report OpenFold, a fast, memory efficient and trainable implementation of AlphaFold2. We train OpenFold from scratch, matching the accuracy of AlphaFold2. Having established parity, we find that OpenFold is remarkably robust at generalizing even when the size and diversity of its training set is deliberately limited, including near-complete elisions of classes of secondary structure elements. By analyzing intermediate structures produced during training, we also gain insights into the hierarchical manner in which OpenFold learns to fold. In sum, our studies demonstrate the power and utility of OpenFold, which we believe will prove to be a crucial resource for the protein modeling community.
Affiliation(s)
- Gustaf Ahdritz
  - Department of Systems Biology, Columbia University, New York, NY, USA
  - Harvard University, Cambridge, MA, USA
- Nazim Bouatta
  - Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA.
- Sachin Kadyan
  - Department of Systems Biology, Columbia University, New York, NY, USA
- Qinghui Xia
  - Department of Systems Biology, Columbia University, New York, NY, USA
- William Gerecke
  - Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA
- Daniel Berenberg
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
- Ian Fisk
  - Flatiron Institute, New York, NY, USA
- Bo Zhang
  - Scientific Computing and Imaging Institute, University of Utah, Salt Lake City, UT, USA
- Stella Biderman
  - EleutherAI, New York, NY, USA
  - Booz Allen Hamilton, McLean, VA, USA
- Stephen Ra
  - Prescient Design, Genentech, New York, NY, USA
- Minjia Zhang
  - University of Illinois at Urbana-Champaign, Champaign, IL, USA
- Peter K Sorger
  - Laboratory of Systems Pharmacology, Harvard Medical School, Boston, MA, USA
- Zhao Zhang
  - Rutgers University, New Brunswick, NJ, USA
|
38
|
Cuturello F, Celoria M, Ansuini A, Cazzaniga A. Enhancing predictions of protein stability changes induced by single mutations using MSA-based Language Models. Bioinformatics 2024; 40:btae447. [PMID: 39012369 PMCID: PMC11269464 DOI: 10.1093/bioinformatics/btae447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2024] [Revised: 06/19/2024] [Accepted: 07/10/2024] [Indexed: 07/17/2024] Open
Abstract
MOTIVATION: Protein Language Models offer a new perspective for addressing challenges in structural biology, while relying solely on sequence information. Recent studies have investigated their effectiveness in forecasting shifts in thermodynamic stability caused by single amino acid mutations, a task known for its complexity due to the sparse availability of data, constrained by experimental limitations. To tackle this problem, we introduce two key novelties: leveraging a Protein Language Model that incorporates Multiple Sequence Alignments to capture evolutionary information, and using a recently released mega-scale dataset with rigorous data pre-processing to mitigate overfitting. RESULTS: We ensure comprehensive comparisons by fine-tuning various pre-trained models, taking advantage of analyses such as ablation studies and baselines evaluation. Our methodology introduces a stringent policy to reduce the widespread issue of data leakage, rigorously removing sequences from the training set when they exhibit significant similarity with the test set. The MSA Transformer emerges as the most accurate among the models under investigation, given its capability to leverage co-evolution signals encoded in aligned homologous sequences. Moreover, the optimized MSA Transformer outperforms existing methods and exhibits enhanced generalization power, leading to a notable improvement in predicting changes in protein stability resulting from point mutations. AVAILABILITY AND IMPLEMENTATION: Code and data at https://github.com/RitAreaSciencePark/PLM4Muts. SUPPLEMENTARY INFORMATION: Supplementary Information is available at Bioinformatics online.
Affiliation(s)
- Francesca Cuturello
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
- Marco Celoria
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
  - HPC Department, CINECA National Supercomputing Center, Bologna 40033, Italy
- Alessio Ansuini
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
- Alberto Cazzaniga
  - Research and Technology Institute, AREA Science Park, Trieste 34149, Italy
|
39
|
Caetano-Anollés G. Are Viruses Taxonomic Units? A Protein Domain and Loop-Centric Phylogenomic Assessment. Viruses 2024; 16:1061. [PMID: 39066224 PMCID: PMC11281659 DOI: 10.3390/v16071061] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2024] [Revised: 06/26/2024] [Accepted: 06/27/2024] [Indexed: 07/28/2024] Open
Abstract
Virus taxonomy uses a Linnaean-like subsumption hierarchy to classify viruses into taxonomic units at species and higher rank levels. Virus species are considered monophyletic groups of mobile genetic elements (MGEs) often delimited by the phylogenetic analysis of aligned genomic or metagenomic sequences. Taxonomic units are assumed to be independent organizational, functional and evolutionary units that follow a 'natural history' rationale. Here, I use phylogenomic and other arguments to show that viruses are not self-standing genetically-driven systems acting as evolutionary units. Instead, they are crucial components of holobionts, which are units of biological organization that dynamically integrate the genetic, epigenetic, physiological and functional properties of their co-evolving members. Remarkably, phylogenomic analyses show that viruses share protein domains and loops with cells throughout history via massive processes of reticulate evolution, helping spread evolutionary innovations across a wider taxonomic spectrum. Thus, viruses are not merely MGEs or microbes. Instead, their genomes and proteomes conduct cellularly integrated processes akin to those cataloged by the GO Consortium. This prompts the generation of compositional hierarchies that replace the 'is-a-kind-of' by an 'is-a-part-of' logic to better describe the mereology of integrated cellular and viral makeup. My analysis demands a new paradigm that integrates virus taxonomy into a modern evolutionarily centered taxonomy of organisms.
Affiliation(s)
- Gustavo Caetano-Anollés
- Evolutionary Bioinformatics Laboratory, Department of Crop Sciences, C. R. Woese Institute for Genomic Biology, University of Illinois, Urbana, IL 61801, USA
|
40
|
Wankowicz SA, Ravikumar A, Sharma S, Riley B, Raju A, Hogan DW, Flowers J, van den Bedem H, Keedy DA, Fraser JS. Automated multiconformer model building for X-ray crystallography and cryo-EM. eLife 2024; 12:RP90606. [PMID: 38904665 PMCID: PMC11192534 DOI: 10.7554/elife.90606] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/22/2024] Open
Abstract
In their folded state, biomolecules exchange between multiple conformational states that are crucial for their function. Traditional structural biology methods, such as X-ray crystallography and cryogenic electron microscopy (cryo-EM), produce density maps that are ensemble averages, reflecting molecules in various conformations. Yet, most models derived from these maps explicitly represent only a single conformation, overlooking the complexity of biomolecular structures. To accurately reflect the diversity of biomolecular forms, there is a pressing need to shift toward modeling structural ensembles that mirror the experimental data. However, the challenge of distinguishing signal from noise complicates manual efforts to create these models. In response, we introduce the latest enhancements to qFit, an automated computational strategy designed to incorporate protein conformational heterogeneity into models built into density maps. These algorithmic improvements in qFit are substantiated by superior Rfree and geometry metrics across a wide range of proteins. Importantly, unlike more complex multicopy ensemble models, the multiconformer models produced by qFit can be manually modified in most major model building software (e.g., Coot) and fit can be further improved by refinement using standard pipelines (e.g., Phenix, Refmac, Buster). By reducing the barrier of creating multiconformer models, qFit can foster the development of new hypotheses about the relationship between macromolecular conformational dynamics and function.
Affiliation(s)
- Stephanie A Wankowicz
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, United States
- Ashraya Ravikumar
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, United States
- Shivani Sharma
  - Structural Biology Initiative, CUNY Advanced Science Research Center, New York, United States
  - Ph.D. Program in Biology, The Graduate Center, City University of New York, New York, United States
- Blake Riley
  - Structural Biology Initiative, CUNY Advanced Science Research Center, New York, United States
- Akshay Raju
  - Structural Biology Initiative, CUNY Advanced Science Research Center, New York, United States
- Daniel W Hogan
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, United States
- Jessica Flowers
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, United States
- Henry van den Bedem
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, United States
  - Atomwise Inc, San Francisco, United States
- Daniel A Keedy
  - Structural Biology Initiative, CUNY Advanced Science Research Center, New York, United States
  - Department of Chemistry and Biochemistry, City College of New York, New York, United States
  - Ph.D. Programs in Biochemistry, Biology and Chemistry, The Graduate Center, City University of New York, New York, United States
- James S Fraser
  - Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, United States
|
41
|
Waman VP, Ashford P, Lam SD, Sen N, Abbasian M, Woodridge L, Goldtzvik Y, Bordin N, Wu J, Sillitoe I, Orengo CA. Predicting human and viral protein variants affecting COVID-19 susceptibility and repurposing therapeutics. Sci Rep 2024; 14:14208. [PMID: 38902252 PMCID: PMC11190248 DOI: 10.1038/s41598-024-61541-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/07/2023] [Accepted: 05/07/2024] [Indexed: 06/22/2024] Open
Abstract
The COVID-19 disease is an ongoing global health concern. Although vaccination provides some protection, people are still susceptible to re-infection. Ostensibly, certain populations or clinical groups may be more vulnerable. Factors causing these differences are unclear and whilst socioeconomic and cultural differences are likely to be important, human genetic factors could influence susceptibility. Experimental studies indicate SARS-CoV-2 uses innate immune suppression as a strategy to speed up entry and replication into the host cell. Therefore, it is necessary to understand the impact of variants in immunity-associated human proteins on susceptibility to COVID-19. In this work, we analysed missense coding variants in several SARS-CoV-2 proteins and their human protein interactors that could enhance binding affinity to SARS-CoV-2. We curated a dataset of 19 SARS-CoV-2:human protein 3D complexes, from the experimentally determined structures in the Protein Data Bank and models built using AlphaFold2-multimer, and analysed the impact of missense variants occurring in the protein-protein interface region. We analysed 468 missense variants from human proteins and 212 variants from SARS-CoV-2 proteins and computationally predicted their impacts on binding affinities for the human-viral protein complexes. We predicted a total of 26 affinity-enhancing variants from 13 human proteins implicated in increased binding affinity to SARS-CoV-2. These include key immunity-associated genes (TOMM70, ISG15, IFIH1, IFIT2, RPS3, PALS1, NUP98, AXL, ARF6, TRIMM, TRIM25) as well as important spike receptors (KREMEN1, AXL and ACE2). We report both common (e.g., Y13N in IFIH1) and rare variants in these proteins and discuss their likely structural and functional impact, using information on known and predicted functional sites. Potential mechanisms associated with immune suppression implicated by these variants are discussed. Occurrence of certain predicted affinity-enhancing variants should be monitored as they could lead to increased susceptibility and reduced immune response to SARS-CoV-2 infection in individuals/populations carrying them. Our analyses aid in understanding the potential impact of genetic variation in immunity-associated proteins on COVID-19 susceptibility and help guide drug-repurposing strategies.
Affiliation(s)
- Vaishali P Waman
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Paul Ashford
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Su Datt Lam
  - Department of Applied Physics, Faculty of Science and Technology, Universiti Kebangsaan Malaysia, Bangi, Malaysia
- Neeladri Sen
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Mahnaz Abbasian
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Laurel Woodridge
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Yonathan Goldtzvik
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Jiaxin Wu
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK
- Christine A Orengo
  - Institute of Structural and Molecular Biology, University College London, London, WC1E 6BT, UK.
|
42
|
Hamamsy T, Morton JT, Blackwell R, Berenberg D, Carriero N, Gligorijevic V, Strauss CEM, Leman JK, Cho K, Bonneau R. Protein remote homology detection and structural alignment using deep learning. Nat Biotechnol 2024; 42:975-985. [PMID: 37679542 PMCID: PMC11180608 DOI: 10.1038/s41587-023-01917-2] [Citation(s) in RCA: 29] [Impact Index Per Article: 29.0] [Received: 09/07/2022] [Accepted: 07/26/2023] [Indexed: 09/09/2023]
Abstract
Exploiting sequence-structure-function relationships in biotechnology requires improved methods for aligning proteins that have low sequence similarity to previously annotated proteins. We develop two deep learning methods to address this gap, TM-Vec and DeepBLAST. TM-Vec allows searching for structure-structure similarities in large sequence databases. It is trained to accurately predict TM-scores as a metric of structural similarity directly from sequence pairs without the need for intermediate computation or solution of structures. Once structurally similar proteins have been identified, DeepBLAST can structurally align proteins using only sequence information by identifying structurally homologous regions between proteins. It outperforms traditional sequence alignment methods and performs similarly to structure-based alignment methods. We show the merits of TM-Vec and DeepBLAST on a variety of datasets, including better identification of remotely homologous proteins compared with state-of-the-art sequence alignment and structure prediction methods.
Grants
- R35GM122515 National Science Foundation (NSF)
- IOS-1546218 National Science Foundation (NSF)
- R35 GM122515 NIGMS NIH HHS
- R01 DK103358 NIDDK NIH HHS
- CBET- 1728858 National Science Foundation (NSF)
- R01 AI130945 NIAID NIH HHS
- This research was supported by NIH R01DK103358, the Simons Foundation, NSF- IOS-1546218, R35GM122515, NSF CBET- 1728858, NIH R01AI130945, to T.H. This research was supported by the intramural research program of the Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD) to J.T.M. This research was supported by the Flatiron Institute as part of the Simons Foundation to Robert Blackwell, J.K.L., and N.C. This research was supported by Los Alamos National Lab to C.S. This research was supported by the Samsung Advanced Institute of Technology (Next Generation Deep Learning: from pattern recognition to AI), Samsung Research (Improving Deep Learning using Latent Structure), and NSF Award 1922658 to K.C.
- Simons Foundation
- U.S. Department of Health & Human Services | NIH | Eunice Kennedy Shriver National Institute of Child Health and Human Development (NICHD)
Affiliation(s)
- Tymor Hamamsy
  - Center for Data Science, New York University, New York, NY, USA
- James T Morton
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
  - Biostatistics and Bioinformatics Branch, Eunice Kennedy Shriver National Institute of Child Health and Human Development, National Institutes of Health, Bethesda, MD, USA
- Robert Blackwell
  - Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Daniel Berenberg
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA
  - Prescient Design, New York, NY, USA
- Nicholas Carriero
  - Scientific Computing Core, Flatiron Institute, Simons Foundation, New York, NY, USA
- Julia Koehler Leman
  - Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA
- Kyunghyun Cho
  - Center for Data Science, New York University, New York, NY, USA.
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.
  - Prescient Design, New York, NY, USA.
  - CIFAR, Toronto, Ontario, Canada.
- Richard Bonneau
  - Center for Data Science, New York University, New York, NY, USA.
  - Department of Computer Science, Courant Institute of Mathematical Sciences, New York University, New York, NY, USA.
  - Prescient Design, New York, NY, USA.
  - Department of Biology, New York University, New York, NY, USA.
|
43
|
Barco RA, Merino N, Lam B, Budnik B, Kaplan M, Wu F, Amend JP, Nealson KH, Emerson D. Comparative proteomics of a versatile, marine, iron-oxidizing chemolithoautotroph. Environ Microbiol 2024; 26:e16632. [PMID: 38861374 DOI: 10.1111/1462-2920.16632] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/08/2024] [Accepted: 04/20/2024] [Indexed: 06/13/2024]
Abstract
This study conducted a comparative proteomic analysis to identify potential genetic markers for the biological function of chemolithoautotrophic iron oxidation in the marine bacterium Ghiorsea bivora. To date, this is the only characterized species in the class Zetaproteobacteria that is not an obligate iron-oxidizer, providing a unique opportunity to investigate differential protein expression to identify key genes involved in iron oxidation at circumneutral pH. Over 1000 proteins were identified under both iron- and hydrogen-oxidizing conditions, with differentially expressed proteins found in both treatments. Notably, a gene cluster upregulated during iron oxidation was identified. This cluster contains genes encoding cytochromes that share sequence similarity with the known iron-oxidase, Cyc2. Interestingly, these cytochromes, conserved in both Bacteria and Archaea, do not exhibit the typical β-barrel structure of Cyc2. This cluster potentially encodes a biological nanowire-like transmembrane complex containing multiple redox proteins spanning the inner membrane, periplasm, outer membrane, and extracellular space. The upregulation of key genes associated with this complex during iron-oxidizing conditions was confirmed by quantitative reverse transcription-PCR. These findings were further supported by electromicrobiological methods, which demonstrated negative current production by G. bivora in a three-electrode system poised at a cathodic potential. This research provides significant insights into the biological function of chemolithoautotrophic iron oxidation.
Affiliation(s)
- Roman A Barco
  - Department of Earth Sciences, University of Southern California, Los Angeles, California, USA
  - Department of Biological Sciences, University of Southern California, Los Angeles, California, USA
  - Bigelow Laboratory for Ocean Sciences, East Boothbay, Maine, USA
- N Merino
  - Department of Earth Sciences, University of Southern California, Los Angeles, California, USA
  - Earth-Life Science Institute, Tokyo Institute of Technology, Tokyo, Japan
  - Lawrence Livermore National Lab, Biosciences and Biotechnology Division, Livermore, California, USA
- B Lam
  - Department of Biological Sciences, University of Southern California, Los Angeles, California, USA
- B Budnik
  - Mass Spectrometry and Proteomics Resource Laboratory, Harvard University, Cambridge, Massachusetts, USA
- M Kaplan
  - Department of Microbiology, University of Chicago, Chicago, Illinois, USA
- F Wu
  - ZJU-Hangzhou Global Scientific and Technological Innovation Center, Hangzhou, Zhejiang, China
- J P Amend
  - Department of Earth Sciences, University of Southern California, Los Angeles, California, USA
  - Department of Biological Sciences, University of Southern California, Los Angeles, California, USA
- K H Nealson
  - Department of Earth Sciences, University of Southern California, Los Angeles, California, USA
  - Department of Biological Sciences, University of Southern California, Los Angeles, California, USA
- D Emerson
  - Bigelow Laboratory for Ocean Sciences, East Boothbay, Maine, USA
|
44
|
Aguirre-Sampieri S, Casañal A, Emsley P, Garza-Ramos G. Cryo-EM structure of bacterial nitrilase reveals insight into oligomerization, substrate recognition, and catalysis. J Struct Biol 2024; 216:108093. [PMID: 38615726 PMCID: PMC7616060 DOI: 10.1016/j.jsb.2024.108093] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/17/2024] [Revised: 03/26/2024] [Accepted: 04/12/2024] [Indexed: 04/16/2024]
Abstract
Many enzymes can self-assemble into higher-order structures with helical symmetry. A particularly noteworthy example is that of nitrilases, enzymes in which oligomerization of dimers into spiral homo-oligomers is a requirement for their enzymatic function. Nitrilases are widespread in nature where they catalyze the hydrolysis of nitriles into the corresponding carboxylic acid and ammonia. Here, we present the cryo-EM structure, at 3 Å resolution, of a C-terminally truncated nitrilase from Rhodococcus sp. V51B that assembles in helical filaments. The model comprises a complete turn of the helical arrangement with a substrate-intermediate bound to the catalytic cysteine. The structure was solved after adding the substrate to the protein. The length and stability of the filaments increased in the presence of the aromatic substrate benzonitrile, but not in the presence of aliphatic nitriles or dinitriles. The overall structure maintains the topology of the nitrilase family, and the filament is formed by the association of dimers in a chain-like mechanism that stabilizes the spiral. The active site is completely buried inside each monomer, while the substrate binding pocket was observed within the oligomerization interfaces. The present structure is in a closed configuration, judging by the position of the lid, suggesting that the intermediate is one of the covalent adducts. The proximity of the active site to the dimerization and oligomerization interfaces allows the dimer to sense structural changes once benzonitrile is bound and to transmit them to the rest of the filament, stabilizing the helical structure.
Affiliation(s)
- Sergio Aguirre-Sampieri
- Universidad Nacional Autónoma de México, Facultad de Medicina, Departamento de Bioquímica, Circuito Escolar S/N, Ciudad Universitaria, CDMX, Mexico
| | - Ana Casañal
- Human Technopole, Palazzo Italia, Viale Rita Levi‑Montalcini, 1, 20157 Milan, Italy
| | - Paul Emsley
- MRC Laboratory of Molecular Biology, Structural Studies Division, Francis Crick Avenue, CB2 0QH Cambridge, England
| | - Georgina Garza-Ramos
- Universidad Nacional Autónoma de México, Facultad de Medicina, Departamento de Bioquímica, Circuito Escolar S/N, Ciudad Universitaria, CDMX, Mexico.
| |
|
45
|
Zheng M, Sun G, Li X, Fan Y. EGPDI: identifying protein-DNA binding sites based on multi-view graph embedding fusion. Brief Bioinform 2024; 25:bbae330. [PMID: 38975896 PMCID: PMC11229037 DOI: 10.1093/bib/bbae330] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2024] [Revised: 06/08/2024] [Accepted: 06/26/2024] [Indexed: 07/09/2024] Open
Abstract
Mechanisms of protein-DNA interactions are involved in a wide range of biological activities and processes. Accurately identifying binding sites between proteins and DNA is crucial for analyzing genetic material, exploring protein functions, and designing novel drugs. In recent years, several computational methods have been proposed as alternatives to time-consuming and expensive traditional experiments. However, accurately predicting protein-DNA binding sites remains a challenge. Existing computational methods often rely on handcrafted features and a single-model architecture, leaving room for improvement. We propose a novel computational method, called EGPDI, based on multi-view graph embedding fusion. This approach integrates Equivariant Graph Neural Networks (EGNN) and Graph Convolutional Networks II (GCNII), independently configured to thoroughly mine the global and local node embedding representations. An advanced gated multi-head attention mechanism is subsequently employed to capture the attention weights of the dual embedding representations, thereby facilitating the integration of node features. In addition, extra node features from protein language models are introduced to provide more structural information. To our knowledge, this is the first time that multi-view graph embedding fusion has been applied to the task of protein-DNA binding site prediction. The results of five-fold cross-validation and independent testing demonstrate that EGPDI outperforms state-of-the-art methods. Further comparative experiments and case studies also verify the superiority and generalization ability of EGPDI.
Affiliation(s)
- Mengxin Zheng
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
| | - Guicong Sun
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
| | - Xueping Li
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
| | - Yongxian Fan
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin 541004, China
| |
|
46
|
Wells J, Hawkins-Hooker A, Bordin N, Sillitoe I, Paige B, Orengo C. Chainsaw: protein domain segmentation with fully convolutional neural networks. Bioinformatics 2024; 40:btae296. [PMID: 38718225 PMCID: PMC11256964 DOI: 10.1093/bioinformatics/btae296] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Revised: 03/23/2024] [Accepted: 05/07/2024] [Indexed: 05/23/2024] Open
Abstract
MOTIVATION: Protein domains are fundamental units of protein structure and play a pivotal role in understanding folding, function, evolution, and design. The advent of accurate structure prediction techniques has resulted in an influx of new structural data, making the partitioning of these structures into domains essential for inferring evolutionary relationships and functional classification. RESULTS: This article presents Chainsaw, a supervised learning approach to domain parsing that achieves accuracy that surpasses current state-of-the-art methods. Chainsaw uses a fully convolutional neural network which is trained to predict the probability that each pair of residues is in the same domain. Domain predictions are then derived from these pairwise predictions using an algorithm that searches for the most likely assignment of residues to domains given the set of pairwise co-membership probabilities. Chainsaw matches CATH domain annotations in 78% of protein domains versus 72% for the next closest method. When predicting on AlphaFold models, expert human evaluators were twice as likely to prefer Chainsaw's predictions versus the next best method. AVAILABILITY AND IMPLEMENTATION: github.com/JudeWells/Chainsaw.
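The step from pairwise co-membership probabilities to a domain assignment can be illustrated with a toy sketch. This is not the Chainsaw implementation (the repository named in the abstract holds that); it is a simplified greedy stand-in, assuming that pairs are scored independently and that residues are placed one at a time. The function names `log_likelihood` and `greedy_domains` and the example matrix are hypothetical.

```python
import math


def _clamp(p, eps=1e-9):
    """Keep a probability away from 0 and 1 so that log() stays finite."""
    return min(max(p, eps), 1.0 - eps)


def log_likelihood(assignment, probs):
    """Log-likelihood of a domain assignment, treating each residue pair independently.

    assignment[i] is the domain label of residue i; probs[i][j] is the
    predicted probability that residues i and j share a domain.
    """
    total = 0.0
    n = len(assignment)
    for i in range(n):
        for j in range(i + 1, n):
            p = _clamp(probs[i][j])
            same = assignment[i] == assignment[j]
            total += math.log(p) if same else math.log(1.0 - p)
    return total


def greedy_domains(probs):
    """Place residues in order, each into the existing or new domain that
    gives the highest log-likelihood against the residues already placed."""
    assignment = []
    n_domains = 0
    for i in range(len(probs)):
        best_label, best_score = None, -math.inf
        # Candidate labels: every existing domain plus one new domain.
        for label in range(n_domains + 1):
            score = 0.0
            for j, label_j in enumerate(assignment):
                p = _clamp(probs[i][j])
                score += math.log(p) if label_j == label else math.log(1.0 - p)
            if score > best_score:
                best_label, best_score = label, score
        assignment.append(best_label)
        if best_label == n_domains:
            n_domains += 1
    return assignment


# Hypothetical 4-residue example: residues 0-1 and 2-3 form two blocks.
probs = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
]
domains = greedy_domains(probs)
print(domains)
```

A greedy pass like this depends on residue order and can miss the best assignment, which is one reason a real method would use a more careful search.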
Affiliation(s)
- Jude Wells
- Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
| | - Alex Hawkins-Hooker
- Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
| | - Nicola Bordin
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
| | - Ian Sillitoe
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
| | - Brooks Paige
- Centre for Artificial Intelligence, University College London, WC1E 6BT, United Kingdom
| | - Christine Orengo
- Institute of Structural and Molecular Biology, University College London, WC1E 6BT, United Kingdom
| |
|
47
|
Meloni M, Rossi J, Fanti S, Carloni G, Tedesco D, Treffon P, Piccinini L, Falini G, Trost P, Vierling E, Licausi F, Giuntoli B, Musiani F, Fermani S, Zaffagnini M. Structural and biochemical characterization of Arabidopsis alcohol dehydrogenases reveals distinct functional properties but similar redox sensitivity. The Plant Journal: for Cell and Molecular Biology 2024; 118:1054-1070. [PMID: 38308388 DOI: 10.1111/tpj.16651] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2023] [Revised: 01/07/2024] [Accepted: 01/18/2024] [Indexed: 02/04/2024]
Abstract
Alcohol dehydrogenases (ADHs) are a group of zinc-binding enzymes belonging to the medium-length dehydrogenase/reductase (MDR) protein superfamily. In plants, these enzymes fulfill important functions involving the reduction of toxic aldehydes to the corresponding alcohols (as well as catalyzing the reverse reaction, i.e., alcohol oxidation; ADH1) and the reduction of nitrosoglutathione (GSNO; ADH2/GSNOR). We investigated and compared the structural and biochemical properties of ADH1 and GSNOR from Arabidopsis thaliana. We expressed and purified ADH1 and GSNOR and determined two new structures, NADH-ADH1 and apo-GSNOR, thus completing the structural landscape of Arabidopsis ADHs in both apo- and holo-forms. A structural comparison of these Arabidopsis ADHs revealed a high sequence conservation (59% identity) and a similar fold. In contrast, a striking dissimilarity was observed in the catalytic cavity supporting substrate specificity and accommodation. Consistently, ADH1 and GSNOR showed strict specificity for their substrates (ethanol and GSNO, respectively), although both enzymes had the ability to oxidize long-chain alcohols, with ADH1 performing better than GSNOR. Both enzymes contain a high number of cysteines (12 and 15 out of 379 residues for ADH1 and GSNOR, respectively) and showed a significant and similar responsivity to thiol-oxidizing agents, indicating that redox modifications may constitute a mechanism for controlling enzyme activity under both optimal growth and stress conditions.
Affiliation(s)
- Maria Meloni
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Jacopo Rossi
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Silvia Fanti
- Department of Chemistry "G. Ciamician", University of Bologna, 40126, Bologna, Italy
| | - Giacomo Carloni
- Department of Chemistry "G. Ciamician", University of Bologna, 40126, Bologna, Italy
| | - Daniele Tedesco
- Institute for Organic Synthesis and Photoreactivity (ISOF), National Research Council of Italy (CNR), 40129, Bologna, Italy
| | - Patrick Treffon
- Department of Biochemistry and Molecular Biology, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| | - Luca Piccinini
- Department of Biology, University of Pisa, Pisa, 56127, Italy
- Center for Plant Sciences, Scuola Superiore Sant'Anna, Pisa, 56124, Italy
| | - Giuseppe Falini
- Department of Chemistry "G. Ciamician", University of Bologna, 40126, Bologna, Italy
| | - Paolo Trost
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Elizabeth Vierling
- Department of Biochemistry and Molecular Biology, University of Massachusetts Amherst, Amherst, Massachusetts, USA
| | | | - Beatrice Giuntoli
- Department of Biology, University of Pisa, Pisa, 56127, Italy
- Center for Plant Sciences, Scuola Superiore Sant'Anna, Pisa, 56124, Italy
| | - Francesco Musiani
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| | - Simona Fermani
- Department of Chemistry "G. Ciamician", University of Bologna, 40126, Bologna, Italy
- Interdepartmental Centre for Industrial Research Health Sciences & Technologies, University of Bologna, 40064, Bologna, Italy
| | - Mirko Zaffagnini
- Department of Pharmacy and Biotechnology, University of Bologna, 40126, Bologna, Italy
| |
|
48
|
Wankowicz SA, Ravikumar A, Sharma S, Riley BT, Raju A, Flowers J, Hogan D, van den Bedem H, Keedy DA, Fraser JS. Uncovering Protein Ensembles: Automated Multiconformer Model Building for X-ray Crystallography and Cryo-EM. bioRxiv: the preprint server for biology 2024:2023.06.28.546963. [PMID: 37425870 PMCID: PMC10327213 DOI: 10.1101/2023.06.28.546963] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 07/11/2023]
Abstract
In their folded state, biomolecules exchange between multiple conformational states that are crucial for their function. Traditional structural biology methods, such as X-ray crystallography and cryogenic electron microscopy (cryo-EM), produce density maps that are ensemble averages, reflecting molecules in various conformations. Yet, most models derived from these maps explicitly represent only a single conformation, overlooking the complexity of biomolecular structures. To accurately reflect the diversity of biomolecular forms, there is a pressing need to shift towards modeling structural ensembles that mirror the experimental data. However, the challenge of distinguishing signal from noise complicates manual efforts to create these models. In response, we introduce the latest enhancements to qFit, an automated computational strategy designed to incorporate protein conformational heterogeneity into models built into density maps. These algorithmic improvements in qFit are substantiated by superior Rfree and geometry metrics across a wide range of proteins. Importantly, unlike more complex multicopy ensemble models, the multiconformer models produced by qFit can be manually modified in most major model building software (e.g. Coot) and fit can be further improved by refinement using standard pipelines (e.g. Phenix, Refmac, Buster). By reducing the barrier of creating multiconformer models, qFit can foster the development of new hypotheses about the relationship between macromolecular conformational dynamics and function.
Affiliation(s)
- Stephanie A. Wankowicz
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Ashraya Ravikumar
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Shivani Sharma
- Structural Biology Initiative, CUNY Advanced Science Research Center, New York, NY 10031
- Ph.D. Program in Biology, The Graduate Center – City University of New York, New York, NY 10016
| | - Blake T. Riley
- Structural Biology Initiative, CUNY Advanced Science Research Center, New York, NY 10031
| | - Akshay Raju
- Structural Biology Initiative, CUNY Advanced Science Research Center, New York, NY 10031
| | - Jessica Flowers
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Daniel Hogan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Henry van den Bedem
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Atomwise, Inc., San Francisco, CA, United States
| | - Daniel A. Keedy
- Structural Biology Initiative, CUNY Advanced Science Research Center, New York, NY 10031
- Department of Chemistry and Biochemistry, City College of New York, New York, NY 10031
- Ph.D. Programs in Biochemistry, Biology, and Chemistry, The Graduate Center – City University of New York, New York, NY 10016
| | - James S. Fraser
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| |
|
49
|
Aleksandrova AA, Sarti E, Forrest LR. EncoMPASS: An encyclopedia of membrane proteins analyzed by structure and symmetry. Structure 2024; 32:492-504.e4. [PMID: 38367624 PMCID: PMC11251422 DOI: 10.1016/j.str.2024.01.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2018] [Revised: 01/09/2024] [Accepted: 01/10/2024] [Indexed: 02/19/2024]
Abstract
Protein structure determination and prediction, active site detection, and protein sequence alignment techniques all exploit information about protein structure and structural relationships. For membrane proteins, however, there is limited agreement among available online tools for highlighting and mapping such structural similarities. Moreover, no available resource provides a systematic overview of quaternary and internal symmetries, and their orientation relative to the membrane, despite the fact that these properties can provide key insights into membrane protein function and evolution. Here, we describe the Encyclopedia of Membrane Proteins Analyzed by Structure and Symmetry (EncoMPASS), a database for relating integral membrane proteins of known structure from the points of view of sequence, structure, and symmetry. EncoMPASS is accessible through a web interface, and its contents can be easily downloaded. This allows the user not only to focus on specific proteins, but also to study general properties of the structure and evolution of membrane proteins.
Affiliation(s)
- Antoniya A Aleksandrova
- Computational Structural Biology Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA
| | - Edoardo Sarti
- Computational Structural Biology Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA
| | - Lucy R Forrest
- Computational Structural Biology Section, National Institute of Neurological Disorders and Stroke, National Institutes of Health, Bethesda, MD 20892, USA
| |
|
50
|
Rozano L, Jones DAB, Hane JK, Mancera RL. Template-Based Modelling of the Structure of Fungal Effector Proteins. Mol Biotechnol 2024; 66:784-813. [PMID: 36940017 PMCID: PMC11043172 DOI: 10.1007/s12033-023-00703-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/31/2022] [Accepted: 02/14/2023] [Indexed: 03/21/2023]
Abstract
The discovery of new fungal effector proteins is necessary to enable the screening of cultivars for disease resistance. Sequence-based bioinformatics methods have been used for this purpose, but only a limited number of functional effector proteins have been successfully predicted and subsequently validated experimentally. A significant obstacle is that many fungal effector proteins discovered so far lack sequence similarity or conserved sequence motifs. The availability of experimentally determined three-dimensional (3D) structures of a number of effector proteins has recently highlighted structural similarities amongst groups of sequence-dissimilar fungal effectors, enabling the search for similar structural folds amongst effector sequence candidates. We have applied template-based modelling to predict the 3D structures of candidate effector sequences obtained from bioinformatics predictions and the PHI-BASE database. Structural matches were found not only with ToxA- and MAX-like effector candidates but also with non-fungal effector-like proteins (including plant defensins and animal venoms), suggesting the broad conservation of ancestral structural folds amongst cytotoxic peptides from a diverse range of distant species. Accurate modelling of fungal effectors was achieved using RaptorX. The utility of predicted structures of effector proteins lies in the prediction of their interactions with plant receptors through molecular docking, which will improve the understanding of effector-plant interactions.
Affiliation(s)
- Lina Rozano
- Curtin Medical School, Curtin Health Innovation Research Institute, GPO Box U1987, Perth, WA, 6845, Australia
- Curtin Institute for Computation, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia
| | - Darcy A B Jones
- Centre for Crop and Disease Management, School of Molecular and Life Sciences, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia
- Curtin Institute for Computation, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia
| | - James K Hane
- Centre for Crop and Disease Management, School of Molecular and Life Sciences, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia
- Curtin Institute for Computation, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia
| | - Ricardo L Mancera
- Curtin Medical School, Curtin Health Innovation Research Institute, GPO Box U1987, Perth, WA, 6845, Australia.
- Curtin Institute for Computation, Curtin University, GPO Box U1987, Perth, WA, 6845, Australia.
| |
|