1. Liu H, Laiho A, Törönen P, Holm L. 3-D substructure search by transitive closure in AlphaFold database. Protein Sci 2025; 34:e70169. [PMID: 40400345 PMCID: PMC12095923 DOI: 10.1002/pro.70169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/30/2024] [Revised: 05/01/2025] [Accepted: 05/02/2025] [Indexed: 05/23/2025]
Abstract
Identifying structural relationships between proteins is crucial for understanding their functions and evolutionary histories. We present ISS_ProtSci, a Python package designed for structural similarity searches within the AlphaFold Database v2 (AFDB2). ISS_ProtSci incorporates DaliLite to identify geometrically similar structures and uses a transitive closure algorithm to iteratively explore neighboring shells of proteins. The precomputed all-against-all comparisons generated by Foldseek, chosen for its speed, are validated by DaliLite for precision. Search results are annotated with metadata from UniProtKB and Pfam protein family classifications, using hmmsearch to identify protein domains. Outputs, including Dali pairwise alignment data, are provided in TSV format for easy filtering and analysis. Our method offers a significant improvement in recall over existing tools like Foldseek, especially in detecting more distantly related proteins. This is particularly valuable in structurally diverse protein families where traditional sequence-based or fast structural methods struggle. ISS_ProtSci delivers practical runtimes and flexibility, allowing users to input a PDB file, define the minimum size of the common core, and evaluate results using Pfam clans. In evaluating our method across 12 test cases based on Pfam clans, we achieved over 99% recall of relevant proteins, even in challenging cases where Foldseek's recall dropped below 50%. ISS_ProtSci not only identifies closely related proteins but also uncovers previously unrecognized structural relationships, contributing to more accurate protein family classifications. The software can be downloaded from http://ekhidna2.biocenter.helsinki.fi/ISS_ProtSci/.
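The shell-by-shell exploration described in the abstract can be illustrated with a small breadth-first expansion. This is only a sketch under stated assumptions, not the ISS_ProtSci implementation: the `candidates` dictionary stands in for precomputed fast-search neighbor lists, `validate` is a hypothetical placeholder for a stricter structural check, and all protein names are invented.

```python
def expand_shells(query, candidates, validate, max_rounds=None):
    """Explore neighbors shell by shell until no new validated protein appears.

    candidates: dict mapping a protein to its candidate neighbors
                (hypothetical stand-in for precomputed comparisons).
    validate:   callable returning True if a candidate is accepted
                (hypothetical stand-in for a precise structural check).
    """
    seen = {query}          # every protein already examined, accepted or not
    shells = [[query]]      # shell 0 is the query itself
    frontier = [query]
    rounds = 0
    while frontier and (max_rounds is None or rounds < max_rounds):
        next_frontier = []
        for protein in frontier:
            for neighbor in candidates.get(protein, []):
                if neighbor in seen:
                    continue
                seen.add(neighbor)      # examine each protein at most once
                if validate(neighbor):
                    next_frontier.append(neighbor)
        if not next_frontier:
            break
        shells.append(next_frontier)
        frontier = next_frontier
        rounds += 1
    return shells


# Toy neighbor graph with invented identifiers.
toy_candidates = {
    "Q": ["A", "B"],
    "A": ["Q", "C"],
    "B": ["A"],
    "C": ["D"],
    "D": [],
}
shells = expand_shells("Q", toy_candidates, validate=lambda p: p != "D")
print(shells)
```

In this toy graph, "C" is not a direct neighbor of the query but is reached through "A", which is the kind of indirect hit a single-pass search would miss.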
Affiliation(s)
- Hao Liu
  - Organismal and Evolutionary Biology Research Program, Faculty of Biological and Environmental Sciences, University of Helsinki, Helsinki, Finland
- Aleksi Laiho
  - Organismal and Evolutionary Biology Research Program, Faculty of Biological and Environmental Sciences, University of Helsinki, Helsinki, Finland
- Petri Törönen
  - Organismal and Evolutionary Biology Research Program, Faculty of Biological and Environmental Sciences, University of Helsinki, Helsinki, Finland
- Liisa Holm
  - Organismal and Evolutionary Biology Research Program, Faculty of Biological and Environmental Sciences, University of Helsinki, Helsinki, Finland
  - Institute of Biotechnology, HiLIFE, University of Helsinki, Helsinki, Finland
2. Kandathil SM, Lau AM, Buchan DWA, Jones DT. Foldclass and Merizo-search: scalable structural similarity search for single- and multi-domain proteins using geometric learning. Bioinformatics 2025; 41:btaf277. [PMID: 40326701 PMCID: PMC12122203 DOI: 10.1093/bioinformatics/btaf277] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2024] [Revised: 04/03/2025] [Accepted: 05/07/2025] [Indexed: 05/07/2025]
Abstract
MOTIVATION: The availability of very large numbers of protein structures from accurate computational methods poses new challenges in storing, searching and detecting relationships between these structures. In particular, the new-found abundance of multi-domain structures in the AlphaFold structure database introduces challenges for traditional structure comparison methods. RESULTS: We address these challenges using a fast, embedding-based structure comparison method called Foldclass, which detects structural similarity between protein domains. We demonstrate the accuracy of Foldclass embeddings for homology detection. In combination with a recently developed deep learning-based automatic domain segmentation tool, Merizo, we develop Merizo-search, which first segments multi-domain query structures into domains, and then searches a Foldclass embedding database to determine the top matches for each constituent domain. Combining the ability of Merizo to accurately segment complete chains into domains, and Foldclass to embed and detect similar domains, the Merizo-search tool can be used to rapidly detect per-domain similarities for complete chains, taking as little as 2 min to search all 365 million domains from the Encyclopedia of Domains. We anticipate that these tools will enable many analyses using the wealth of predicted structural data now available. AVAILABILITY AND IMPLEMENTATION: Foldclass and Merizo-search are available at https://github.com/psipred/merizo_search. The version used in this publication is archived at https://doi.org/10.5281/zenodo.15120830. Merizo-search is also available on the PSIPRED web server at http://bioinf.cs.ucl.ac.uk/psipred.
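The segment-then-search workflow described above can be sketched as a per-domain nearest-neighbor lookup over embedding vectors. This is a minimal illustration, not the Merizo-search code: the segmentation and embedding steps are replaced by hand-written vectors, cosine similarity is an assumed metric (the abstract does not name one), and every identifier below is hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def top_matches(query_vec, database, k=1):
    """Return the k database entries most similar to query_vec as (name, score) pairs."""
    scored = sorted(
        ((cosine(query_vec, vec), name) for name, vec in database.items()),
        reverse=True,
    )
    return [(name, score) for score, name in scored[:k]]


def search_chain(domain_embeddings, database, k=1):
    """Search each domain of a segmented chain independently."""
    return {dom: top_matches(vec, database, k) for dom, vec in domain_embeddings.items()}


# Hypothetical embedding database and a query chain already split into two domains.
toy_database = {
    "domA": [1.0, 0.0, 0.0],
    "domB": [0.0, 1.0, 0.0],
    "domC": [0.9, 0.1, 0.0],
}
query_domains = {
    "chain_d1": [1.0, 0.0, 0.0],
    "chain_d2": [0.0, 1.0, 0.1],
}
hits = search_chain(query_domains, toy_database, k=2)
for domain, matches in hits.items():
    print(domain, [name for name, _ in matches])
```

Searching per domain rather than per chain means a two-domain query can match two unrelated database entries, one for each domain, which a whole-chain comparison could obscure. A linear scan like this would not scale to hundreds of millions of entries; it only shows the shape of the lookup.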
Affiliation(s)
- Shaun M Kandathil
  - Department of Computer Science, University College London, London WC1E 6BT, United Kingdom
- Andy M Lau
  - Department of Computer Science, University College London, London WC1E 6BT, United Kingdom
- Daniel W A Buchan
  - Department of Computer Science, University College London, London WC1E 6BT, United Kingdom
- David T Jones
  - Department of Computer Science, University College London, London WC1E 6BT, United Kingdom
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, United Kingdom
3. Lisacek F, Schnider B, Imberty A. Tools for structural lectinomics: From structures to lectomes. BBA Advances 2025; 7:100154. [PMID: 40166736 PMCID: PMC11957679 DOI: 10.1016/j.bbadva.2025.100154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2024] [Revised: 02/24/2025] [Accepted: 03/05/2025] [Indexed: 04/02/2025]
Abstract
Lectins are ubiquitous proteins that interact with glycans in a variety of molecular processes and, as such, also play a role in diseases, whether infectious, chronic or cancer-related. The systematic study of lectins is therefore essential, in particular for understanding cell-cell communication. The three-dimensional protein structural data accumulated over the past decades have boosted advances in AI-based prediction and opened up new options for characterising lectins, which are often multimeric and multivalent. This article reviews the methods for obtaining lectin structures, the data currently available on lectin 3D structures and their interactions, and how this knowledge is used to classify these proteins. It also shows that combining an array of bioinformatics tools should make the prediction of binding specificity possible in the near future.
Affiliation(s)
- Frédérique Lisacek
  - SIB Swiss Institute of Bioinformatics, CH-1227 Geneva, Switzerland
  - Computer Science Department, UniGe, CH-1227 Geneva, Switzerland
- Boris Schnider
  - SIB Swiss Institute of Bioinformatics, CH-1227 Geneva, Switzerland
  - Computer Science Department, UniGe, CH-1227 Geneva, Switzerland
- Anne Imberty
  - Univ. Grenoble Alpes, CNRS, CERMAV, 38000 Grenoble, France
4. Asghar R, Wu N, Ali N, Wang Y, Akkaya M. Computational studies reveal structural characterization and novel families of Puccinia striiformis f. sp. tritici effectors. PLoS Comput Biol 2025; 21:e1012503. [PMID: 40153705 PMCID: PMC11952758 DOI: 10.1371/journal.pcbi.1012503] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2024] [Accepted: 02/24/2025] [Indexed: 03/30/2025]
Abstract
Understanding the biological functions of Puccinia striiformis f. sp. tritici (Pst) effectors is fundamental for uncovering the mechanisms of pathogenicity and variability, thereby paving the way for developing durable and effective control strategies for stripe rust. However, due to the lack of an efficient genetic transformation system in Pst, progress in effector function studies has been slow. Here, we modeled the structures of 15,201 effectors from twelve Pst races or isolates, a Puccinia striiformis isolate, and one Puccinia striiformis f. sp. hordei isolate using AlphaFold2. Of these, 8,102 folds were successfully predicted, and we performed sequence- and structure-based annotations of these effectors. These effectors were classified into 410 structure clusters and 1,005 sequence clusters. Sequence lengths varied widely, with a concentration between 101 and 250 amino acids, and motif analysis revealed that 47% and 5.81% of the predicted effectors contain the known effector motifs [Y/F/W]xC and RxLR, respectively, highlighting the structural conservation across a substantial portion of the effectors. Subcellular localization predictions indicated a predominant cytoplasmic localization, with notable chloroplast and nuclear presence. Structure-guided analysis significantly enhances effector prediction efficiency, as demonstrated by the 75% of the 8,102 effectors that received a structural annotation. Clustering and annotation prediction, both based on sequence and structure homologies, allowed us to determine the folds or fold families adopted by the effectors. A commonly observed feature was structural homology arising from different sequences. In our study, one of the comparative structural analyses revealed a new structure family with a core structure of four helices, including Pst27791, PstGSRE4, and PstSIE1, which target key wheat immune pathway proteins, impacting the host immune functions. Further comparative structural analysis showed similarities between Pst effectors and effectors from other pathogens, such as AvrSr35, AvrSr50, Zt-KP4-1, and MoHrip2, highlighting the possibility of convergent evolutionary strategies, which has yet to be supported by further data encompassing some evolutionarily distant species. Currently, our initial analysis is the most comprehensive one on the sequence, structural and annotation relationships of Pst effectors, providing a novel foundation to advance our future understanding of Pst pathogenicity and evolution.
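The motif percentages reported in this abstract come from scanning effector sequences for the [Y/F/W]xC and RxLR patterns. A minimal sketch of such a scan follows. It assumes the motif may occur anywhere in the sequence, which may differ from the positional constraints the authors applied, and the sequences and identifiers are invented for illustration.

```python
import re

# "x" in the motif notation means any single residue, written "." in a regex.
MOTIFS = {
    "[Y/F/W]xC": re.compile(r"[YFW].C"),
    "RxLR": re.compile(r"R.LR"),
}


def motif_fractions(sequences):
    """Return, for each motif, the fraction of sequences containing it at least once."""
    total = len(sequences)
    if total == 0:
        return {name: 0.0 for name in MOTIFS}
    return {
        name: sum(1 for seq in sequences.values() if pattern.search(seq)) / total
        for name, pattern in MOTIFS.items()
    }


# Invented toy sequences, not real Pst effectors.
toy_effectors = {
    "eff1": "MKFSCLLA",   # contains FSC
    "eff2": "MRALRTT",    # contains RALR
    "eff3": "MAAAGG",     # contains neither motif
    "eff4": "MYQCRSLRK",  # contains YQC and RSLR
}
fractions = motif_fractions(toy_effectors)
print(fractions)
```

Because a single sequence can carry both motifs, as "eff4" does here, the two fractions are counted independently and need not sum to at most one.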
Affiliation(s)
- Raheel Asghar
  - School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, China
- Nan Wu
  - School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, China
- Noman Ali
  - School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, China
- Yulei Wang
  - School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, China
- Mahinur Akkaya
  - School of Bioengineering, Dalian University of Technology, Dalian, Liaoning, China
5. Liang Y, Zhao Y, Yin Z, Zeng X, Han X, Wen M. Functional and structural insights into α-L-Rhamnosidase: cloning, characterization, and decoding evolutionary constraints through structural motif. Arch Microbiol 2025; 207:61. [PMID: 39954080 DOI: 10.1007/s00203-025-04259-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2024] [Revised: 01/22/2025] [Accepted: 01/29/2025] [Indexed: 02/17/2025]
Abstract
α-L-rhamnosidase [E.C. 3.2.1.40] is important in various industrial and biotechnological applications. However, limited knowledge of the structural features of its active site residues and their local geometric arrangements during substrate interaction hinders further application development. In this study, we examined functionally characterized microbial α-L-rhamnosidases. Despite considerable differences in their global structures, the local structures of the substrate-binding sites and key residues were highly conserved. Using the local structural motif, we characterized α-L-rhamnosidase genes from metagenomic samples of traditional fermentation starters. To comprehensively understand the distribution of α-L-rhamnosidases with this motif in the AlphaFold database, we screened 26,858 α-L-rhamnosidase structures. Our findings showed that only 5,678 out of 26,858 structures contain the specific conserved motifs, emphasizing their potential significance in mining enzyme function. Moreover, the analysis of structural diversity among representative enzymes demonstrated variation in the number and types of domains within this enzyme family. Further investigation of representative α-L-rhamnosidase sequences with this structural motif confirmed the evolutionary constraints of 15 key residues, indicating strong selective pressures to maintain these elements essential for enzyme functionality. These residues were consistently present across ancestral sequences, underscoring their importance throughout the enzyme's evolutionary history. This study suggests that structure-guided approaches are valuable for discovering functional enzymes. Identifying conserved motifs across diverse microbial taxa not only aids in predicting enzyme functionality but also offers opportunities for enzyme engineering and biotechnological applications.
Affiliation(s)
- Yupeng Liang
  - National Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Key Laboratory of Microbial Diversity in Southwest China, Yunnan Institute of Microbiology, School of Life Sciences, Ministry of Education, Yunnan University, Kunming, 650500, Yunnan, China
- Yalan Zhao
  - National Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Key Laboratory of Microbial Diversity in Southwest China, Yunnan Institute of Microbiology, School of Life Sciences, Ministry of Education, Yunnan University, Kunming, 650500, Yunnan, China
- Zhongwei Yin
  - National Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Key Laboratory of Microbial Diversity in Southwest China, Yunnan Institute of Microbiology, School of Life Sciences, Ministry of Education, Yunnan University, Kunming, 650500, Yunnan, China
- Xin Zeng
  - College of Mathematics and Computer Science, Dali University, Dali, 671003, China
- Xiulin Han
  - National Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Key Laboratory of Microbial Diversity in Southwest China, Yunnan Institute of Microbiology, School of Life Sciences, Ministry of Education, Yunnan University, Kunming, 650500, Yunnan, China
- Mengliang Wen
  - National Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Key Laboratory of Microbial Diversity in Southwest China, Yunnan Institute of Microbiology, School of Life Sciences, Ministry of Education, Yunnan University, Kunming, 650500, Yunnan, China
6. Pajkos M, Clerc I, Zanon C, Bernadó P, Cortés J. AFflecto: A web server to generate conformational ensembles of flexible proteins from AlphaFold models. J Mol Biol 2025:169003. [PMID: 40133775 DOI: 10.1016/j.jmb.2025.169003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/29/2024] [Revised: 02/04/2025] [Accepted: 02/10/2025] [Indexed: 03/27/2025]
Abstract
Intrinsically disordered proteins and regions (IDPs/IDRs) leverage their structural flexibility to fulfill essential cellular functions, with dysfunctions often linked to severe diseases. However, the relationships between their sequences, structural dynamics and functional roles remain poorly understood. Understanding these complex relationships is crucial for therapeutic development, highlighting the need for methods to generate plausible IDP/IDR conformational ensembles. While AlphaFold (AF) excels at modeling structured domains, it fails to accurately represent disordered regions, leaving a significant portion of proteomes inaccurately modeled. We present AFflecto, a user-friendly web server for generating large conformational ensembles of proteins that include both structured domains and IDRs from AF structural models. AFflecto identifies IDRs as tails, linkers or loops by analyzing their structural context. Additionally, it incorporates a method to identify conditionally folded IDRs that AF may incorrectly predict as natively folded elements. The conformational space is globally explored using efficient stochastic sampling algorithms. AFflecto's web interface allows users to customize the modeling, by modifying boundaries between ordered and disordered regions, and selecting among several sampling strategies. The web server is freely available at https://moma.laas.fr/applications/AFflecto/.
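The tail, linker and loop categories mentioned above can be illustrated with a purely positional heuristic. The abstract does not give AFflecto's actual criteria, so everything here is an assumption: a disordered region touching either terminus is called a tail, one flanked on both sides by the same domain is called a loop, and anything else is called a linker. The function name, the input layout and the example coordinates are all hypothetical.

```python
def classify_idrs(length, domain_segments, idrs):
    """Label each disordered region as 'tail', 'linker' or 'loop'.

    length:          number of residues in the chain (1-based numbering).
    domain_segments: list of (start, end, domain_id) with inclusive bounds;
                     one domain may own several segments.
    idrs:            list of (start, end) disordered regions, inclusive bounds.
    """
    def domain_at(position):
        # Return the id of the domain covering this residue, or None.
        for start, end, domain_id in domain_segments:
            if start <= position <= end:
                return domain_id
        return None

    labels = {}
    for start, end in idrs:
        if start == 1 or end == length:
            labels[(start, end)] = "tail"       # reaches the N- or C-terminus
            continue
        left = domain_at(start - 1)
        right = domain_at(end + 1)
        if left is not None and left == right:
            labels[(start, end)] = "loop"       # same domain on both sides
        else:
            labels[(start, end)] = "linker"     # connects different domains
    return labels


# Hypothetical 200-residue chain: domain D1 is split by a disordered loop.
segments = [(11, 60, "D1"), (71, 100, "D1"), (121, 190, "D2")]
regions = [(1, 10), (61, 70), (101, 120), (191, 200)]
labels = classify_idrs(200, segments, regions)
print(labels)
```

A real classifier would need structural context, such as spatial contacts, rather than sequence position alone. The sketch only makes the three categories concrete.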
Affiliation(s)
- Mátyás Pajkos
  - LAAS-CNRS, Université de Toulouse, CNRS, Toulouse, France
- Ilinka Clerc
  - LAAS-CNRS, Université de Toulouse, CNRS, Toulouse, France
- Pau Bernadó
  - Centre de Biologie Structurale, Université de Montpellier, INSERM, CNRS, Montpellier, France
- Juan Cortés
  - LAAS-CNRS, Université de Toulouse, CNRS, Toulouse, France
7. Rigden DJ, Fernández XM. The 2025 Nucleic Acids Research database issue and the online molecular biology database collection. Nucleic Acids Res 2025; 53:D1-D9. [PMID: 39658041 PMCID: PMC11701706 DOI: 10.1093/nar/gkae1220] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 11/26/2024] [Accepted: 11/26/2024] [Indexed: 12/12/2024]
Abstract
The 2025 Nucleic Acids Research database issue contains 185 papers spanning biology and related areas. Seventy-three new databases are covered, while resources previously described in the issue account for 101 update articles. Databases most recently published elsewhere account for a further 11 papers. Nucleic acid databases include EXPRESSO for multi-omics of 3D genome structure (this issue's chosen Breakthrough Resource and Article) and NAIRDB for Fourier transform infrared data. New protein databases include structure predictions for human isoforms at ASpdb and for viral proteins at BFVD. UniProt, Pfam and InterPro have all provided updates; metabolism and signalling are covered by new descriptions of STRING, KEGG and CAZy, while updated microbe-oriented databases include Enterobase, VFDB and PHI-base. Biomedical research is supported, among others, by ClinVar, PubChem and DrugMAP. Genomics-related resources include Ensembl, UCSC Genome Browser and dbSNP. New plant databases cover the Solanaceae (SolR) and Asteraceae (AMIR) families, while an update from NCBI Taxonomy also features. The Database Issue is freely available on the Nucleic Acids Research website (https://academic.oup.com/nar). At the NAR online Molecular Biology Database Collection (http://www.oxfordjournals.org/nar/database/c/), 932 entries have been reviewed in the last year, 74 new resources added and 226 discontinued URLs eliminated, bringing the current total to 2236 databases.
Affiliation(s)
- Daniel J Rigden
  - Department of Biochemistry, Cell and Systems Biology, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Crown Street, Liverpool L69 7ZB, UK
8. Waman V, Bordin N, Lau A, Kandathil S, Wells J, Miller D, Velankar S, Jones D, Sillitoe I, Orengo C. CATH v4.4: major expansion of CATH by experimental and predicted structural data. Nucleic Acids Res 2025; 53:D348-D355. [PMID: 39565206 PMCID: PMC11701635 DOI: 10.1093/nar/gkae1087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2024] [Revised: 10/18/2024] [Accepted: 10/24/2024] [Indexed: 11/21/2024]
Abstract
CATH (https://www.cathdb.info) is a structural classification database that assigns domains to the structures in the Protein Data Bank (PDB) and AlphaFold Protein Structure Database (AFDB) and adds layers of biological information, including homology and functional annotation. This article covers developments in the CATH classification since 2021. We report the significant expansion of structural information (180-fold) for CATH superfamilies through classification of PDB domains and predicted domain structures from the Encyclopedia of Domains (TED) resource. TED provides information on predicted domains in AFDB. CATH v4.4 represents an expansion of ∼64 844 experimentally determined domain structures from PDB. We also present a mapping of ∼90 million predicted domains from TED to CATH superfamilies. New PDB and TED data increases the number of superfamilies from 5841 to 6573, folds from 1349 to 2078 and architectures from 41 to 77. TED data comprises predicted structures, so these new folds and architectures remain hypothetical until experimentally confirmed. CATH also classifies domains into functional families (FunFams) within a superfamily. We have updated sequences in FunFams by scanning FunFam-HMMs against UniProt release 2024_02, giving a 276% increase in FunFams coverage. The mapping of TED structural domains has resulted in a 4-fold increase in FunFams with structural information.
Affiliation(s)
- Vaishali P Waman
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Nicola Bordin
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Andy Lau
  - Department of Computer Science, University College London, London WC1E 6BT, UK
  - InstaDeep Ltd, 5 Merchant Square, London W2 1AY, UK
- Shaun Kandathil
  - Department of Computer Science, University College London, London WC1E 6BT, UK
- Jude Wells
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
  - Centre for Artificial Intelligence, University College London, London WC1V 6BH, UK
- David Miller
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
  - Centre for Artificial Intelligence, University College London, London WC1V 6BH, UK
- Sameer Velankar
  - Protein Data Bank in Europe, European Molecular Biology Laboratory, European Bioinformatics Institute, Hinxton, Cambridge, CB10 1SD, UK
- David T Jones
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
  - Department of Computer Science, University College London, London WC1E 6BT, UK
- Ian Sillitoe
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
- Christine Orengo
  - Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, UK
9. Greener JG, Jamali K. Fast protein structure searching using structure graph embeddings. Bioinformatics Advances 2024; 5:vbaf042. [PMID: 40196750 PMCID: PMC11974391 DOI: 10.1093/bioadv/vbaf042] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/07/2025] [Revised: 02/11/2025] [Accepted: 03/03/2025] [Indexed: 04/09/2025]
Abstract
Comparing and searching protein structures independent of primary sequence has proved useful for remote homology detection, function annotation, and protein classification. Fast and accurate methods to search with structures will be essential to make use of the vast databases that have recently become available, in the same way that fast protein sequence searching underpins much of bioinformatics. We train a simple graph neural network using supervised contrastive learning to learn a low-dimensional embedding of protein domains. Availability and implementation: The method, called Progres, is available as software at https://github.com/greener-group/progres and as a web server at https://progres.mrc-lmb.cam.ac.uk. It has accuracy comparable to the best current methods and can search the AlphaFold database TED domains in a tenth of a second per query on CPU.
Affiliation(s)
- Joe G Greener
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, CB2 0QH, United Kingdom
- Kiarash Jamali
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, CB2 0QH, United Kingdom