1
Chen SF, Steele RJ, Hocky GM, Lemeneh B, Lad SP, Oermann EK. Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions. arXiv 2025:arXiv:2408.16245v4. [PMID: 40236839 PMCID: PMC11998858] [Indexed: 04/17/2025]
Abstract
The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. To date, most biosequence transformers have been trained on a single omic, either proteins or nucleic acids, and have seen incredible success in downstream tasks in each domain, with particularly noteworthy breakthroughs in protein structural modeling. However, single-omic pre-training limits the ability of these models to capture cross-modal interactions. Here we present OmniBioTE, the largest open-source multi-omic model trained on over 250 billion tokens of mixed protein and nucleic acid data. We show that despite only being trained on unlabelled sequence data, OmniBioTE learns joint representations consistent with the central dogma of molecular biology. We further demonstrate that OmniBioTE achieves state-of-the-art results predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, compared to single-omic controls trained with identical compute, OmniBioTE demonstrates superior performance-per-FLOP and absolute accuracy across both multi-omic and single-omic benchmarks, highlighting the power of a unified modeling approach for biological sequences.
2
Georgakis N, Premetis GE, Pantiora P, Varotsou C, Bodourian CS, Labrou NE. The impact of metagenomic analysis on the discovery of novel endolysins. Appl Microbiol Biotechnol 2025; 109:126. [PMID: 40411603 PMCID: PMC12103483 DOI: 10.1007/s00253-025-13513-2] [Received: 01/12/2025] [Revised: 05/04/2025] [Accepted: 05/05/2025] [Indexed: 05/26/2025]
Abstract
Metagenomics has revolutionized enzyme discovery by enabling the study of genetic material directly from environmental samples, bypassing the need for microbial cultivation. This approach is particularly effective for identifying novel endolysins, phage-derived enzymes with antibacterial properties suited for therapeutic and industrial applications. Diverse ecosystems, such as biofilms, the human microbiome, hot springs, and geothermal areas, serve as rich reservoirs for endolysins with traits like thermostability, broad-spectrum activity, specificity, and resistance to harsh conditions. Functional metagenomics, complemented by bioinformatics, enables the discovery and annotation of previously uncharacterized endolysins. Examples of endolysins discovered through metagenomic analysis are discussed. Despite the challenges of analyzing complex microbial ecosystems and isolating target genes, metagenomics holds immense potential for uncovering innovative endolysins, paving the way for developing new biotechnological applications. KEY POINTS: • Endolysins offer antibacterial potential for therapeutic and industrial use. • Metagenomics enables discovery of novel endolysins from diverse ecosystems. • Advances in tools and methods have accelerated novel endolysin discovery.
Affiliation(s)
- Nikolaos Georgakis
  - Laboratory of Enzyme Technology, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, 75 Iera Odos Street, Athens, 11855, Greece
- Georgios E Premetis
  - Laboratory of Enzyme Technology, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, 75 Iera Odos Street, Athens, 11855, Greece
- Panagiota Pantiora
  - Laboratory of Enzyme Technology, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, 75 Iera Odos Street, Athens, 11855, Greece
- Christina Varotsou
  - Laboratory of Enzyme Technology, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, 75 Iera Odos Street, Athens, 11855, Greece
- Charoutioun S Bodourian
  - Laboratory of Enzyme Technology, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, 75 Iera Odos Street, Athens, 11855, Greece
- Nikolaos E Labrou
  - Laboratory of Enzyme Technology, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, 75 Iera Odos Street, Athens, 11855, Greece.
3
Yurtseven A, Keller S, Hirsch P, Kalinina OV, Gress A. StructMAn 2.0 Web: a web server for structural annotation of protein sequences and mutations. Nucleic Acids Res 2025:gkaf381. [PMID: 40326516 DOI: 10.1093/nar/gkaf381] [Received: 03/03/2025] [Revised: 04/11/2025] [Accepted: 04/25/2025] [Indexed: 05/07/2025] Open
Abstract
StructMAn is a method for protein structural annotation. It describes each position of a protein sequence, or specific variants in it, in terms of their importance for the three-dimensional (3D) structure of the protein and its interactions with other molecules. StructMAn maps, aligns, and aggregates data from experimentally resolved and predicted 3D structures of proteins and their homologs for any given protein sequence and/or a combination of mutations in it. The results provide structural annotation for every amino acid position, allowing a detailed structural analysis. Furthermore, StructMAn enables generation of a wide variety of position-specific high-quality structural features that can be leveraged in machine learning applications. With the new web server StructMAn 2.0 Web, we provide a user-friendly way to use StructMAn, offering an easy-to-use input interface and a comprehensive visualization of the various results of StructMAn. StructMAn 2.0 Web is available at https://tools.helmholtz-hips.de/structman.
Affiliation(s)
- Alper Yurtseven
  - Research Group Drug Bioinformatics, Department Drug Bioinformatics, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Campus E8.1, 66123 Saarbrücken, Saarland, Germany
  - Graduate School of Computer Science, Saarland University, 66123 Saarbrücken, Saarland, Germany
- Sebastian Keller
  - Research Group Drug Bioinformatics, Department Drug Bioinformatics, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Campus E8.1, 66123 Saarbrücken, Saarland, Germany
- Pascal Hirsch
  - Chair for Clinical Bioinformatics, Saarland University, 66123 Saarbrücken, Saarland, Germany
- Olga V Kalinina
  - Research Group Drug Bioinformatics, Department Drug Bioinformatics, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Campus E8.1, 66123 Saarbrücken, Saarland, Germany
  - Drug Bioinformatics, Medical Faculty, Saarland University, 66421 Homburg, Saarland, Germany
  - Center for Bioinformatics, Saarland University, 66123 Saarbrücken, Saarland, Germany
- Alexander Gress
  - Research Group Drug Bioinformatics, Department Drug Bioinformatics, Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), Campus E8.1, 66123 Saarbrücken, Saarland, Germany
4
Pokharel S, Barasa K, Pratyush P, KC DB. PLM-DBPs: enhancing plant DNA-binding protein prediction by integrating sequence-based and structure-aware protein language models. Brief Bioinform 2025; 26:bbaf245. [PMID: 40439671 PMCID: PMC12121366 DOI: 10.1093/bib/bbaf245] [Received: 02/06/2025] [Revised: 04/14/2025] [Accepted: 05/05/2025] [Indexed: 06/02/2025] Open
Abstract
DNA-binding proteins (DBPs) play a crucial role in gene regulation, development, and environmental responses across plants, animals, and microorganisms. Existing DBP prediction methods are largely limited to sequence information, whether through handcrafted features or sequence-based protein language models (PLMs), overlooking structural cues critical to protein function. In addition, most existing tools are trained for general DBP prediction and are often not accurate for plant-specific DBPs due to the unique structural and functional properties of plant proteins. Our work introduces PLM-DBPs, a deep learning framework that integrates both sequence-based and structure-aware representations to enhance DBP prediction in plants. We evaluated several state-of-the-art PLMs to extract high-dimensional protein representations and experimented with various fusion strategies to validate the complementary information between the various representations. Our final model, a fusion of sequence-based and structure-aware ANN models, achieves a notable improvement in predicting DBPs in plants, outperforming previous state-of-the-art models. Although sequence-based PLMs already demonstrate strong performance in DBP prediction, our findings show that the integration of structural information further enhances predictive accuracy. This underscores the complementary nature of structural representations and establishes PLM-DBPs as a robust tool for advancing plant research and agricultural innovation. The proposed model and other resources are publicly available at https://github.com/suresh-pokharel/PLM-DBPs.
Affiliation(s)
- Suresh Pokharel
  - Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester 14623, NY, United States
- Kepha Barasa
  - College of Computing, Michigan Technological University, Houghton 49931, MI, United States
- Pawel Pratyush
  - Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester 14623, NY, United States
- Dukka B KC
  - Golisano College of Computing and Information Sciences, Rochester Institute of Technology, Rochester 14623, NY, United States
5
Ciuchcinski K, Kaczorowska AK, Biernacka D, Dorawa S, Kaczorowski T, Park Y, Piekarski K, Stanowski M, Ishikawa T, Stokke R, Steen IH, Dziewit L. Computational pipeline for sustainable enzyme discovery through (re)use of metagenomic data. Journal of Environmental Management 2025; 382:125381. [PMID: 40252419 DOI: 10.1016/j.jenvman.2025.125381] [Received: 12/29/2024] [Revised: 04/03/2025] [Accepted: 04/13/2025] [Indexed: 04/21/2025]
Abstract
Enzymes derived from extremophilic organisms, also known as extremozymes, offer sustainable and efficient solutions for industrial applications. Valued for their resilience and low environmental impact, extremozymes have found use as catalysts in various processes, ranging from dairy production to pharmaceutical manufacturing. However, discovery of novel extremozymes is often hindered by challenges such as culturing difficulties, underrepresentation of extreme environments in reference databases, and limitations of traditional sequence-based screening methods. In this work, we present a computational pipeline designed to discover novel enzymes from metagenomic data derived from extreme environments. This pipeline represents a versatile and sustainable approach that promotes reuse and recycling of existing datasets and minimises the need for additional environmental sampling. At its core, the algorithm integrates both traditional bioinformatic techniques and recent advances in structural prediction, enabling rapid and accurate identification of enzymes. However, due to its design, the algorithm relies heavily on existing databases, which can limit its effectiveness in situations where reference data is scarce or when encountering novel protein families. As a proof-of-concept, we applied the pipeline to metagenomic data from deep-sea hydrothermal vents, with a focus on β-galactosidases. The pipeline identified 11 potential candidate proteins, of which 10 showed in vitro activity. One of the selected enzymes, βGal_UW07, showed strong potential for industrial applications. The enzyme exhibited optimal activity at 70 °C and was exceptionally resistant to high pH and the presence of metal ions and reducing agents. Overall, our results indicate that the pipeline is highly accurate and can play a key role in sustainable bioprospecting, leveraging existing metagenomic datasets and minimising in situ interventions in pristine regions.
Affiliation(s)
- Karol Ciuchcinski
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Miecznikowa 1, 02-096, Warsaw, Poland.
- Anna-Karina Kaczorowska
  - Collection of Plasmids and Microorganisms | KPD, Department of Microbiology, Faculty of Biology, University of Gdańsk, Wita Stwosza 59, 80-308, Gdańsk, Poland.
- Daria Biernacka
  - Collection of Plasmids and Microorganisms | KPD, Department of Microbiology, Faculty of Biology, University of Gdańsk, Wita Stwosza 59, 80-308, Gdańsk, Poland; Structural Biology Laboratory, Intercollegiate Faculty of Biotechnology of University of Gdansk and Medical University of Gdańsk, Abrahama 58, 80-307, Gdańsk, Poland.
- Sebastian Dorawa
  - Laboratory of Extremophiles Biology, Department of Microbiology, Faculty of Biology, University of Gdańsk, Wita Stwosza 59, 80-308, Gdańsk, Poland.
- Tadeusz Kaczorowski
  - Laboratory of Extremophiles Biology, Department of Microbiology, Faculty of Biology, University of Gdańsk, Wita Stwosza 59, 80-308, Gdańsk, Poland.
- Younginn Park
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Miecznikowa 1, 02-096, Warsaw, Poland.
- Karol Piekarski
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Miecznikowa 1, 02-096, Warsaw, Poland.
- Michal Stanowski
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Miecznikowa 1, 02-096, Warsaw, Poland.
- Takao Ishikawa
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Miecznikowa 1, 02-096, Warsaw, Poland.
- Runar Stokke
  - Department of Biological Sciences, Center for Deep Sea Research, University of Bergen, Postboks 7803, N-5020, Bergen, Norway.
- Ida Helene Steen
  - Department of Biological Sciences, Center for Deep Sea Research, University of Bergen, Postboks 7803, N-5020, Bergen, Norway.
- Lukasz Dziewit
  - Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, Miecznikowa 1, 02-096, Warsaw, Poland.
6
Zhang L, Liu T. ATP-Pred: Prediction of Protein-ATP Binding Residues via Fusion of Residue-Level Embeddings and Kolmogorov-Arnold Network. J Chem Inf Model 2025; 65:3812-3826. [PMID: 40119803 DOI: 10.1021/acs.jcim.5c00016] [Indexed: 03/24/2025]
Abstract
Accurately identifying protein-ATP binding residues is essential for understanding biological processes and designing drugs. However, current sequence-based methods have limitations, such as difficulties in extracting discriminative features and the need for more efficient algorithms. Additionally, methods based on multiple sequence alignments often face challenges in handling large-scale predictions. To address these issues, we developed ATP-Pred, a sequence-based method for predicting ATP-binding residues in proteins. This model applies transfer learning by using two recently developed pretrained protein language models, Ankh and ProstT5, to extract residue-level embeddings that capture protein functionality. ATP-Pred also integrates a CNN-BiLSTM network and a Kolmogorov-Arnold network to build the prediction model. To handle data imbalance, we introduced a weighted focal loss function. Experimental results on three independent test data sets showed that ATP-Pred outperforms most existing methods. Its generalizability was further validated on four protein-mononucleotide binding residue data sets, where it delivered promising results. These findings suggest that ATP-Pred is a robust and reliable predictor.
Affiliation(s)
- Lingrong Zhang
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
- Taigang Liu
  - College of Information Technology, Shanghai Ocean University, Shanghai 201306, China
7
Bjerregaard A, Groth PM, Hauberg S, Krogh A, Boomsma W. Foundation models of protein sequences: A brief overview. Curr Opin Struct Biol 2025; 91:103004. [PMID: 39983412 DOI: 10.1016/j.sbi.2025.103004] [Received: 09/16/2024] [Revised: 01/24/2025] [Accepted: 01/26/2025] [Indexed: 02/23/2025]
Abstract
Protein sequence models have evolved from simple statistics of aligned families to versatile foundation models of evolutionary scale. Enabled by self-supervised learning and an abundance of protein sequence data, such foundation models now play a central role in protein science. They facilitate rich representations, powerful generative design, and fine-tuning across diverse domains. In this review, we trace modeling developments and categorize them into methodological trends over the modalities they describe and the contexts they condition upon. Following a brief historical overview, we focus our attention on the most recent trends and outline future perspectives.
Affiliation(s)
- Andreas Bjerregaard
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Peter Mørch Groth
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Novonesis, Kgs. Lyngby, Denmark
- Søren Hauberg
  - Section for Cognitive Systems, Technical University of Denmark, Kgs. Lyngby, Denmark
- Anders Krogh
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark; Center for Health Data Science, University of Copenhagen, Copenhagen, Denmark
- Wouter Boomsma
  - Department of Computer Science, University of Copenhagen, Copenhagen, Denmark.
8
Song C, He S, Qian Y, Li X, Hu Y, Chen J, Wang J, Deng L. DeepMVD: A Novel Multiview Dynamic Feature Fusion Model for Accurate Protein Function Prediction. J Chem Inf Model 2025; 65:3077-3089. [PMID: 40053671 DOI: 10.1021/acs.jcim.4c02216] [Indexed: 03/09/2025]
Abstract
Proteins, as the fundamental macromolecules of life, play critical roles in various biological processes. Recent advancements in intelligent protein function prediction methods leverage sequences, structures, and biomedical literature data. Among them, function prediction methods for protein sequences remain an enduring and popular research direction. Existing studies have failed to effectively utilize the multilevel attribute features reflected in protein sequences. This limitation hinders the enrichment of protein descriptions needed for high-precision prediction of protein functions. To address this, we propose DeepMVD, a novel deep learning model that enhances prediction accuracy by dynamically fusing multiview features. DeepMVD employs specialized modules to extract unique features from each view and utilizes an adaptive fusion mechanism for optimal integration. Evaluation on the CAFA4 data set shows that DeepMVD significantly outperforms existing state-of-the-art models on the BP, MF, and CC ontologies, obtaining the highest Fmax on each (0.523, 0.712, and 0.740, respectively). Ablation studies confirm the model's robustness. Source code and data sets are available at http://swanhub.co/scl/DeepMVD.
Affiliation(s)
- Chaolin Song
  - School of Software, Xinjiang University, Urumqi 830091, China
  - Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi 830091, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
- Shiwen He
  - School of Software, Xinjiang University, Urumqi 830091, China
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Yurong Qian
  - Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi 830091, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
  - School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
  - Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University, Urumqi, Xinjiang 830046, China
- Xinhui Li
  - School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
  - Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University, Urumqi, Xinjiang 830046, China
- Yue Hu
  - School of Computer Science and Technology, Xinjiang University, Urumqi 830046, China
  - Joint International Research Laboratory of Silk Road Multilingual Cognitive Computing, Xinjiang University, Urumqi, Xinjiang 830046, China
- Jiaying Chen
  - School of Software, Xinjiang University, Urumqi 830091, China
  - Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi 830091, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
- Jingfu Wang
  - School of Software, Xinjiang University, Urumqi 830091, China
  - Xinjiang Engineering Research Center of Big Data and Intelligent Software, School of Software, Xinjiang University, Urumqi 830091, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi 830091, China
- Lei Deng
  - School of Software, Xinjiang University, Urumqi 830091, China
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
9
Carmona OG, Kleinjung J, Anastasiou D, Oostenbrink C, Fraternali F. AllohubPy: Detecting Allosteric Signals Through An Information-theoretic Approach. J Mol Biol 2025:168969. [PMID: 39900284 DOI: 10.1016/j.jmb.2025.168969] [Received: 12/15/2024] [Revised: 01/22/2025] [Accepted: 01/24/2025] [Indexed: 02/05/2025]
Abstract
Allosteric regulation is crucial for biological processes like signal transduction, transcriptional regulation, and metabolism, yet the mechanisms and macromolecular properties that govern it are still not well understood. Several methods have been developed over the years to study allosterism from different angles. Among these, information-theoretic approaches, like AlloHubMat or GSAtools, can be particularly effective due to their use of robust statistics and the possibility of combining them with graph analysis. These methods capture local conformational changes associated with global motions from molecular dynamics simulations through the use of a Structural Alphabet, which simplifies the complexity of the Cartesian space by reducing the dimensionality down to a string of encoded fragments, representing sets of internal coordinates that still capture the overall conformational changes. In this work, we present "AllohubPy," an improved and standardized methodology of AlloHubMat and GSAtools coded in Python. We analyse the performance, limitations and sampling requirements of AllohubPy by using extensive molecular dynamics simulations of model allosteric systems and apply convergence analysis techniques to estimate result reliability. Additionally, we expand the methodology to use different dimensionality reduction Structural Alphabets, such as the 3DI alphabet, and integrate Protein Language Models (PLMs) to refine allosteric hub communication detection by monitoring the detected evolutionary constraints. Overall, AllohubPy extends its predecessor methods, making them simpler to use and more reliable at capturing dynamic allosteric motions and residue pathways. AllohubPy is freely available on GitHub (https://github.com/Fraternalilab/AlloHubPy) as a package and as a Jupyter Notebook.
Affiliation(s)
- Oriol Gracia Carmona
  - Department of Structural and Molecular Biology, Division of Biosciences and Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, United Kingdom; Department of Biological Sciences Birkbeck, University of London, London WC1E 7HX, United Kingdom; Randall Centre for Cell & Molecular Biosciences, King's College London, London SE1 1UL, United Kingdom
- Jens Kleinjung
  - Nxera Pharma, Steinmetz & Cori Buildings, Granta Park, Great Abington, Cambridge CB21 6DG, United Kingdom
- Dimitrios Anastasiou
  - Cancer Metabolism Laboratory, The Francis Crick Institute, 1 Midland Road, London NW1 1AT, United Kingdom
- Chris Oostenbrink
  - Institute for Molecular Modeling and Simulation, Department of Material Sciences and Process Engineering, BOKU University, 1190 Vienna, Austria
- Franca Fraternali
  - Department of Structural and Molecular Biology, Division of Biosciences and Institute of Structural and Molecular Biology, University College London, London WC1E 6BT, United Kingdom; Department of Biological Sciences Birkbeck, University of London, London WC1E 7HX, United Kingdom.
10
Majila K, Ullanat V, Viswanath S. A deep learning method for predicting interactions for intrinsically disordered regions of proteins. bioRxiv 2025:2024.12.19.629373. [PMID: 39763873 PMCID: PMC11702703 DOI: 10.1101/2024.12.19.629373] [Indexed: 01/14/2025]
Abstract
Intrinsically disordered proteins or regions (IDPs/IDRs) adopt diverse binding modes with different partners, ranging from ordered to multivalent to fuzzy conformations in the bound state. Characterizing IDR interfaces is challenging experimentally and computationally. AlphaFold-multimer and AlphaFold3, the state-of-the-art structure prediction methods, are less accurate at predicting IDR binding sites at their benchmarked confidence cutoffs. Their performance improves upon lowering the confidence cutoffs. Here, we developed Disobind, a deep-learning method that predicts inter-protein contact maps and interface residues for an IDR and a partner protein, given their sequences. It outperforms AlphaFold-multimer and AlphaFold3 at multiple confidence cutoffs. Combining the Disobind and AlphaFold-multimer predictions further improves the performance. In contrast to most current methods, Disobind considers the context of the binding partner and does not depend on structures and multiple sequence alignments. Its predictions can be used to localize IDRs in integrative structures of large assemblies and to characterize and modulate IDR-mediated interactions.
Affiliation(s)
- Kartik Majila
  - National Center for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, India 560065
- Varun Ullanat
  - National Center for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, India 560065
- Shruthi Viswanath
  - National Center for Biological Sciences, Tata Institute of Fundamental Research, Bangalore, India 560065
11
Chen JY, Wang JF, Hu Y, Li XH, Qian YR, Song CL. Evaluating the advancements in protein language models for encoding strategies in protein function prediction: a comprehensive review. Front Bioeng Biotechnol 2025; 13:1506508. [PMID: 39906415 PMCID: PMC11790633 DOI: 10.3389/fbioe.2025.1506508] [Received: 10/05/2024] [Accepted: 01/02/2025] [Indexed: 02/06/2025] Open
Abstract
Protein function prediction is crucial in several key areas such as bioinformatics and drug design. With the rapid progress of deep learning technology, applying protein language models has become a research focus. These models exploit the increasing amount of large-scale protein sequence data to mine its intrinsic semantic information, which can effectively improve the accuracy of protein function prediction. This review comprehensively surveys the current status of the latest protein language models applied to protein function prediction and provides an exhaustive performance comparison with traditional prediction methods. An in-depth analysis of the experimental results demonstrates the significant advantages of protein language models in enhancing the accuracy and depth of protein function prediction.
Collapse
Affiliation(s)
- Jia-Ying Chen
  - School of Software, Xinjiang University, Urumqi, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
  - Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
- Jing-Fu Wang
  - School of Software, Xinjiang University, Urumqi, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
  - Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
- Yue Hu
  - School of Software, Xinjiang University, Urumqi, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
  - Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
- Xin-Hui Li
  - School of Software, Xinjiang University, Urumqi, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
  - Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
- Yu-Rong Qian
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
  - Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
  - School of Computer Science and Technology, Xinjiang University, Urumqi, China
- Chao-Lin Song
  - School of Software, Xinjiang University, Urumqi, China
  - Key Laboratory of Software Engineering, Xinjiang University, Urumqi, China
  - Key Laboratory of Signal Detection and Processing in Xinjiang Uygur Autonomous Region, Xinjiang University, Urumqi, China
|
12
|
Johnson S, Weigele P, Fomenkov A, Ge A, Vincze A, Eaglesham J, Roberts R, Sun Z. Domainator, a flexible software suite for domain-based annotation and neighborhood analysis, identifies proteins involved in antiviral systems. Nucleic Acids Res 2025; 53:gkae1175. [PMID: 39657740 PMCID: PMC11754643 DOI: 10.1093/nar/gkae1175] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2024] [Revised: 11/07/2024] [Accepted: 11/15/2024] [Indexed: 12/12/2024]
Abstract
The availability of large databases of biological sequences presents an opportunity for in-depth exploration of gene diversity and function. Bacterial defense systems are a rich source of diverse but difficult-to-annotate genes with biotechnological applications. In this work, we present Domainator, a flexible and modular software suite for domain-based gene neighborhood and protein search, extraction, and clustering. We demonstrate the utility of Domainator through three examples related to bacterial defense systems. First, we cluster CRISPR-associated Rossmann fold (CARF)-containing proteins with difficult-to-annotate effector domains, classifying most of them as likely transcriptional regulators and a subset as likely RNases. Second, we extract and cluster P4-like phage satellite defense hotspots, identify an abundant variant of Lamassu defense systems, and demonstrate its in vivo activity against several T-even phages. Third, we integrate a protein language model into Domainator and use it to identify restriction endonucleases with low similarity to known reference sequences, validating the activity of one example in vitro. Domainator is made available as an open-source package with detailed documentation and usage examples.
Affiliation(s)
- Andrew Ge
  - New England Biolabs Inc., Ipswich, MA 01938, USA
- Anna Vincze
  - New England Biolabs Inc., Ipswich, MA 01938, USA
- Zhiyi Sun
  - New England Biolabs Inc., Ipswich, MA 01938, USA
|
13
|
Gonzales MEM, Ureta JC, Shrestha AMS. PHIStruct: improving phage-host interaction prediction at low sequence similarity settings using structure-aware protein embeddings. Bioinformatics 2024; 41:btaf016. [PMID: 39804673 PMCID: PMC11783280 DOI: 10.1093/bioinformatics/btaf016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2024] [Revised: 12/04/2024] [Accepted: 01/10/2025] [Indexed: 02/01/2025]
Abstract
MOTIVATION: Recent computational approaches for predicting phage-host interaction have explored the use of sequence-only protein language models to produce embeddings of phage proteins without manual feature engineering. However, these embeddings do not directly capture protein structure information and structure-informed signals related to host specificity. RESULTS: We present PHIStruct, a multilayer perceptron that takes in structure-aware embeddings of receptor-binding proteins, generated via the structure-aware protein language model SaProt, and then predicts the host from among the ESKAPEE genera. Compared against recent tools, PHIStruct exhibits the best balance of precision and recall, with the highest and most stable F1 score across a wide range of confidence thresholds and sequence similarity settings. The margin in performance is most pronounced when the sequence similarity between the training and test sets drops below 40%, wherein, at a relatively high-confidence threshold of above 50%, PHIStruct presents a 7%-9% increase in class-averaged F1 over machine learning tools that do not directly incorporate structure information, as well as a 5%-6% increase over BLASTp. AVAILABILITY AND IMPLEMENTATION: The data and source code for our experiments and analyses are available at https://github.com/bioinfodlsu/PHIStruct.
Affiliation(s)
- Mark Edward M Gonzales
  - Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila 1004, Philippines
  - College of Computer Studies, De La Salle University, Manila 1004, Philippines
- Jennifer C Ureta
  - Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila 1004, Philippines
  - College of Computer Studies, De La Salle University, Manila 1004, Philippines
  - Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC 3052, Australia
- Anish M S Shrestha
  - Bioinformatics Lab, Advanced Research Institute for Informatics, Computing and Networking, De La Salle University, Manila 1004, Philippines
  - College of Computer Studies, De La Salle University, Manila 1004, Philippines
|
14
|
Tule S, Foley G, Bodén M. Do protein language models learn phylogeny? Brief Bioinform 2024; 26:bbaf047. [PMID: 39987495 PMCID: PMC11847157 DOI: 10.1093/bib/bbaf047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2024] [Revised: 12/20/2024] [Accepted: 02/20/2025] [Indexed: 02/25/2025]
Abstract
Deep machine learning demonstrates a capacity to uncover evolutionary relationships directly from protein sequences, in effect internalising notions inherent to classical phylogenetic tree inference. We connect these two paradigms by assessing the capacity of protein-based language models (pLMs) to discern phylogenetic relationships without being explicitly trained to do so. We evaluate ESM2, ProtTrans, and MSA-Transformer relative to classical phylogenetic methods, while also considering sequence insertions and deletions (indels) across 114 Pfam datasets. The largest ESM2 model tends to outperform other pLMs (including the multimodal ESM3) by recovering phylogenetic relationships among homologous protein sequences in both low- and high-gap settings. pLMs agree with conventional phylogenetic methods in general, but more so for protein families with fewer implied indels, highlighting indels as a key factor differentiating classical phylogenetics from pLMs. We find that pLMs preferentially capture broader as opposed to finer evolutionary relationships within a specific protein family, where ESM2 has a sweet spot for highly divergent sequences, at remote distance. Less than 10% of neurons are sufficient to broadly recapitulate classical phylogenetic distances; when used in isolation, the difference between the paradigms is further diminished. We show these neurons are polysemantic, shared among different homologous families but never fully overlapping. We highlight the potential of ESM2 as a complementary tool for phylogenetic analysis, especially when extending to remote homologs that are difficult to align and imply complex histories of insertions and deletions. Implementations of analyses are available at https://github.com/santule/pLMEvo.
Affiliation(s)
- Sanjana Tule
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia
- Gabriel Foley
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia
- Mikael Bodén
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD 4072, Australia
|
15
|
Greener JG, Jamali K. Fast protein structure searching using structure graph embeddings. Bioinformatics Advances 2024; 5:vbaf042. [PMID: 40196750 PMCID: PMC11974391 DOI: 10.1093/bioadv/vbaf042] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/07/2025] [Revised: 02/11/2025] [Accepted: 03/03/2025] [Indexed: 04/09/2025]
Abstract
Comparing and searching protein structures independent of primary sequence has proved useful for remote homology detection, function annotation, and protein classification. Fast and accurate methods to search with structures will be essential to make use of the vast databases that have recently become available, in the same way that fast protein sequence searching underpins much of bioinformatics. We train a simple graph neural network using supervised contrastive learning to learn a low-dimensional embedding of protein domains. AVAILABILITY AND IMPLEMENTATION: The method, called Progres, is available as software at https://github.com/greener-group/progres and as a web server at https://progres.mrc-lmb.cam.ac.uk. It has accuracy comparable to the best current methods and can search the AlphaFold database TED domains in a tenth of a second per query on CPU.
Affiliation(s)
- Joe G Greener
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, CB2 0QH, United Kingdom
- Kiarash Jamali
  - Medical Research Council Laboratory of Molecular Biology, Cambridge, CB2 0QH, United Kingdom
|