1
Liu Z, Qiu WR, Liu Y, Yan H, Pei W, Zhu YH, Qiu J. A comprehensive review of computational methods for Protein-DNA binding site prediction. Anal Biochem 2025; 703:115862. [PMID: 40209920 DOI: 10.1016/j.ab.2025.115862] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2024] [Revised: 03/20/2025] [Accepted: 04/06/2025] [Indexed: 04/12/2025]
Abstract
Accurately identifying protein-DNA binding sites is essential for understanding the molecular mechanisms underlying biological processes, which in turn facilitates advancements in drug discovery and design. While biochemical experiments provide the most accurate way to locate DNA-binding sites, they are generally time-consuming, resource-intensive, and expensive. There is a pressing need to develop computational methods that are both efficient and accurate for DNA-binding site prediction. This study thoroughly reviews and categorizes major computational approaches for predicting DNA-binding sites, including template detection, statistical machine learning, and deep learning-based methods. Fourteen state-of-the-art DNA-binding site prediction models are benchmarked on 136 non-redundant proteins; the deep learning-based methods, especially those built on pre-trained large language models, outperform the other two categories. Applications of these DNA-binding site prediction methods are also discussed.
Affiliation(s)
- Zi Liu
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Wang-Ren Qiu
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Yan Liu
  - Department of Computer Science, Yangzhou University, 196 Huayang West Road, Yangzhou, 225100, China
- He Yan
  - College of Information Science and Technology & Artificial Intelligence, Nanjing Forestry University, 159 Longpanlu Road, Nanjing, 210037, China
- Wenyi Pei
  - Geriatric Department, Shanghai Baoshan District Wusong Central Hospital, 101 Tongtai North Road, Shanghai, 200940, China
- Yi-Heng Zhu
  - College of Artificial Intelligence, Nanjing Agricultural University, 1 Weigang Road, Nanjing, 210095, China
- Jing Qiu
  - Information Department, The First Affiliated Hospital of Naval Medical University, 168 Changhai Road, Shanghai, 200433, China
2
Zheng W, Wuyun Q, Li Y, Liu Q, Zhou X, Peng C, Zhu Y, Freddolino L, Zhang Y. Deep-learning-based single-domain and multidomain protein structure prediction with D-I-TASSER. Nat Biotechnol 2025:10.1038/s41587-025-02654-4. [PMID: 40410405 DOI: 10.1038/s41587-025-02654-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2024] [Accepted: 03/26/2025] [Indexed: 05/25/2025]
Abstract
The dominant success of deep learning techniques on protein structure prediction has challenged the necessity and usefulness of traditional force field-based folding simulations. We propose a hybrid approach, deep-learning-based iterative threading assembly refinement (D-I-TASSER), which constructs atomic-level protein structural models by integrating multisource deep learning potentials with iterative threading fragment assembly simulations. D-I-TASSER introduces a domain splitting and assembly protocol for the automated modeling of large multidomain protein structures. Benchmark tests and the most recent Critical Assessment of Protein Structure Prediction (CASP15) experiments demonstrate that D-I-TASSER outperforms AlphaFold2 and AlphaFold3 on both single-domain and multidomain proteins. Large-scale folding experiments further show that D-I-TASSER could fold 81% of protein domains and 73% of full-chain sequences in the human proteome, with results highly complementary to recently released models by AlphaFold2. These results highlight a new avenue to integrate deep learning with classical physics-based folding simulations for high-accuracy protein structure and function predictions that are usable in genome-wide applications.
Affiliation(s)
- Wei Zheng
  - NITFID, School of Statistics and Data Science, AAIS, LPMC and KLMDASR, Nankai University, Tianjin, China
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Qiqige Wuyun
  - Department of Computer Science and Engineering, Michigan State University, East Lansing, MI, USA
- Yang Li
  - Cancer Science Institute of Singapore, National University of Singapore, Singapore, Singapore
- Quancheng Liu
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Xiaogen Zhou
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Chunxiang Peng
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
- Yiheng Zhu
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Lydia Freddolino
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, MI, USA
- Yang Zhang
  - Cancer Science Institute of Singapore, National University of Singapore, Singapore, Singapore
  - Department of Computer Science, School of Computing, National University of Singapore, Singapore, Singapore
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, Singapore, Singapore
3
Utgés JS, MacGowan SA, Barton GJ. LIGYSIS-web: a resource for the analysis of protein-ligand binding sites. Nucleic Acids Res 2025:gkaf411. [PMID: 40377089 DOI: 10.1093/nar/gkaf411] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/04/2025] [Revised: 04/17/2025] [Accepted: 05/06/2025] [Indexed: 05/18/2025] Open
Abstract
LIGYSIS-web is a free website accessible to all users without any login requirement for the analysis of protein-ligand binding sites. LIGYSIS-web hosts a database of 65,000 protein-ligand binding sites across 25,000 proteins. LIGYSIS sites are defined by aggregating unique relevant protein-ligand interfaces across different biological assemblies of the same protein deposited on the PDBe. Additionally, users can upload their own structures in PDB or mmCIF format for analysis and subsequent visualisation and download. Ligand sites are characterised using evolutionary divergence from a multiple sequence alignment, human missense genetic variation from gnomAD and relative solvent accessibility to obtain accessibility-based cluster labels and scores indicating likelihood of function. These results are displayed in the LIGYSIS web server, a Python Flask web application with a JavaScript frontend employing Jinja and jQuery to link the 3Dmol.js structure viewer with dynamic tables and Chart.js graphs in an interactive manner. LIGYSIS-web is available at https://www.compbio.dundee.ac.uk/ligysis/, whilst the source code for the analysis pipelines and web application can be accessed at https://github.com/bartongroup/LIGYSIS, https://github.com/bartongroup/LIGYSIS-custom and https://github.com/bartongroup/LIGYSIS-web, respectively.
Affiliation(s)
- Javier S Utgés
  - Division of Computational Biology, School of Life Sciences, University of Dundee, Dow Street, Dundee DD1 5EH Scotland, UK
- Stuart A MacGowan
  - Division of Computational Biology, School of Life Sciences, University of Dundee, Dow Street, Dundee DD1 5EH Scotland, UK
- Geoffrey J Barton
  - Division of Computational Biology, School of Life Sciences, University of Dundee, Dow Street, Dundee DD1 5EH Scotland, UK
4
Wang Y, Sun K, Li J, Guan X, Zhang O, Bagni D, Zhang Y, Carlson HA, Head-Gordon T. A workflow to create a high-quality protein-ligand binding dataset for training, validation, and prediction tasks. Digital Discovery 2025; 4:1209-1220. [PMID: 40190768 PMCID: PMC11967698 DOI: 10.1039/d4dd00357h] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2024] [Accepted: 03/25/2025] [Indexed: 04/09/2025]
Abstract
Development of scoring functions (SFs) used to predict protein-ligand binding energies requires high-quality 3D structures and binding assay data for training and testing their parameters. In this work, we show that one of the widely used datasets, PDBbind, suffers from several common structural artifacts of both proteins and ligands, which may compromise the accuracy, reliability, and generalizability of the resulting SFs. Therefore, we have developed a series of algorithms organized in a semi-automated workflow, HiQBind-WF, that curates non-covalent protein-ligand datasets to fix these problems. We also used this workflow to create an independent dataset, HiQBind, by matching binding free energies from various sources, including BioLiP, Binding MOAD, and BindingDB, with co-crystallized ligand-protein complexes from the PDB. The resulting HiQBind workflow and dataset are designed to ensure reproducibility and to minimize human intervention, while also being open-source to foster transparency in the improvements made to this important resource for the biology and drug discovery communities.
Affiliation(s)
- Yingze Wang
  - Kenneth S. Pitzer Theory Center and Department of Chemistry, University of California Berkeley CA 94720 USA
- Kunyang Sun
  - Kenneth S. Pitzer Theory Center and Department of Chemistry, University of California Berkeley CA 94720 USA
- Jie Li
  - Kenneth S. Pitzer Theory Center and Department of Chemistry, University of California Berkeley CA 94720 USA
- Xingyi Guan
  - Kenneth S. Pitzer Theory Center and Department of Chemistry, University of California Berkeley CA 94720 USA
- Oufan Zhang
  - Kenneth S. Pitzer Theory Center and Department of Chemistry, University of California Berkeley CA 94720 USA
- Dorian Bagni
  - Kenneth S. Pitzer Theory Center and Department of Chemistry, University of California Berkeley CA 94720 USA
- Yang Zhang
  - Department of Computer Science, School of Computing, National University of Singapore 117417 Singapore
  - Cancer Science Institute of Singapore, National University of Singapore 117599 Singapore
  - Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore 117596 Singapore
- Teresa Head-Gordon
  - Kenneth S. Pitzer Theory Center and Department of Chemistry, University of California Berkeley CA 94720 USA
  - Department of Bioengineering, University of California Berkeley CA 94720 USA
  - Departments of Bioengineering and Chemical and Biomolecular Engineering, University of California Berkeley CA 94720 USA
5
Foda MY, Al-Shun SA, Abdelkrim G, Salem ML, Salah NA, El-Khawaga OY. Bioinformatics approach reveals the modulatory role of JUN in atorvastatin-mediated anti-breast cancer effects. J Biomol Struct Dyn 2025:1-21. [PMID: 40351185 DOI: 10.1080/07391102.2025.2499950] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Accepted: 07/21/2024] [Indexed: 05/14/2025]
Abstract
Atorvastatin, a widely prescribed cholesterol-lowering drug, has recently shown potential anticancer effects. However, its influence on gene expression and its biological functions in cancer, in particular breast cancer, remain unclear. We aim to identify the dysregulated genes associated with atorvastatin treatment and the main players in their biological network. A total of 103 differentially expressed genes (DEGs) in the unified signature were identified, and the functional enrichment analysis suggested their relation to multiple cancer-related pathways. JUN was identified as the hub gene in the protein-protein interaction (PPI) network and was shown to be responsive to atorvastatin in breast cancer. Atorvastatin exhibited notable predicted cytotoxicity against breast cancer lines, with the activity positively correlated with JUN expression. JUN was significantly downregulated in breast cancer, and its expression was inversely correlated with cancer progression, whereas higher JUN expression was linked with better survival outcomes. Atorvastatin may directly interact with JUN protein, forming a more compact and stable conformation. These findings clarify the potential therapeutic mechanism of atorvastatin in breast cancer, possibly by fine-tuning of JUN expression. As such, JUN might serve as a valuable prognostic biomarker in breast cancer.
Affiliation(s)
- Mohamed Y Foda
  - Biochemistry Division, Chemistry Department, Faculty of Science, Mansoura University, Mansoura, Egypt
- Sara A Al-Shun
  - Biochemistry Division, Chemistry Department, Faculty of Science, Mansoura University, Mansoura, Egypt
- Guendouzi Abdelkrim
  - Laboratory of Chemistry, Synthesis, Properties and Applications (LCSPA), University of Saida, Saïda, Algeria
- Mohamed L Salem
  - Immunology and Biotechnology Unit, Department of Zoology, Faculty of Science, and Center of Excellence in Cancer Research, Tanta University, Tanta, Egypt
- Nevin A Salah
  - Biochemistry Division, Chemistry Department, Faculty of Science, Mansoura University, Mansoura, Egypt
- Omali Y El-Khawaga
  - Biochemistry Division, Chemistry Department, Faculty of Science, Mansoura University, Mansoura, Egypt
6
Chatterjee A, Ravandi B, Haddadi P, Philip NH, Abdelmessih M, Mowrey WR, Ricchiuto P, Liang Y, Ding W, Mobarec JC, Eliassi-Rad T. Topology-driven negative sampling enhances generalizability in protein-protein interaction prediction. Bioinformatics 2025; 41:btaf148. [PMID: 40193392 PMCID: PMC12080959 DOI: 10.1093/bioinformatics/btaf148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/29/2024] [Revised: 03/03/2025] [Accepted: 04/04/2025] [Indexed: 04/09/2025] Open
Abstract
Motivation: Unraveling the human interactome to uncover disease-specific patterns and discover drug targets hinges on accurate protein-protein interaction (PPI) predictions. However, challenges persist in machine learning (ML) models due to a scarcity of quality hard negative samples, shortcut learning, and limited generalizability to novel proteins. Results: In this study, we introduce a novel approach for strategic sampling of protein-protein noninteractions (PPNIs) by leveraging higher-order network characteristics that capture the inherent complementarity-driven mechanisms of PPIs. Next, we introduce Unsupervised Pre-training of Node Attributes tuned for PPI (UPNA-PPI), a high-throughput sequence-to-function ML pipeline that integrates unsupervised pre-training in protein representation learning with Topological PPNI (TPPNI) samples and is capable of efficiently screening billions of interactions. By using our TPPNI in training the UPNA-PPI model, we improve PPI prediction generalizability and interpretability, particularly in identifying potential binding site locations on amino acid sequences, strengthening the prioritization of screening assays and facilitating the transferability of ML predictions across protein families and homodimers. UPNA-PPI establishes the foundation for a fundamental negative sampling methodology in graph machine learning by integrating insights from network topology. Availability and implementation: Code and UPNA-PPI predictions are freely available at https://github.com/alxndgb/UPNA-PPI.
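As a rough illustration of what sampling non-interacting pairs from network topology can look like, the sketch below draws negatives from protein pairs at a fixed shortest-path distance in a toy PPI graph. The distance criterion, the function names, and the toy graph are assumptions made for illustration; the abstract does not specify which higher-order characteristics define TPPNI, so this should not be read as the authors' method.

```python
import random
from collections import deque
from itertools import combinations


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nbr in adj[node]:
            if nbr not in dist:
                dist[nbr] = dist[node] + 1
                queue.append(nbr)
    return dist


def sample_negatives(edges, distance, k, seed=0):
    """Sample up to k non-interacting pairs whose shortest-path distance
    equals `distance` (hypothetical criterion, chosen for illustration)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    all_dist = {n: shortest_path_lengths(adj, n) for n in adj}
    candidates = [
        (u, v)
        for u, v in combinations(sorted(adj), 2)
        if all_dist[u].get(v) == distance
    ]
    rng = random.Random(seed)
    rng.shuffle(candidates)
    return candidates[:k]


# Toy network (hypothetical): chain A-B-C-D with a branch C-E.
toy_edges = [("A", "B"), ("B", "C"), ("C", "D"), ("C", "E")]
negatives = sample_negatives(toy_edges, distance=2, k=10)
print(negatives)
```

Pairs at distance 2 share an interaction partner but do not interact themselves, which makes them harder negatives than pairs drawn uniformly at random; whether that matches the paper's notion of a quality negative is an open assumption here.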
Affiliation(s)
- Ayan Chatterjee
  - BioClarity AI, Boston, MA 02130, United States
  - Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
  - Network Science Institute, Northeastern University, Boston, MA 02115, United States
- Babak Ravandi
  - Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
  - Network Science Institute, Northeastern University, Boston, MA 02115, United States
  - Department of Physics, Northeastern University, Boston, MA 02115, United States
- Parham Haddadi
  - Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Naomi H Philip
  - Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Mario Abdelmessih
  - Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- William R Mowrey
  - Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Piero Ricchiuto
  - Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Yupu Liang
  - Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Wei Ding
  - Bioinformatics and Data Science, Alexion AstraZeneca Rare Disease, Boston, MA 02210, United States
- Juan Carlos Mobarec
  - Protein Structure and Biophysics, Discovery Sciences, R&D, AstraZeneca, Cambridge, UK
- Tina Eliassi-Rad
  - Network Science Institute, Northeastern University, Boston, MA 02115, United States
  - Khoury College of Computer Sciences, Northeastern University, Boston, MA CB2 0AA, United States
  - Santa Fe Institute, Santa Fe, NM 87501, United States
7
Wang Z, Nie T. ProCV: A 3D similarity grouping method for enhanced protein pocket recognition and ligand interaction analysis. iScience 2025; 28:112305. [PMID: 40264796 PMCID: PMC12013484 DOI: 10.1016/j.isci.2025.112305] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2025] [Revised: 02/11/2025] [Accepted: 03/24/2025] [Indexed: 04/24/2025] Open
Abstract
Efficient identification of protein binding pockets is critical for accurately predicting protein-ligand interactions. Traditional sequence-based methods often fail to capture structural complexity and require extensive conformational sampling, limiting both efficiency and accuracy. To overcome these challenges, we present ProCV, an innovative structure-based prediction method that utilizes advanced spatial recognition techniques, specifically 3D similarity grouping in the Hough space, to enhance precision and speed. ProCV employs uniform spatial sampling, KD-tree structures, and the 3D Hough transform for accurate binding pocket identification. Comparative analyses on datasets from the Protein Data Bank (PDB), scPDB, and BioLiP demonstrate that ProCV offers high specificity and sensitivity with reduced false positives. Its similarity assessment framework accurately characterizes the spatial arrangement of 3D protein structures, facilitating precise binding site localization. These findings highlight ProCV's robustness, precision, and flexibility in identifying binding residues at atomic resolution within 3D structures, affirming its value in structural bioinformatics for protein-ligand interaction studies.
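To make two of the ingredients named in the abstract more concrete, the sketch below samples a uniform grid around a set of atoms and scores each grid point by how many atoms lie within a cutoff. It is only an illustration under stated assumptions: the brute-force neighbour count stands in for a KD-tree query, the Hough-space grouping step is not attempted, and the function names, spacing, cutoffs, and toy coordinates are hypothetical rather than ProCV's.

```python
import math
from itertools import product


def uniform_grid(atoms, spacing):
    """Uniformly spaced points covering the bounding box of the atoms."""
    axes = []
    for dim in range(3):
        lo = min(a[dim] for a in atoms)
        hi = max(a[dim] for a in atoms)
        steps = int(math.floor((hi - lo) / spacing)) + 1
        axes.append([lo + i * spacing for i in range(steps)])
    return list(product(*axes))


def neighbour_count(point, atoms, cutoff):
    """Number of atoms within `cutoff` of `point` (brute force; a KD-tree
    would answer the same query faster on large structures)."""
    return sum(1 for a in atoms if math.dist(point, a) <= cutoff)


def pocket_candidates(atoms, spacing=1.0, cutoff=2.0, clash=1.0, min_neighbours=6):
    """Grid points that overlap no atom but are surrounded by many atoms."""
    picked = []
    for p in uniform_grid(atoms, spacing):
        if any(math.dist(p, a) < clash for a in atoms):
            continue  # point overlaps an atom, so it is not empty space
        if neighbour_count(p, atoms, cutoff) >= min_neighbours:
            picked.append(p)
    return picked


# Toy "protein" (hypothetical): atoms on the corners of a 2 x 2 x 2 cube.
toy_atoms = [(float(x), float(y), float(z)) for x, y, z in product((0, 2), repeat=3)]
print(pocket_candidates(toy_atoms))
```

On this toy input only the cube centre is empty and enclosed by enough atoms, so it is the single candidate returned; real structures would need tuned thresholds and a proper spatial index.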
Affiliation(s)
- Zhenhao Wang
  - School of Information and Control Engineering, Qingdao University of Technology, No.777 Jialingjiang East Road, West Coast New Area, Qingdao 266520, China
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO 65211-7310, USA
- Tingyuan Nie
  - School of Information and Control Engineering, Qingdao University of Technology, No.777 Jialingjiang East Road, West Coast New Area, Qingdao 266520, China
8
Śmiga M, Roszkiewicz E, Ślęzak P, Tracz M, Olczak T. cAMP-independent Crp homolog adds to the multi-layer regulatory network in Porphyromonas gingivalis. Front Cell Infect Microbiol 2025; 15:1535009. [PMID: 40308968 PMCID: PMC12040651 DOI: 10.3389/fcimb.2025.1535009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2024] [Accepted: 03/21/2025] [Indexed: 05/02/2025] Open
Abstract
Introduction: Porphyromonas gingivalis encodes three CRP/FNR superfamily proteins: HcpR, PgRsp, and CrpPg, with CrpPg similar to cAMP-sensing proteins but not classified into known families. This study investigates the role of CrpPg in regulating the expression of factors essential for P. gingivalis virulence in A7436 and ATCC 33277 strains. Methods: The role of CrpPg protein in P. gingivalis was determined using the ΔcrpPg mutant strains to characterize their phenotype and to assess the impact of crpPg inactivation on gene expression using RNA-seq and RT-qPCR. Additionally, the CrpPg protein was purified and characterized. Results: Key findings in the ΔcrpPg mutant strain include up-regulated mfa1-5 and rgpA genes and down-regulated trxA, soxR, and ustA genes. While crpPg inactivation does not affect growth in liquid culture media, it impairs biofilm formation and enhances adhesion to and invasion of gingival keratinocytes. CrpPg binds directly to its own and mfa promoters without interacting with cyclic nucleotides or di-nucleotides. Its three-dimensional structure, resembling E. coli Crp in complex with cAMP and DNA, suggests that CrpPg functions as a global regulator independently of cAMP binding. The highest crpPg expression in the early exponential growth phase declines as cell density and metabolic conditions change over time, suggesting a regulatory function depending on the CrpPg protein amount. Conclusions: By controlling the shift from planktonic to biofilm lifestyle, CrpPg may play a role in pathogenicity. Regulating the expression of virulence factors required for host cell invasion and intracellular replication, CrpPg may help P. gingivalis evade immune responses.
Affiliation(s)
- Michał Śmiga
  - Laboratory of Medical Biology, Faculty of Biotechnology, University of Wrocław, Wrocław, Poland
- Ewa Roszkiewicz
  - Laboratory of Medical Biology, Faculty of Biotechnology, University of Wrocław, Wrocław, Poland
- Paulina Ślęzak
  - Laboratory of Medical Biology, Faculty of Biotechnology, University of Wrocław, Wrocław, Poland
- Michał Tracz
  - Laboratory of Protein Mass Spectrometry, Faculty of Biotechnology, University of Wrocław, Wrocław, Poland
- Teresa Olczak
  - Laboratory of Medical Biology, Faculty of Biotechnology, University of Wrocław, Wrocław, Poland
9
Mitra P, Chatterjee S. In silico approach on structural and functional characterization of heat shock protein from Sulfobacillus acidophilus. J Appl Genet 2025:10.1007/s13353-025-00964-6. [PMID: 40232564 DOI: 10.1007/s13353-025-00964-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2024] [Revised: 02/18/2025] [Accepted: 03/20/2025] [Indexed: 04/16/2025]
Abstract
The 70 kDa heat shock proteins (Hsp70s) are highly conserved and ubiquitous molecular chaperones. Hsp70 proteins are intimately involved in different biological activities, including maintaining protein homeostasis and resisting environmental stress for survival. Eukaryotic Hsp70s with diverse functions are well characterized, but further investigation is needed for prokaryotes. For better understanding, the sequences of Sulfobacillus acidophilus were retrieved from UniProt. The retrieved stress proteins were renamed SaHsp70s and subjected to in silico analysis to identify their sequence and structural properties and functional attributes. The in silico characterization of these proteins revealed that they are acidic, mostly thermostable globular proteins with an NAD(P)-binding Rossmann fold. The molecular mass of the SaHsp70s ranged from 31.9 to 68.5 kDa, and the proteins were mainly localized in the cytoplasm. Phylogeny revealed the evolutionary distance and relationship among the retrieved proteins. Domain analysis showed that only SaHsp70-1, SaHsp70-3, and SaHsp70-14 have the actual conserved Hsp70 domain and share the same clade on the phylogenetic tree. The major part of each protein was rich in α-helix and random coil, which make it thermally stable and suitable for interacting with other proteins. The SAVES and ProSA servers support the reliability, stability, and consistency of the tertiary structures of the SaHsp70s. Functional analysis was done in terms of membrane protein topology, PPI network generation, active and proteolytic cleavage site prediction, and conserved motif and domain detection. CASTp predicted that Gly, Lys, Thr, Glu, Pro, Gln, Arg, and Val act as catalytic residues and are important for metal ion binding. Intramolecular interaction analysis suggested that Lys67, Thr12, Thr170, Gly168, Gly169, and Glu141 of the SaHsp70-14 protein could play a central role in various complex cellular functions such as stress mitigation, thermal stability, and related developmental processes.
Affiliation(s)
- Pritish Mitra
  - PG Department of Botany, Ramananda College, Bishnupur, Bankura, W.B, India
10
Xiong S, Cai J, Shi H, Cui F, Zhang Z, Wei L. UMPPI: Unveiling Multilevel Protein-Peptide Interaction Prediction via Language Models. J Chem Inf Model 2025; 65:3789-3799. [PMID: 40077987 DOI: 10.1021/acs.jcim.4c02365] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/14/2025]
Abstract
Protein-peptide interactions are essential to cellular processes and disease mechanisms. Identifying protein-peptide binding residues is critical for understanding peptide function and advancing drug discovery. However, experimental methods are costly and time-intensive, while existing computational approaches often predict interactions or binding residues separately, lack effective feature integration, or rely heavily on limited high-quality structural data. To address these challenges, we propose UMPPI (Unveiling Multilevel Protein-Peptide Interaction), a multiobjective framework based on the pretrained protein language model ESM2. UMPPI simultaneously predicts binary protein-peptide interactions and binding residues on both peptides and proteins through a multiobjective optimization strategy. By integrating ESM2 to encode sequences and extract latent structural information, UMPPI bridges the gap between sequence-based and structure-based methods. Extensive experiments demonstrated that UMPPI successfully captured binary interactions between peptides and proteins and identified the binding residues on peptides and proteins. UMPPI can serve as a useful tool for protein-peptide interaction prediction and identification of critical binding residues, thereby facilitating the peptide drug discovery process.
Affiliation(s)
- Shuwen Xiong
  - Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao 999078, China
- Jiajie Cai
  - School of Software, Shandong University, Jinan 250101, China
- Hua Shi
  - School of Optoelectronic and Communication Engineering, Xiamen University of Technology, Xiamen 361005, China
- Feifei Cui
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Zilong Zhang
  - School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Leyi Wei
  - Faculty of Applied Sciences, Macao Polytechnic University, R. de Luís Gonzaga Gomes, Macao 999078, China
  - School of Software, Shandong University, Jinan 250101, China
11
Fang A, Zhang Z, Zhou A, Zitnik M. ATOMICA: Learning Universal Representations of Intermolecular Interactions. bioRxiv 2025:2025.04.02.646906. [PMID: 40291688 PMCID: PMC12026499 DOI: 10.1101/2025.04.02.646906] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/30/2025]
Abstract
Molecular interactions underlie nearly all biological processes, but most machine learning models treat molecules in isolation or specialize in a single type of interaction, such as protein-ligand or protein-protein binding. This siloed approach prevents generalization across biomolecular classes and limits the ability to model interaction interfaces systematically. We introduce ATOMICA, a geometric deep learning model that learns atomic-scale representations of intermolecular interfaces across diverse biomolecular modalities, including small molecules, metal ions, amino acids, and nucleic acids. ATOMICA uses a self-supervised denoising and masking objective to train on 2,037,972 interaction complexes and generate hierarchical embeddings at the levels of atoms, chemical blocks, and molecular interfaces. The model generalizes across molecular classes and recovers shared physicochemical features without supervision. Its latent space captures compositional and chemical similarities across interaction types and follows scaling laws that improve representation quality with increasing biomolecular data modalities. We apply ATOMICA to construct five modality-specific interfaceome networks, termed ATOMICANets, which connect proteins based on interaction similarity with ions, small molecules, nucleic acids, lipids, and proteins. These networks identify disease pathways across 27 conditions and predict disease-associated proteins in autoimmune neuropathies and lymphoma. Finally, we use ATOMICA to annotate the dark proteome (proteins lacking known structure or function) by predicting 2,646 previously uncharacterized ligand-binding sites. These include putative zinc finger motifs and transmembrane cytochrome subunits, demonstrating that ATOMICA enables systematic annotation of molecular interactions across the proteome.
12
Asim MN, Ibrahim MA, Zaib A, Dengel A. DNA sequence analysis landscape: a comprehensive review of DNA sequence analysis task types, databases, datasets, word embedding methods, and language models. Front Med (Lausanne) 2025; 12:1503229. [PMID: 40265190 PMCID: PMC12011883 DOI: 10.3389/fmed.2025.1503229] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2024] [Accepted: 03/10/2025] [Indexed: 04/24/2025] Open
Abstract
Deoxyribonucleic acid (DNA) serves as the fundamental genetic blueprint that governs development, functioning, growth, and reproduction of all living organisms. DNA can be altered through germline and somatic mutations. Germline mutations underlie hereditary conditions, while somatic mutations can be induced by various factors, including environmental influences, chemicals, lifestyle choices, and errors in DNA replication and repair mechanisms, which can lead to cancer. DNA sequence analysis plays a pivotal role in uncovering the intricate information embedded within an organism's genetic blueprint and understanding the factors that can modify it. This analysis helps in early detection of genetic diseases and the design of targeted therapies. DNA sequence analysis through traditional wet-lab experimental methods is costly, time-consuming, and prone to errors. To accelerate large-scale DNA sequence analysis, researchers are developing AI applications that complement wet-lab experimental methods. These AI approaches can help generate hypotheses, prioritize experiments, and interpret results by identifying patterns in large genomic datasets. Effective integration of AI methods with experimental validation requires scientists to understand both fields. Considering the need for a comprehensive review that bridges the gap between both fields, the contributions of this paper are manifold. It presents a diverse range of DNA sequence analysis tasks and AI methodologies. It equips AI researchers with essential biological knowledge of 44 distinct DNA sequence analysis tasks and aligns these tasks with 3 distinct AI paradigms, namely classification, regression, and clustering. It streamlines the integration of AI into DNA sequence analysis tasks by consolidating information on 36 diverse biological databases that can be used to develop benchmark datasets for 44 different DNA sequence analysis tasks. To support performance comparisons between new and existing AI predictors, it provides insights into 140 benchmark datasets related to 44 distinct DNA sequence analysis tasks. It presents applications of word embeddings and language models across 44 distinct DNA sequence analysis tasks. It streamlines the development of new predictors by providing a comprehensive survey of the performance values of predictive pipelines based on 39 word embeddings and 67 language models, as well as top-performing traditional sequence encoding-based predictors and their performances across 44 DNA sequence analysis tasks.
Affiliation(s)
- Muhammad Nabeel Asim
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
- Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
| | - Muhammad Ali Ibrahim
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
| | - Arooj Zaib
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
| | - Andreas Dengel
- German Research Center for Artificial Intelligence GmbH, Kaiserslautern, Germany
- Intelligentx GmbH (intelligentx.com), Kaiserslautern, Germany
- Department of Computer Science, Technical University of Kaiserslautern, Kaiserslautern, Germany
| |
|
13
|
Tahmid MT, Hasan AKMM, Bayzid MS. TransBind allows precise detection of DNA-binding proteins and residues using language models and deep learning. Commun Biol 2025; 8:568. [PMID: 40185915 PMCID: PMC11971327 DOI: 10.1038/s42003-025-07534-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/08/2023] [Accepted: 01/13/2025] [Indexed: 04/07/2025] Open
Abstract
Identifying DNA-binding proteins and their binding residues is critical for understanding diverse biological processes, but conventional experimental approaches are slow and costly. Existing machine learning methods, while faster, often lack accuracy and struggle with data imbalance, relying heavily on evolutionary profiles like PSSMs and HMMs derived from multiple sequence alignments (MSAs). These dependencies make them unsuitable for orphan proteins or those that evolve rapidly. To address these challenges, we introduce TransBind, an alignment-free deep learning framework that predicts DNA-binding proteins and residues directly from a single primary sequence, eliminating the need for MSAs. By leveraging features from pre-trained protein language models, TransBind effectively handles the issue of data imbalance and achieves superior performance. Extensive evaluations using diverse experimental datasets and case studies demonstrate that TransBind significantly outperforms state-of-the-art methods in terms of both accuracy and computational efficiency. TransBind is available as a web server at https://trans-bind-web-server-frontend.vercel.app/ .
Affiliation(s)
- Md Toki Tahmid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - A K M Mehedi Hasan
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh
| | - Md Shamsuzzoha Bayzid
- Department of Computer Science and Engineering, Bangladesh University of Engineering and Technology, Dhaka, 1205, Bangladesh.
| |
|
14
|
Gheeraert A, Guyon F, Pérez S, Galochkina T. Unraveling the diversity of protein-carbohydrate interfaces: Insights from a multi-scale study. Carbohydr Res 2025; 550:109377. [PMID: 39823696 DOI: 10.1016/j.carres.2025.109377] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2024] [Revised: 12/18/2024] [Accepted: 01/08/2025] [Indexed: 01/20/2025]
Abstract
Protein-carbohydrate interactions play a crucial role in numerous fundamental biological processes. Thus, describing and comparing carbohydrate binding site (CBS) architecture is of great importance for understanding the underlying biological mechanisms. However, traditional approaches for carbohydrate-binding protein analysis and annotation rely primarily on sequence-based methods applied to specific protein classes. The recently released DIONYSUS database aims to fill this gap by providing tools for CBS comparison at different levels: both in terms of protein properties and classification, as well as in terms of atomistic CBS organization. In the current study, we explore DIONYSUS content using a combination of the suggested approaches in order to evaluate the diversity of the currently resolved non-covalent protein-carbohydrate interfaces at different scales. Notably, our analysis reveals evolutionary convergence of CBS in proteins with distinct folds and coming from organisms across different kingdoms of life. Furthermore, we demonstrate that a CBS structure-based approach has the potential to facilitate functional annotation for proteins with missing information in the existing databases. In particular, it provides reliable information for numerous carbohydrate-binding proteins from rapidly evolving organisms, whose analysis is particularly challenging for classical sequence-based methods.
Affiliation(s)
- Aria Gheeraert
- Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France
| | - Frédéric Guyon
- Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France
| | - Serge Pérez
- Centre de Recherches sur les Macromolécules Végétales, University Grenoble Alpes, CNRS, UPR 5301, Grenoble, France
| | - Tatiana Galochkina
- Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, F-75015 Paris, France.
| |
|
15
|
Meng L, Wei L, Wu R. MVGNN-PPIS: A novel multi-view graph neural network for protein-protein interaction sites prediction based on Alphafold3-predicted structures and transfer learning. Int J Biol Macromol 2025; 300:140096. [PMID: 39848362 DOI: 10.1016/j.ijbiomac.2025.140096] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2024] [Revised: 01/04/2025] [Accepted: 01/17/2025] [Indexed: 01/25/2025]
Abstract
Protein-protein interactions (PPI) are crucial for understanding numerous biological processes and pathogenic mechanisms. Identifying interaction sites is essential for biomedical research and targeted drug development. Compared to experimental methods, accurate computational approaches for protein-protein interaction sites (PPIS) prediction can save significant time and costs. In this study, we propose a novel model named MVGNN-PPIS. To the best of our knowledge, it is the first to combine predicted structures generated by AlphaFold3 with transfer learning techniques for predicting PPIS. This approach addresses the limitations of traditional methods that depend on native protein structures and multiple sequence alignments (MSA). Additionally, we introduce a multi-view graph framework based on two types of graph structures: the k-nearest neighbor graph and the adjacency matrix. By alternately employing a Graph Transformer and Graph Convolutional Networks (GCN) to aggregate node information, this framework effectively captures both local and global dependencies of each residue in the predicted structures, thereby significantly enhancing the model's sensitivity to binding sites. This framework further integrates direction, distance, and angular information between the 3D coordinates of side-chain atom centroids to construct a relative coordinate system, generating enhanced edge features that ensure the model's equivariance to molecular translations and rotations in space. During training, the Focal Loss function is employed to effectively address the class imbalance in the dataset. Experimental results demonstrate that MVGNN outperforms the current state-of-the-art methods across multiple PPIS benchmark datasets. To further validate the model's generalization capability, we extended MVGNN to the domain of predicting protein-nucleic acid interaction sites, where it also achieved superior performance.
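The abstract above mentions training with the Focal Loss function to handle class imbalance. As a rough illustration only, the following is a minimal sketch of the standard binary focal loss in plain Python. The function name `binary_focal_loss` and the default values of `gamma` and `alpha` are assumptions chosen for illustration; the abstract does not state which settings or implementation the authors used.

```python
import math


def binary_focal_loss(probs, labels, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss over residues.

    probs  : predicted probability that each residue is an interaction site
    labels : 1 for interaction-site residues, 0 otherwise
    gamma, alpha : illustrative defaults (assumed, not taken from the paper)
    """
    total = 0.0
    for p, y in zip(probs, labels):
        # Probability the model assigned to the true class of this residue.
        p_t = p if y == 1 else 1.0 - p
        # Class weight: alpha for positives, (1 - alpha) for negatives.
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # Clamp to avoid log(0).
        p_t = min(max(p_t, eps), 1.0)
        # The (1 - p_t)^gamma factor down-weights easy, well-classified residues.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


if __name__ == "__main__":
    # A confident correct prediction contributes far less than a confident wrong one.
    print(binary_focal_loss([0.9], [1]))
    print(binary_focal_loss([0.1], [1]))
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary binary cross-entropy, which is a convenient sanity check when experimenting with the parameters.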
Affiliation(s)
- Lu Meng
- College of Information Science and Engineering, Northeastern University, China.
| | - Lishuai Wei
- College of Information Science and Engineering, Northeastern University, China
| | - Rina Wu
- College of Information Science and Engineering, Northeastern University, China
| |
|
16
|
van der Weg K, Merdivan E, Piraud M, Gohlke H. TopEC: prediction of Enzyme Commission classes by 3D graph neural networks and localized 3D protein descriptor. Nat Commun 2025; 16:2737. [PMID: 40108108 PMCID: PMC11923149 DOI: 10.1038/s41467-025-57324-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/09/2024] [Accepted: 02/11/2025] [Indexed: 03/22/2025] Open
Abstract
Tools available for inferring enzyme function from general sequence, fold, or evolutionary information are generally successful. However, they can lead to misclassification if a deviation in local structural features influences the function. Here, we present TopEC, a 3D graph neural network based on a localized 3D descriptor to learn chemical reactions of enzymes from enzyme structures and predict Enzyme Commission (EC) classes. Using message-passing frameworks, we include distance and angle information to significantly improve the predictive performance for EC classification (F-score: 0.72) compared to regular 2D graph neural networks. We trained networks without fold bias that can classify enzyme structures for a vast functional space (>800 ECs). Our model is robust to uncertainties in binding site locations and similar functions in distinct binding sites. We observe that TopEC networks learn from an interplay between biochemical features and local shape-dependent features. TopEC is available as a repository on GitHub: https://github.com/IBG4-CBCLab/TopEC and https://doi.org/10.25838/d5p-66 .
Affiliation(s)
- Karel van der Weg
- Institute of Bio- and Geosciences (IBG-4: Bioinformatics), Forschungszentrum Jülich GmbH, 52425, Jülich, Germany
| | - Erinc Merdivan
- Helmholtz AI Central Unit, Ingolstädter Landstraße 1, 85764, Oberschleißheim, Germany
| | - Marie Piraud
- Helmholtz AI Central Unit, Ingolstädter Landstraße 1, 85764, Oberschleißheim, Germany
| | - Holger Gohlke
- Institute of Bio- and Geosciences (IBG-4: Bioinformatics), Forschungszentrum Jülich GmbH, 52425, Jülich, Germany.
- Institute for Pharmaceutical and Medicinal Chemistry, Heinrich Heine University Düsseldorf, 40225, Düsseldorf, Germany.
| |
|
17
|
Dai X, Henderson M, Yoo S, Liu Q. Predicting Metal-binding Proteins and Structures Through Integration of Evolutionary-scale and Physics-based Modeling. J Mol Biol 2025; 437:168962. [PMID: 39864615 DOI: 10.1016/j.jmb.2025.168962] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/05/2024] [Revised: 01/20/2025] [Accepted: 01/21/2025] [Indexed: 01/28/2025]
Abstract
Metals are essential elements in all living organisms, binding to approximately 50% of proteins. They serve to stabilize proteins, catalyze reactions, regulate activities, and fulfill various physiological and pathological functions. While there have been many advancements in determining the structures of protein-metal complexes, numerous metal-binding proteins still need to be identified through computational methods and validated through experiments. To address this need, we have developed the ESMBind workflow, which combines evolutionary scale modeling (ESM) for metal-binding prediction and physics-based protein-metal modeling. Our approach utilizes the ESM-2 and ESM-IF models to predict metal-binding probability at the residue level. In addition, we have designed a metal-placement method and energy minimization technique to generate detailed 3D structures of protein-metal complexes. Our workflow outperforms other models in terms of residue and 3D-level predictions. To demonstrate its effectiveness, we applied the workflow to 142 uncharacterized fungal pathogen proteins and predicted metal-binding proteins involved in fungal infection and virulence.
Affiliation(s)
- Xin Dai
- Computational Science Initiative, Brookhaven National Laboratory Upton NY USA.
| | - Max Henderson
- Department of Biochemistry and Cell Biology Stony Brook University Stony Brook NY USA
| | - Shinjae Yoo
- Computational Science Initiative, Brookhaven National Laboratory Upton NY USA
| | - Qun Liu
- Department of Biochemistry and Cell Biology Stony Brook University Stony Brook NY USA; Biology Department, Brookhaven National Laboratory, Upton NY USA.
| |
|
18
|
Erckert K, Birkeneder F, Rost B. bindNode24: Competitive binding residue prediction with 60 % smaller model. Comput Struct Biotechnol J 2025; 27:1060-1066. [PMID: 40165821 PMCID: PMC11957672 DOI: 10.1016/j.csbj.2025.02.042] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2024] [Revised: 02/26/2025] [Accepted: 02/27/2025] [Indexed: 04/02/2025] Open
Abstract
Many proteins function through ligand binding. Yet, reliable experimental binding data remains limited. Recent advances predict binding residues from sequences using protein Language Model embeddings. The AlphaFold Protein Structure Database, which has reliable 3D structure predictions from AlphaFold2, opens the way for graph neural networks that predict binding residues. Here, we introduce bindNode24, a new method using Graph Neural Networks to predict whether a residue binds to any of three ligand classes: small molecules, metal ions, and nucleic macromolecules. Compared to state-of-the-art, this approach reduces the number of free parameters by almost 60 % at similar performance. Our findings also suggest that secondary and tertiary structure features from AlphaFold2 are easy to integrate into protein function prediction tasks that previously solely relied on protein Language Model embeddings.
Affiliation(s)
- Kyra Erckert
- TUM School of Computation, Information and Technology, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, Garching, Munich 85748, Germany
- TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Boltzmannstr. 11, Garching 85748, Germany
| | - Franz Birkeneder
- TUM School of Computation, Information and Technology, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, Garching, Munich 85748, Germany
| | - Burkhard Rost
- TUM School of Computation, Information and Technology, Bioinformatics & Computational Biology - i12, Boltzmannstr. 3, Garching, Munich 85748, Germany
- Institute for Advanced Study (TUM-IAS), Lichtenbergstr. 2a, Garching, Munich 85748, Germany
- TUM School of Life Sciences Weihenstephan (TUM-WZW), Alte Akademie 8, Freising, Germany
| |
|
19
|
Huebert DNG, Ghorbani A, Lam SYB, Larijani M. Coevolution of Lentiviral Vif with Host A3F and A3G: Insights from Computational Modelling and Ancestral Sequence Reconstruction. Viruses 2025; 17:393. [PMID: 40143321 PMCID: PMC11946711 DOI: 10.3390/v17030393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/31/2024] [Revised: 03/03/2025] [Accepted: 03/05/2025] [Indexed: 03/28/2025] Open
Abstract
The evolutionary arms race between host restriction factors and viral antagonists provides crucial insights into immune system evolution and viral adaptation. This study investigates the structural and evolutionary dynamics of the double-domain restriction factors A3F and A3G and their viral inhibitor, Vif, across diverse primate species. By constructing 3D structural homology models and integrating ancestral sequence reconstruction (ASR), we identified patterns of sequence diversity, structural conservation, and functional adaptation. Inactive CD1 (Catalytic Domain 1) domains displayed greater sequence diversity and more positive surface charges than active CD2 domains, aiding nucleotide chain binding and intersegmental transfer. Despite variability, the CD2 DNA-binding grooves remained structurally consistent with conserved residues maintaining critical functions. A3F and A3G diverged in loop 7' interaction strategies, utilising distinct molecular interactions to facilitate their roles. Vif exhibited charge variation linked to host species, reflecting its coevolution with A3 proteins. These findings illuminate how structural adaptations and charge dynamics enable both restriction factors and their viral antagonists to adapt to selective pressures. Our results emphasize the importance of studying structural evolution in host-virus interactions, with implications for understanding immune defense mechanisms, zoonotic risks, and viral evolution. This work establishes a foundation for further exploration of restriction factor diversity and coevolution across species.
Affiliation(s)
- David Nicolas Giuseppe Huebert
- Immunology and Infectious Diseases Program, Division of Biomedical Sciences, Faculty of Medicine, Memorial University of Newfoundland, St. John’s, NL A1C 5S7, Canada
- Structural Biology and Immunology Program, Department of Molecular Biology and Biochemistry, Faculty of Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
| | - Atefeh Ghorbani
- Immunology and Infectious Diseases Program, Division of Biomedical Sciences, Faculty of Medicine, Memorial University of Newfoundland, St. John’s, NL A1C 5S7, Canada
| | - Shaw Yick Brian Lam
- Structural Biology and Immunology Program, Department of Molecular Biology and Biochemistry, Faculty of Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
| | - Mani Larijani
- Immunology and Infectious Diseases Program, Division of Biomedical Sciences, Faculty of Medicine, Memorial University of Newfoundland, St. John’s, NL A1C 5S7, Canada
- Structural Biology and Immunology Program, Department of Molecular Biology and Biochemistry, Faculty of Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
| |
|
20
|
Wang Y, Sun K, Li J, Guan X, Zhang O, Bagni D, Zhang Y, Carlson HA, Head-Gordon T. A Workflow to Create a High-Quality Protein-Ligand Binding Dataset for Training, Validation, and Prediction Tasks. ARXIV 2025:arXiv:2411.01223v2. [PMID: 40093369 PMCID: PMC11908357] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2025]
Abstract
Development of scoring functions (SFs) used to predict protein-ligand binding energies requires high-quality 3D structures and binding assay data for training and testing their parameters. In this work, we show that one of the widely used datasets, PDBbind, suffers from several common structural artifacts of both proteins and ligands, which may compromise the accuracy, reliability, and generalizability of the resulting SFs. Therefore, we have developed a series of algorithms organized in a semi-automated workflow, HiQBind-WF, that curates non-covalent protein-ligand datasets to fix these problems. We also used this workflow to create an independent data set, HiQBind, by matching binding free energies from various sources including BioLiP, Binding MOAD and BindingDB with co-crystallized ligand-protein complexes from the PDB. The resulting HiQBind workflow and dataset are designed to ensure reproducibility and to minimize human intervention, while also being open-source to foster transparency in the improvements made to this important resource for the biology and drug discovery communities.
Affiliation(s)
- Yingze Wang
- Kenneth S. Pitzer Theory Center and Department of Chemistry
| | - Kunyang Sun
- Kenneth S. Pitzer Theory Center and Department of Chemistry
| | - Jie Li
- Kenneth S. Pitzer Theory Center and Department of Chemistry
| | - Xingyi Guan
- Kenneth S. Pitzer Theory Center and Department of Chemistry
| | - Oufan Zhang
- Kenneth S. Pitzer Theory Center and Department of Chemistry
| | - Dorian Bagni
- Kenneth S. Pitzer Theory Center and Department of Chemistry
| | - Yang Zhang
- Department of Computer Science, School of Computing, National University of Singapore, 117417
- Cancer Science Institute of Singapore, National University of Singapore, 117599
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 117596, Singapore
| | - Heather A Carlson
- Odyssey Therapeutics Inc. 1350 Highland Dr., Ann Arbor, MI, 48108, USA
| | - Teresa Head-Gordon
- Kenneth S. Pitzer Theory Center and Department of Chemistry
- Department of Bioengineering
- Department of Chemical and Biomolecular Engineering, University of California, Berkeley, CA, 94720 USA
| |
|
21
|
Li Y, Tian Z, Nan X, Zhang S, Zhou Q, Lu S. HSSPPI: hierarchical and spatial-sequential modeling for PPIs prediction. Brief Bioinform 2025; 26:bbaf079. [PMID: 40037640 PMCID: PMC11879409 DOI: 10.1093/bib/bbaf079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2024] [Revised: 02/10/2025] [Accepted: 02/13/2025] [Indexed: 03/06/2025] Open
Abstract
MOTIVATION Protein-protein interactions play a fundamental role in biological systems. Accurate detection of protein-protein interaction sites (PPIs) remains a challenge, and prediction methods based on biological experiments are expensive. Recently, many computational methods have been developed and have made great progress. However, current computational methods focus on only one form of the protein, using either its spatial conformation or its primary sequence, and they ignore the protein's natural hierarchical structure. RESULTS In this study, we propose a novel network architecture, HSSPPI, which uses hierarchical and spatial-sequential modeling of proteins for PPIs prediction. In this network, we represent a protein as a hierarchical graph, in which a node in the protein is a residue (residue-level graph) and a node in the residue is an atom (atom-level graph). Moreover, we design a spatial-sequential block for capturing complex interaction relationships from the spatial and sequential forms of the protein. We evaluate HSSPPI on public benchmark datasets, and its predictions outperform those of the comparative models. This indicates the effectiveness of hierarchical protein modeling and also illustrates that HSSPPI has a strong feature extraction ability by considering spatial and sequential information simultaneously. AVAILABILITY AND IMPLEMENTATION The code of HSSPPI is available at https://github.com/biolushuai/Hierarchical-Spatial-Sequential-Modeling-of-Protein.
Affiliation(s)
- Yuguang Li
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, Henan, China
| | - Zhen Tian
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, Henan, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, Zhejiang, China
| | - Xiaofei Nan
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, Henan, China
| | - Shoutao Zhang
- School of Life Sciences, Zhengzhou University, Zhengzhou 450001, Henan, China
- Zhongyuan Intelligent Medical Laboratory, Zhengzhou 450001, Henan, China
| | - Qinglei Zhou
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, Henan, China
| | - Shuai Lu
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, Henan, China
- National Supercomputing Center in Zhengzhou, Zhengzhou University, Zhengzhou 450001, Henan, China
| |
|
22
|
Santos SJM, Valentini A. Brussonol and komaroviquinone as inhibitors of the SARS-CoV-2 Omicron BA.2 variant spike protein: A molecular docking, molecular dynamics, and quantum biochemistry approach. J Mol Graph Model 2025; 135:108914. [PMID: 39637552 DOI: 10.1016/j.jmgm.2024.108914] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2024] [Revised: 10/05/2024] [Accepted: 11/21/2024] [Indexed: 12/07/2024]
Abstract
Since late 2019, humanity has faced the challenges posed by the COVID-19 pandemic, caused by the SARS-CoV-2 virus. The continuous evolution of SARS-CoV-2 has led to the emergence of multiple Variants of Concern (VOCs) and Variants of Interest (VOIs), posing significant risks to global health. SARS-CoV-2 infects host cells via the angiotensin-converting enzyme 2 (ACE2) receptors, facilitated by the spike (S) protein. Icetexane diterpenes, including brussonol and komaroviquinone, exhibit notable anti-inflammatory, antibacterial, antiviral, antiproliferative, and anticancer properties. Recent research has explored their potential as inhibitors of the SARS-CoV-2 3CLpro protease, showing promising efficacy comparable to Nirmatrelvir. This study investigates brussonol and komaroviquinone as potential inhibitors of the SARS-CoV-2 Omicron BA.2 variant spike protein using molecular docking, molecular dynamics simulations, and quantum biochemistry approaches. The stability and interaction energies of brussonol, komaroviquinone, and mefloquine with the SARS-CoV-2 Omicron BA.2 variant spike protein were evaluated. RMSD analysis demonstrated that komaroviquinone and mefloquine maintain more stable binding poses with the spike protein compared to various NAGs and glycans. Electrostatic potential maps revealed significant interactions with ASN603, a critical residue for ligand binding efficacy. Furthermore, this study addresses a gap in current research, as no studies were found that simulate the trimer of the SARS-CoV-2 BA.2 variant spike protein. Most existing studies focus on the monomer and often exclude the NAGs and glycans. This research underscores the importance of maintaining the NAGs and glycans in the trimer simulations, providing a more accurate representation of the protein's structure and its interactions with ligands. The findings indicate that both komaroviquinone and brussonol exhibit higher binding affinities compared to mefloquine. This study provides valuable insights into the molecular interactions of these compounds, highlighting their potential for further development as antiviral agents against SARS-CoV-2.
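The RMSD analysis mentioned in this abstract compares atomic positions between simulation frames. As a hedged illustration, the sketch below computes the root-mean-square deviation between two equally sized coordinate sets in plain Python. It assumes the frames have already been superimposed onto a common reference (no fitting is performed here), and the function name `rmsd` and the toy coordinates are hypothetical, not taken from the study.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two lists of (x, y, z) positions.

    Assumes both lists describe the same atoms in the same order and that
    the two frames are already aligned; no superposition is done here.
    """
    if len(coords_a) != len(coords_b):
        raise ValueError("coordinate sets must contain the same number of atoms")
    squared = 0.0
    for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b):
        squared += (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
    return math.sqrt(squared / len(coords_a))


if __name__ == "__main__":
    # Toy two-atom "ligand": the second atom moves 2 units along z.
    reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
    later_frame = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
    print(rmsd(reference, later_frame))
```

Tracking this value per frame against the first frame gives the kind of stability curve that RMSD analyses typically report; lower and flatter curves indicate a more stable pose.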
Affiliation(s)
- Samuel J M Santos
- Federal Institute of Education, Science and Technology of Rio Grande Do Sul, 95770-000, Feliz, Rio Grande Do Sul, Brazil.
| | - Antoninho Valentini
- Department of Analytical Chemistry and Physical Chemistry, Federal University of Ceará, Campus of Pici, 60440-554, Fortaleza, Ceará, Brazil.
| |
|
23
|
Zhang Y, Huang C, Wang Y, Li S, Sun S. CL-GNN: Contrastive Learning and Graph Neural Network for Protein-Ligand Binding Affinity Prediction. J Chem Inf Model 2025; 65:1724-1735. [PMID: 39913849 DOI: 10.1021/acs.jcim.4c01290] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2025]
Abstract
In the realm of drug discovery and design, the accurate prediction of protein-ligand binding affinity is of paramount importance as it underpins the functional interactions within biological systems. This study introduces a novel self-supervised learning (SSL) framework that combines contrastive learning and graph neural networks (CL-GNN) for predicting protein-ligand binding affinities, which is a critical aspect of drug discovery. Traditional methods for affinity prediction are expensive and time-consuming, prompting the development of more efficient computational approaches. CL-GNN utilizes a contrastive learning strategy, a form of SSL, to learn from a large data set of 371 458 unique unlabeled protein-ligand complexes. By employing graph neural networks and molecular graph enhancement techniques, the model effectively captures protein-ligand interactions in a self-supervised manner. The fine-tuned model demonstrates competitive performance, achieving high Pearson's correlation coefficients and low root-mean-square errors on benchmark data sets. The proposed method outperforms existing machine learning models, showcasing its potential for accelerating the drug development process. The method effectively quantifies the similarity between protein-ligand complex representations learned in the pretraining and downstream testing phases through cosine similarity assessment. This approach not only revealed potential connections between complexes in their binding properties but also provided new insights into the understanding of drug mechanisms of action. In addition, the transparency of the model is significantly improved by visualizing the importance of key protein residues and ligand atoms. This visualization tool provides insight into the model's predictive decision-making process, providing key biological insights for drug design and optimization.
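The abstract above refers to three standard quantities: cosine similarity between learned representations, Pearson's correlation coefficient, and root-mean-square error on predicted affinities. The following minimal sketch shows how each is commonly computed, using only the Python standard library. The function names and the toy numbers are illustrative assumptions and do not reproduce the paper's code or data.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson_r(x, y):
    """Pearson correlation between predicted and measured values."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def rmse(x, y):
    """Root-mean-square error between predicted and measured values."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


if __name__ == "__main__":
    predicted = [5.1, 6.8, 7.9, 4.2]   # hypothetical predicted affinities
    measured = [5.0, 7.0, 8.0, 4.0]    # hypothetical experimental affinities
    print(pearson_r(predicted, measured), rmse(predicted, measured))
    print(cosine_similarity([1.0, 2.0, 0.5], [0.9, 2.1, 0.4]))
```

A high Pearson coefficient together with a low RMSE is the combination the abstract describes as competitive performance; the two are reported together because correlation alone does not penalize a constant offset in the predictions.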
Affiliation(s)
- Yunjiang Zhang
- Department of Chemical Engineering and Technology, College of Materials Science and Engineering, Beijing University of Technology, Beijing 100124, P. R. China
| | - Chenyu Huang
- Department of Chemical Engineering and Technology, College of Materials Science and Engineering, Beijing University of Technology, Beijing 100124, P. R. China
| | - Yaxin Wang
- Department of Chemical Engineering and Technology, College of Materials Science and Engineering, Beijing University of Technology, Beijing 100124, P. R. China
| | - Shuyuan Li
- Department of Chemical Engineering and Technology, College of Materials Science and Engineering, Beijing University of Technology, Beijing 100124, P. R. China
| | - Shaorui Sun
- Department of Chemical Engineering and Technology, College of Materials Science and Engineering, Beijing University of Technology, Beijing 100124, P. R. China
| |
|
24
|
Ma X, Li F, Chen Q, Gao S, Bai F. NesT-NABind: a Nested Transformer for Nucleic Acid-Binding Site Prediction on Protein Surface. J Chem Inf Model 2025; 65:1166-1177. [PMID: 39818834 DOI: 10.1021/acs.jcim.4c01765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2025]
Abstract
Protein-nucleic acid interactions play a crucial role in many physiological processes. Identifying the binding sites of nucleotides on the protein surface is the prerequisite for understanding the molecular recognition mechanisms between the two types of macromolecules and also provides the information to design or generate molecule modulators against these sites to manipulate biological function according to specific requirements. Existing studies mainly focus on characterizing local surfaces around sites, often neglecting the interrelationships among these sites and the global protein information. To address this gap, we propose NesT-NABind, a Nested Transformer for Nucleic Acid-Binding site prediction. This model leverages the Transformer's advanced capabilities in contextual understanding and long-range dependency capturing. Specifically, we introduce a local patch-scale Transformer to process surface information around each site and a global protein-scale transformer to integrate surface and sequence information on the entire protein. These two Transformers operate at different scales of protein, hence the term "nested". Experiments demonstrate that NesT-NABind achieves a 5.57% improvement in the F1 score and a 3.64% improvement in AUPRC compared to state-of-the-art methods. With the incorporation of global features, NesT-NABind shows an enhanced predictive capability for the challenging large proteins and therefore can be used in a much wider range of applications.
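The results in this abstract are reported as F1 score and AUPRC. For readers unfamiliar with these metrics, here is a minimal sketch of both in plain Python. AUPRC is approximated by average precision, which is one common estimator and is an assumption here; the abstract does not say how the authors computed it. The function names and example values are hypothetical.

```python
def f1_score(labels, preds):
    """F1 score for binary labels and binary predictions (1 = binding site)."""
    tp = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_precision(labels, scores):
    """Average precision: mean of the precision values at each true positive,
    taken in order of decreasing score. Ties keep their input order, which is
    a simplification."""
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


if __name__ == "__main__":
    labels = [1, 0, 1, 0]              # hypothetical per-site ground truth
    scores = [0.9, 0.8, 0.7, 0.1]      # hypothetical predicted probabilities
    preds = [1 if s >= 0.75 else 0 for s in scores]
    print(f1_score(labels, preds), average_precision(labels, scores))
```

F1 depends on the chosen decision threshold, while average precision summarizes ranking quality across all thresholds, which is why binding-site papers on imbalanced data tend to report both.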
Affiliation(s)
- Xinyue Ma
  - School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China
  - Shanghai Institute for Advanced Immunochemical Studies, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China
- Fenglei Li
  - School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China
  - Shanghai Institute for Advanced Immunochemical Studies, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China
  - Department of Computer Science, Aalto University, Konemiehentie 2, Espoo 02150, Finland
- Qianyu Chen
  - School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China
  - Shanghai Institute for Advanced Immunochemical Studies, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China
- Shenghua Gao
  - Department of Computer Science, The University of Hong Kong, Pokfulam Road, HKSAR, 999077, China
  - HKU Shanghai Intelligent Computing Research Center, Shanghai, 201210, China
- Fang Bai
  - School of Information Science and Technology, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China
  - Shanghai Institute for Advanced Immunochemical Studies, ShanghaiTech University, 393 Middle Huaxia Road, Pudong New Area, Shanghai 201210, China
  - School of Life Science and Technology, ShanghaiTech University, Pudong New Area, 393 Middle Huaxia Road, Shanghai 201210, China
  - Shanghai Clinical Research and Trial Center, No.1599 Keyuan Road, Pudong New Area, Shanghai 201210, China

25. Nunes-Alves AK, Abrahão JS, de Farias ST. Yaravirus brasiliense genomic structure analysis and its possible influence on the metabolism. Genet Mol Biol 2025; 48:e20240139. [PMID: 39918235 PMCID: PMC11803573 DOI: 10.1590/1678-4685-gmb-2024-0139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/08/2024] [Accepted: 12/11/2024] [Indexed: 02/11/2025] [Open Access]
Abstract
Here we analyze the Yaravirus brasiliense, an amoeba-infecting 80-nm-sized virus with a 45-kbp dsDNA, using structural molecular modeling. Almost all of its 74 genes were previously identified as ORFans. Considering its unprecedented genetic content, we analyzed Yaravirus genome to understand its genetic organization, its proteome, and how it interacts with its host. We reported possible functions for all Yaravirus proteins. Our results suggest the first ever report of a fragment proteome, in which the proteins are separated in modules and joined together at a protein level. Given the structural resemblance between some Yaravirus proteins and proteins related to tricarboxylic acid cycle (TCA), glyoxylate cycle, and the respiratory complexes, our work also allows us to hypothesize that these viral proteins could be modulating cell metabolism by upregulation. The presence of these TCA cycle-related enzymes specifically could be trying to overcome the cycle's control points, since they are strategic proteins that maintain malate and oxaloacetate levels. Therefore, we propose that Yaravirus proteins are redirecting energy and resources towards viral production, and avoiding TCA cycle control points, "unlocking" the cycle. Altogether, our data helped understand a previously almost completely unknown virus, and a little bit more of the incredible diversity of viruses.
Affiliation(s)
- Ana Karoline Nunes-Alves
  - Universidade Federal da Paraíba, Departamento de Biologia Molecular, Laboratório de Genética Evolutiva Paulo Leminski, João Pessoa, PB, Brazil
- Jônatas Santos Abrahão
  - Universidade Federal de Minas Gerais, Instituto de Ciências Biológicas, Departamento de Microbiologia, Laboratório de Vírus, Belo Horizonte, MG, Brazil
- Sávio Torres de Farias
  - Universidade Federal da Paraíba, Departamento de Biologia Molecular, Laboratório de Genética Evolutiva Paulo Leminski, João Pessoa, PB, Brazil
  - Network of Researchers on the Chemical Evolution of Life (NoRCEL), Leeds, United Kingdom

26. Rangra S, Aggarwal KK. Characterization and kinetics of a cathepsin B-inhibiting protein from Musa acuminata Colla peel. Biochimie 2025; 229:141-150. [PMID: 39461656 DOI: 10.1016/j.biochi.2024.10.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2024] [Revised: 10/23/2024] [Accepted: 10/24/2024] [Indexed: 10/29/2024]
Abstract
Hyperexpression of cathepsin B caused by an imbalance of endogenous inhibitors is involved in multiple pathologies, making it a key therapeutic target. Protease inhibitors are effective biomolecules that regulate protease activities and are considered potential therapeutic agents in various diseases. Plant protease inhibitors have been reported as effective complementary alternative drugs. A proteinaceous cathepsin B inhibitor (CBI-BP) has been isolated from Musa acuminata Colla (banana) peel with a molecular weight of 27.9 kDa on SDS-PAGE. The purity of the CBI-BP was confirmed on native-PAGE. The isolated CBI-BP showed an IC50 value of 8.14 μg and a Ki value of 10.59 μg (0.19 μM). Cathepsin B inhibition kinetics indicated that CBI-BP follows a mixed type of cathepsin B inhibition. Its inhibition activity was also confirmed by reverse zymography. The inhibitor was stable over pH 2.6-10.0, with maximum activity at pH 7.2, and over 25-100 °C, and it exhibited thermostability for 60 min at 70 °C. MALDI/TOF/MS analysis of CBI-BP showed 40% similarity to the GH18 domain-containing protein (A0A4S8JRM9) from Musa balbisiana. In-silico docking studies showed that binding of A0A4S8JRM9 to cathepsin B affects the binding energy of the substrate to cathepsin B, but A0A4S8JRM9 has not been reported to have any anti-cathepsin B activity. This suggests that the isolated CBI-BP might be a novel protein with anti-cathepsin B activity. Thus, the isolated CBI-BP may be further explored as a possible anti-cathepsin B drug.
Affiliation(s)
- Sabita Rangra
  - University School of Biotechnology, Guru Gobind Singh Indraprastha University, New Delhi-110078, India
- Kamal Krishan Aggarwal
  - University School of Biotechnology, Guru Gobind Singh Indraprastha University, New Delhi-110078, India

27. Wu J, Liu Y, Zhang Y, Wang X, Yan H, Zhu Y, Song J, Yu DJ. Identifying Protein-Nucleotide Binding Residues via Grouped Multi-task Learning and Pre-trained Protein Language Models. J Chem Inf Model 2025; 65:1040-1052. [PMID: 39788787 DOI: 10.1021/acs.jcim.4c02092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2025]
Abstract
The accurate identification of protein-nucleotide binding residues is crucial for protein function annotation and drug discovery. Numerous computational methods have been proposed to predict these binding residues, achieving remarkable performance. However, due to the limited availability and high variability of nucleotides, predicting binding residues for diverse nucleotides remains a significant challenge. To address this challenge, we propose NucGMTL, a new grouped deep multi-task learning approach designed for predicting binding residues of all observed nucleotides in the BioLiP database. NucGMTL leverages pre-trained protein language models to generate robust sequence embeddings and incorporates multi-scale learning along with scale-based self-attention mechanisms to capture a broader range of feature dependencies. To effectively harness the shared binding patterns across various nucleotides, deep multi-task learning is utilized to distill common representations, taking advantage of auxiliary information from similar nucleotides selected based on task grouping. Performance evaluation on benchmark data sets shows that NucGMTL achieves an average area under the Precision-Recall curve (AUPRC) of 0.594, surpassing other state-of-the-art methods. Further analyses highlight that the predominant advantage of NucGMTL lies in its effective integration of grouped multi-task learning and pre-trained protein language models. The data set and source code are freely accessible at: https://github.com/jerry1984Y/NucGMTL.
Affiliation(s)
- Jiashun Wu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Yan Liu
  - School of Information Engineering, Yangzhou University, Yangzhou 225100, China
- Ying Zhang
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Xiaoyu Wang
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- He Yan
  - College of Information Science and Technology & Artificial Intelligence, Nanjing Forestry University, Nanjing 210037, China
- Yiheng Zhu
  - College of Artificial Intelligence, Nanjing Agricultural University, Nanjing 210095, China
- Jiangning Song
  - Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
  - Monash Data Futures Institute, Monash University, Melbourne, VIC 3800, Australia
- Dong-Jun Yu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China

28. Dhyani K, Dash S, Joshi S, Garg A, Pal D, Nishant K, Muniyappa K. The ATPase activity of yeast chromosome axis protein Hop1 affects the frequency of meiotic crossovers. Nucleic Acids Res 2025; 53:gkae1264. [PMID: 39727188 PMCID: PMC11797056 DOI: 10.1093/nar/gkae1264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2024] [Revised: 12/05/2024] [Accepted: 12/10/2024] [Indexed: 12/28/2024] [Open Access]
Abstract
Saccharomyces cerevisiae meiosis-specific Hop1, a structural constituent of the synaptonemal complex, also facilitates the formation of programmed DNA double-strand breaks and the pairing of homologous chromosomes. Here, we reveal a serendipitous discovery that Hop1 possesses robust DNA-independent ATPase activity, although it lacks recognizable sequence motifs required for ATP binding and hydrolysis. By leveraging molecular docking combined with molecular dynamics simulations and biochemical assays, we identified an ensemble of five amino acid residues in Hop1 that could potentially participate in ATP-binding and hydrolysis. Consistent with this premise, we found that Hop1 binds to ATP and that substitution of amino acid residues in the putative ATP-binding site significantly impaired its ATPase activity, suggesting that this activity is intrinsic to Hop1. Notably, K65A and N67Q substitutions in the Hop1 N-terminal HORMA domain synergistically abolished its ATPase activity, noticeably impaired its DNA-binding affinity and reduced its association with meiotic chromosomes, while enhancing the frequency of meiotic crossovers (COs). Overall, our study establishes Hop1 as a DNA-independent ATPase and reveals a potential biological function for its ATPase activity in the regulation of meiotic CO frequency.
Affiliation(s)
- Kshitiza M Dhyani
  - Department of Biochemistry, Indian Institute of Science, CV Raman Road, Bengaluru 560012, India
- Suman Dash
  - School of Biology, Indian Institute of Science Education and Research, Maruthamala (PO), Vithura, Thiruvananthapuram 695551, India
- Sameer Joshi
  - School of Biology, Indian Institute of Science Education and Research, Maruthamala (PO), Vithura, Thiruvananthapuram 695551, India
- Aditi Garg
  - Computational and Data Sciences, Indian Institute of Science, CV Raman Road, Bengaluru 560012, India
- Debnath Pal
  - Computational and Data Sciences, Indian Institute of Science, CV Raman Road, Bengaluru 560012, India
- Koodali T Nishant
  - School of Biology, Indian Institute of Science Education and Research, Maruthamala (PO), Vithura, Thiruvananthapuram 695551, India
- Kalappa Muniyappa
  - Department of Biochemistry, Indian Institute of Science, CV Raman Road, Bengaluru 560012, India

29. Hao S, Li CY, Hu X, Feng Z, Zhang G, Yang C, Hu H. S-DCNN: prediction of ATP binding residues by deep convolutional neural network based on SMOTE. Front Genet 2025; 15:1513201. [PMID: 39834546 PMCID: PMC11744016 DOI: 10.3389/fgene.2024.1513201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2024] [Accepted: 12/11/2024] [Indexed: 01/22/2025] [Open Access]
Abstract
Background: The realization of many protein functions requires binding with ligands. As a significant protein-binding ligand, ATP plays a crucial role in various biological processes. Currently, the precise prediction of ATP binding residues remains challenging.
Methods: Based on the sequence information, this paper introduces a method called S-DCNN for predicting ATP binding residues, utilizing a deep convolutional neural network (DCNN) enhanced with the synthetic minority over-sampling technique (SMOTE).
Results: The incorporation of additional feature parameters such as dihedral angles, energy, and propensity factors into the standard parameter set resulted in a significant enhancement in prediction accuracy on the ATP-289 dataset. The S-DCNN achieved the highest Matthews correlation coefficient value of 0.5031 and an accuracy rate of 97.06% on an independent test set. Furthermore, when applied to the ATP-221 and ATP-388 datasets for validation, the S-DCNN outperformed existing methods on ATP-221 and performed comparably to other methods on ATP-388 during independent testing.
Conclusion: Our experimental results underscore the efficacy of the S-DCNN in accurately predicting ATP binding residues, establishing it as a potent tool in the prediction of ATP binding residues.
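The abstract reports both an accuracy (97.06%) and a Matthews correlation coefficient (0.5031). As a rough illustration of why both figures are usually quoted for residue-level binding prediction, where binding residues are a small minority, the sketch below computes the two metrics from a confusion matrix. It is not the authors' code: the function names and the toy labels are hypothetical and chosen only for illustration.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels, where 1 marks a binding residue."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(tp, tn, fp, fn):
    """Fraction of residues classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; defined here as 0.0 when the denominator is zero."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


# Hypothetical toy data: 100 residues, only 5 of which bind ATP.
y_true = [1] * 5 + [0] * 95
# A predictor that finds 2 of the 5 binding residues and raises 1 false alarm.
y_pred = [1, 1, 0, 0, 0] + [1] + [0] * 94

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print("accuracy:", accuracy(tp, tn, fp, fn))
print("MCC:", round(mcc(tp, tn, fp, fn), 3))

# A trivial predictor that never predicts binding still scores 95% accuracy,
# but its MCC is 0, which is why MCC is informative on imbalanced data.
print("all-negative accuracy:", accuracy(0, 95, 0, 5))
print("all-negative MCC:", mcc(0, 95, 0, 5))
```

On class-imbalanced data like this, accuracy is dominated by the majority (non-binding) class, so a high accuracy alongside a moderate MCC is a common pattern rather than a contradiction.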
Affiliation(s)
- Sixi Hao
  - College of Sciences, Inner Mongolia University of Technology, Hohhot, China
  - School of Mathematics and Statistics, Xinyang College, Xinyang, China
- Cai-Yan Li
  - School of Computer Science and Technology/Baotou Medical College, Baotou, China
- Xiuzhen Hu
  - College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Zhenxing Feng
  - College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Gaimei Zhang
  - Department of Obstetrics and Gynecology, Hohhot First Hospital, Hohhot, China
- Caiyun Yang
  - College of Sciences, Inner Mongolia University of Technology, Hohhot, China
- Huimin Hu
  - College of Sciences, Inner Mongolia University of Technology, Hohhot, China

30. Gheeraert A, Bailly T, Ren Y, Hamraoui A, Te J, Vander Meersche Y, Cretin G, Leon Foun Lin R, Gelly JC, Pérez S, Guyon F, Galochkina T. DIONYSUS: a database of protein-carbohydrate interfaces. Nucleic Acids Res 2025; 53:D387-D395. [PMID: 39436020 PMCID: PMC11701518 DOI: 10.1093/nar/gkae890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2024] [Revised: 09/03/2024] [Accepted: 09/26/2024] [Indexed: 10/23/2024] [Open Access]
Abstract
Protein-carbohydrate interactions govern a wide variety of biological processes and play an essential role in the development of different diseases. Here, we present DIONYSUS, the first database of protein-carbohydrate interfaces annotated according to structural, chemical and functional properties of both proteins and carbohydrates. We provide exhaustive information on the nature of interactions, binding site composition, biological function and specific additional information retrieved from existing databases. The user can easily search the database using protein sequence and structure information or by carbohydrate binding site properties. Moreover, for a given interaction site, the user can perform its comparison with a representative subset of non-covalent protein-carbohydrate interactions to retrieve information on its potential function or specificity. Therefore, DIONYSUS is a source of valuable information both for a deeper understanding of general protein-carbohydrate interaction patterns, for annotation of the previously unannotated proteins and for such applications as carbohydrate-based drug design. DIONYSUS is freely available at www.dsimb.inserm.fr/DIONYSUS/.
Affiliation(s)
- Aria Gheeraert
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, DSIMB, F-75015 Paris, France
- Thomas Bailly
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, DSIMB, F-75015 Paris, France
- Yani Ren
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, DSIMB, F-75015 Paris, France
  - Université Paris-Saclay, INRAE, MetaGenoPolis, 78350 Jouy-en-Josas, France
- Ali Hamraoui
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, DSIMB, F-75015 Paris, France
  - Institut de biologie de l’Ecole normale supérieure (IBENS), Ecole normale supérieure, CNRS, INSERM, PSL Universite Paris, 75005 Paris, France
- Julie Te
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, DSIMB, F-75015 Paris, France
- Yann Vander Meersche
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, DSIMB, F-75015 Paris, France
- Gabriel Cretin
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, DSIMB, F-75015 Paris, France
- Ravy Leon Foun Lin
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, DSIMB, F-75015 Paris, France
- Jean-Christophe Gelly
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, DSIMB, F-75015 Paris, France
- Serge Pérez
  - Centre de Recherches sur les Macromolécules Végétales, University Grenoble Alpes, CNRS, UPR, 5301 Grenoble, France
- Frédéric Guyon
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, DSIMB, F-75015 Paris, France
- Tatiana Galochkina
  - Université Paris Cité and Université des Antilles and Université de la Réunion, INSERM, BIGR, DSIMB, F-75015 Paris, France

31. Rodrigues CHM, Ascher DB. CSM-Potential2: A comprehensive deep learning platform for the analysis of protein interacting interfaces. Proteins 2025; 93:209-216. [PMID: 37870486 PMCID: PMC11623435 DOI: 10.1002/prot.26615] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 06/16/2023] [Revised: 10/04/2023] [Accepted: 10/05/2023] [Indexed: 10/24/2023]
Abstract
Proteins are molecular machinery that participate in virtually all essential biological functions within the cell, which are tightly related to their 3D structure. The importance of understanding the protein structure-function relationship is highlighted by the exponential growth of experimental structures, which has been greatly expanded by recent breakthroughs in protein structure prediction, most notably RoseTTAFold and AlphaFold2. These advances have prompted the development of several computational approaches that leverage these data sources to explore potential biological interactions. However, most methods are generally limited to analysis of single types of interactions, such as protein-protein or protein-ligand interactions, and their complexity limits their usability to expert users. Here we report CSM-Potential2, a deep learning platform for the analysis of binding interfaces on protein structures. In addition to prediction of protein-protein interaction binding sites and classification of biological ligands, our new platform incorporates prediction of interactions with nucleic acids at the residue level and allows for ligand transplantation based on sequence and structure similarity to experimentally determined structures. We anticipate our platform to be a valuable resource that provides easy access to a range of state-of-the-art methods to expert and non-expert users for the study of biological interactions. Our tool is freely available as an easy-to-use web server and API at https://biosig.lab.uq.edu.au/csm_potential.
Affiliation(s)
- Carlos H. M. Rodrigues
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, Queensland, Australia
- David B. Ascher
  - Computational Biology and Clinical Informatics, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia
  - School of Chemistry and Molecular Biosciences, University of Queensland, Brisbane, Queensland, Australia

32. Zhang J, Zhou F, Liang X, Kurgan L. Accurate Prediction of Protein-Binding Residues in Protein Sequences Using SCRIBER. Methods Mol Biol 2025; 2867:247-260. [PMID: 39576586 DOI: 10.1007/978-1-0716-4196-5_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2024]
Abstract
Deciphering molecular-level mechanisms that govern protein-protein interactions (PPIs) relies in part on the accurate prediction of protein-binding partners and protein-binding residues. These predictions can be used to support a wide spectrum of applications that include development of PPI networks and protein docking programs, drug design studies, and investigations of molecular details that underlie certain diseases. Computational methods that predict protein-binding residues offer convenient, inexpensive, and relatively accurate data that can aid these efforts. We introduce and describe a user-friendly webserver for the SCRIBER method that conveniently provides state-of-the-art predictions of protein-binding residues and that minimizes cross-predictions, i.e., incorrect prediction of residues that bind other/non-protein ligands as protein binding. SCRIBER relies on a two-layer architecture that is specifically designed to reduce the cross-predictions. We motivate and explain this predictive architecture. We describe how to use the webserver, interact with its web interface, and collect, read, and understand results generated by SCRIBER. The SCRIBER webserver is available at http://biomine.cs.vcu.edu/servers/SCRIBER/.
Affiliation(s)
- Jian Zhang
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang, China
- Feng Zhou
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang, China
- Xingchen Liang
  - School of Computer and Information Technology, Xinyang Normal University, Xinyang, China
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA

33. Zhao B, Basu S, Kurgan L. DescribePROT Database of Residue-Level Protein Structure and Function Annotations. Methods Mol Biol 2025; 2867:169-184. [PMID: 39576581 DOI: 10.1007/978-1-0716-4196-5_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2024]
Abstract
DescribePROT is a freely available online database of structural and functional descriptors of proteins at the amino acid level. It provides access to 13 diverse descriptors that include sequence conservation, putative secondary structure, solvent accessibility, intrinsic disorder, and signal peptides, and putative annotations of residues that interact with proteins, peptides and nucleic acids. These data can be used to elucidate protein functions, to support efforts to develop therapeutics, and to develop and evaluate future predictors of protein structure and function. DescribePROT includes 7.8 billion predictions for 1.4 million proteins from 83 complete proteomes of popular model organisms. This information can be downloaded at multiple levels of scope (entire database, specific organisms, and individual proteins) and can be interacted with using a graphical interface that simultaneously displays data on multiple descriptors. We describe the contents of this resource, provide directions on how to use its interface, and offer instructions on how to obtain and interact with the underlying data. Moreover, we briefly discuss plans for a future expansion of this database. DescribePROT is available at http://biomine.cs.vcu.edu/servers/DESCRIBEPROT/.
Affiliation(s)
- Bi Zhao
  - Genomics program, College of Public Health, University of South Florida, Tampa, FL, USA
- Sushmita Basu
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, VA, USA

34. Gupta N, Yadav M, Ali W, Singh G, Chaudhary S, Grover S, Chandra S, Rathore JS. Genomic profiling and molecular dynamics analysis of parDEPa toxin-antitoxin homologs targeting DNA gyrase in Pseudomonas aeruginosa: insights from computational investigations. J Biomol Struct Dyn 2025:1-17. [PMID: 39743786 DOI: 10.1080/07391102.2024.2446675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2024] [Accepted: 12/16/2024] [Indexed: 01/04/2025]
Abstract
In the realm of hospital-acquired and chronic infections, Pseudomonas aeruginosa stands out, demonstrating significant associations with increased morbidity, mortality, and antibiotic resistance. Antibiotic-resistant strains are believed to contribute to thousands of deaths each year. Chronic and latent infections are associated with the bacterial toxin-antitoxin (TA) system, although the mechanisms involved are poorly understood. This study focuses on a novel type II TA system, parDEPa, identified in the genome of P. aeruginosa ATCC 27853. We explored its structural features, functional relationships, and genetic configurations. Our research identified parDEPa homologs in P. aeruginosa, clarified their interactions, and highlighted connections to essential cellular metabolic processes. Notably, homologs of the ParDPa antitoxin were found to be more conserved than the ParEPa toxin. Structural models of the ParEPa toxin and ParDPa antitoxin confirmed their integrity. Through docking and molecular dynamics simulations, we showed that the ParEPa toxin binds to DNA gyrase, inhibiting replication. The stability of the ParDPa-ParEPa complex is primarily driven by hydrophobic interactions (-1763.2 kcal/mol), while the ParEPa-GyrAPa interaction is sustained by strong electrostatic forces (-1294.9 kcal/mol). The RMSD scores indicated greater stability for the ParDPa-ParEPa complex (1.11 Å) than the ParEPa-GyrAPa complex (1.16 Å). RMSF analysis identified key residues involved in the ParDPa-ParEPa complex (Leu59, Gly60, Arg115, Asn116, Arg117) and the ParEPa-GyrAPa complex (Pro48, Gln49, Ser55, Asp94, Gln95). These findings significantly enhance our understanding of the structural and metabolic roles of the chromosomally encoded parDEPa TA module in P. aeruginosa.
Affiliation(s)
- Nomita Gupta
  - School of Biotechnology, Gautam Buddha University, Greater Noida, Uttar Pradesh, India
- Mohit Yadav
  - School of Biotechnology, Gautam Buddha University, Greater Noida, Uttar Pradesh, India
  - Department of Biomedical Engineering, City University of Hong Kong, Kowloon, Hong Kong SAR
- Waseem Ali
  - Department of Molecular Medicine, Jamia Hamdard, New Delhi, India
- Garima Singh
  - School of Biotechnology, Gautam Buddha University, Greater Noida, Uttar Pradesh, India
- Shobhi Chaudhary
  - School of Biotechnology, Gautam Buddha University, Greater Noida, Uttar Pradesh, India
- Sonam Grover
  - Department of Molecular Medicine, Jamia Hamdard, New Delhi, India
- Subhash Chandra
  - Computational Biology & Biotechnology Laboratory, Department of Botany, Soban Singh Jeena University, Almora, Uttarakhand, India

35. Essien C, Wang N, Yu Y, Alqarghuli S, Qin Y, Manshour N, He F, Xu D. Predicting the location of coordinated metal ion-ligand binding sites using geometry-aware graph neural networks. Comput Struct Biotechnol J 2024; 27:137-148. [PMID: 39840139 PMCID: PMC11750443 DOI: 10.1016/j.csbj.2024.12.016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 12/15/2024] [Accepted: 12/20/2024] [Indexed: 01/23/2025] [Open Access]
Abstract
More than 50 % of proteins bind to metal ions. Interactions between metal ions and proteins, especially coordinated interactions, are essential for biological functions, such as maintaining protein structure and signal transport. Physiological metal-ion binding prediction is pivotal for both elucidating the biological functions of proteins and for the design of new drugs. However, accurately predicting these interactions remains challenging. In this study, we proposed GPred, a novel structure-based method that transforms the 3-dimensional structure of a protein into a point cloud representation and then designs a geometry-aware graph neural network to learn the local structural properties of each amino acid residue under specific ligand-binding supervision. We trained our model to predict the location of coordinated binding sites for five essential metal ions: Zn2+, Ca2+, Mg2+, Mn2+, and Fe2+. We further demonstrated the versatility of GPred by applying transfer learning to predict the binding sites of 2 heavy metal ions, that is, cadmium (Cd2+) and mercury (Hg2+). We achieved greater than 19.62 %, 14.32 %, 36.62 %, and 40.69 % improvement in the area under the precision-recall curve (AUPR) of Zn2+, Ca2+, Mg2+, Mn2+, and Fe2+, respectively, when compared with 6 current accessible state-of-the-art sequence-based or structure-based tools. We also validated the proposed approach on protein structures predicted by AlphaFold2, and its performance was similar to experimental protein structures. In both cases, achieving a low false discovery rate for proteins without annotated ion-binding sites was demonstrated.
Affiliation(s)
- Clement Essien
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Ning Wang
  - School of Information Science and Technology, Northeast Normal University, Changchun, Jilin, China
- Yang Yu
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Salhuldin Alqarghuli
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Yongfang Qin
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Negin Manshour
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
- Fei He
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
  - School of Information Science and Technology, Northeast Normal University, Changchun, Jilin, China
- Dong Xu
  - Department of Electrical Engineering and Computer Science, Bond Life Sciences Center, University of Missouri, Columbia, MO, USA
|
36
|
Yang LY, Ping K, Luo Y, McShan AC. BioDolphin as a comprehensive database of lipid-protein binding interactions. Commun Chem 2024; 7:288. [PMID: 39633021 PMCID: PMC11618342 DOI: 10.1038/s42004-024-01384-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2024] [Accepted: 11/28/2024] [Indexed: 12/07/2024] Open
Abstract
Lipid-protein interactions are crucial for virtually all biological processes in living cells. However, existing structural databases focusing on these interactions are limited to integral membrane proteins. A systematic understanding of diverse lipid-protein interactions also encompassing lipid-anchored, peripheral membrane and soluble lipid binding proteins remains to be elucidated. To address this gap and facilitate the research of universal lipid-protein assemblies, we developed BioDolphin, a curated database with over 127,000 lipid-protein interactions. BioDolphin provides comprehensive annotations, including protein functions, protein families, lipid classifications, lipid-protein binding affinities, membrane association type, and atomic structures. Accessible via a publicly available web server (www.biodolphin.chemistry.gatech.edu), users can efficiently search for lipid-protein interactions using a wide range of options and download datasets of interest. Additionally, BioDolphin features interactive 3D visualization of each lipid-protein complex, facilitating the exploration of structure-function relationships. BioDolphin also includes detailed information on atomic-level intermolecular interactions between lipids and proteins that enable large-scale analysis of both paired complexes and larger assemblies. As an open-source resource, BioDolphin enables global analysis of lipid-protein interactions and supports data-driven approaches for developing predictive machine learning algorithms for lipid-protein binding affinity and structures.
Affiliation(s)
- Li-Yen Yang
- School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, GA, 30332, USA
| | - Kaike Ping
- Department of Computer Science, Virginia Tech, Blacksburg, VA, 24061, USA
| | - Yunan Luo
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
| | - Andrew C McShan
- School of Chemistry and Biochemistry, Georgia Institute of Technology, Atlanta, GA, 30332, USA.
|
37
|
Basu S, Yu J, Kihara D, Kurgan L. Twenty years of advances in prediction of nucleic acid-binding residues in protein sequences. Brief Bioinform 2024; 26:bbaf016. [PMID: 39833102 PMCID: PMC11745544 DOI: 10.1093/bib/bbaf016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2024] [Revised: 12/24/2024] [Accepted: 01/06/2025] [Indexed: 01/22/2025] Open
Abstract
Computational prediction of nucleic acid-binding residues in protein sequences is an active field of research, with over 80 methods that were released in the past 2 decades. We identify and discuss 87 sequence-based predictors that include dozens of recently published methods that are surveyed for the first time. We overview historical progress and examine multiple practical issues that include availability and impact of predictors, key features of their predictive models, and important aspects related to their training and assessment. We observe that the past decade has brought increased use of deep neural networks and protein language models, which contributed to substantial gains in the predictive performance. We also highlight advancements in vital and challenging issues that include cross-predictions between deoxyribonucleic acid (DNA)-binding and ribonucleic acid (RNA)-binding residues and targeting the two distinct sources of binding annotations, structure-based versus intrinsic disorder-based. The methods trained on the structure-annotated interactions tend to perform poorly on the disorder-annotated binding and vice versa, with only a few methods that target and perform well across both annotation types. The cross-predictions are a significant problem, with some predictors of DNA-binding or RNA-binding residues indiscriminately predicting interactions with both nucleic acid types. Moreover, we show that methods with web servers are cited substantially more than tools without implementation or with no longer working implementations, motivating the development and long-term maintenance of the web servers. We close by discussing future research directions that aim to drive further progress in this area.
Affiliation(s)
- Sushmita Basu
- Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
| | - Jing Yu
- Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
| | - Daisuke Kihara
- Department of Biological Sciences, Purdue University, 915 Mitch Daniels Boulevard, West Lafayette, IN 47907, United States
- Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907, United States
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
|
38
|
Zhang J, Basu S, Zhang F, Kurgan L. MERIT: Accurate Prediction of Multi Ligand-binding Residues with Hybrid Deep Transformer Network, Evolutionary Couplings and Transfer Learning. J Mol Biol 2024:168872. [PMID: 40133785 DOI: 10.1016/j.jmb.2024.168872] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2024] [Revised: 10/30/2024] [Accepted: 11/15/2024] [Indexed: 03/27/2025]
Abstract
Multi-ligand binding residues (MLBRs) are amino acids in protein sequences that interact with multiple different ligands that include proteins, peptides, nucleic acids, and a variety of small molecules. MLBRs are implicated in a number of cellular functions and targeted in the context of multiple human diseases. There are many sequence-based predictors of residues that interact with specific ligand types and they can be collectively used to identify MLBRs. However, there are no methods that directly predict MLBRs. To this end, we conceptualize, design, evaluate and release MERIT (Multi-binding rEsidues pRedIcTor). This tool relies on a custom-crafted deep neural network that implements a number of innovative features, such as a multi-layered/step architecture with transformer modules that we train using a custom-designed loss function, computation of evolutionary couplings, and application of transfer learning. These innovations boost predictive performance, which we demonstrate using an ablation analysis. In particular, they reduce the number of cross-predictions, defined as residues that interact with a single ligand type that are incorrectly predicted as MLBRs. We compare MERIT against a representative selection of current and popular ligand-specific predictors, meta-predictors that combine their results to identify MLBRs, and a baseline regression-based predictor. These tests reveal that MERIT provides accurate predictions and statistically outperforms these alternatives. Moreover, using two test datasets, one with MLBRs and another with only the single ligand binding residues, we show that MERIT consistently produces relatively low false positive rates, including low rates of cross-predictions. The web server and datasets from this study are freely available at http://biomine.cs.vcu.edu/servers/MERIT/.
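The cross-prediction notion defined in this abstract can be made concrete with a short sketch. It assumes each residue is annotated with the number of distinct ligand types it binds and carries a binary MLBR prediction; the function name, the data layout, and the choice to normalize by the number of single-ligand residues are illustrative assumptions, not details taken from MERIT.

```python
def cross_prediction_rate(ligand_type_counts, predicted_mlbr):
    """Share of single-ligand residues that are wrongly predicted as MLBRs.

    ligand_type_counts: per-residue number of distinct ligand types bound
    predicted_mlbr:     per-residue boolean MLBR prediction
    Normalizing by the number of single-ligand residues is an assumption.
    """
    single = [pred for count, pred in zip(ligand_type_counts, predicted_mlbr)
              if count == 1]
    if not single:
        return 0.0
    # True counts as 1, so the sum is the number of cross-predictions.
    return sum(single) / len(single)


# Hypothetical toy annotation: four residues bind exactly one ligand type.
counts = [0, 1, 1, 2, 3, 1, 0, 1]
predictions = [False, True, False, True, True, False, False, False]
rate = cross_prediction_rate(counts, predictions)
```

In this toy case one of the four single-ligand residues is flagged as an MLBR, so a quarter of them are cross-predicted.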
Affiliation(s)
- Jian Zhang
- School of Computer and Information Technology, Xinyang Normal University, Xinyang 464000, China; Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China.
| | - Sushmita Basu
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| | - Fuhao Zhang
- College of Information Engineering, Northwest A&F University, Yangling, Shaanxi 712100, China
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA.
|
39
|
Liu YC, Lin YJ, Chang YY, Chuang CC, Ou YY. Deciphering the Language of Protein-DNA Interactions: A Deep Learning Approach Combining Contextual Embeddings and Multi-Scale Sequence Modeling. J Mol Biol 2024; 436:168769. [PMID: 39214282 DOI: 10.1016/j.jmb.2024.168769] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/10/2024] [Revised: 08/01/2024] [Accepted: 08/26/2024] [Indexed: 09/04/2024]
Abstract
Deciphering the mechanisms governing protein-DNA interactions is crucial for understanding key cellular processes and disease pathways. In this work, we present a powerful deep learning approach that significantly advances the computational prediction of DNA-interacting residues from protein sequences. Our method leverages the rich contextual representations learned by pre-trained protein language models, such as ProtTrans, to capture intrinsic biochemical properties and sequence motifs indicative of DNA binding sites. We then integrate these contextual embeddings with a multi-window convolutional neural network architecture, which scans across the sequence at varying window sizes to effectively identify both local and global binding patterns. Comprehensive evaluation on curated benchmark datasets demonstrates the remarkable performance of our approach, achieving an area under the ROC curve (AUC) of 0.89 - a substantial improvement over previous state-of-the-art sequence-based predictors. This showcases the immense potential of pairing advanced representation learning and deep neural network designs for uncovering the complex syntax governing protein-DNA interactions directly from primary sequences. Our work not only provides a robust computational tool for characterizing DNA-binding mechanisms, but also highlights the transformative opportunities at the intersection of language modeling, deep learning, and protein sequence analysis. The publicly available code and data further facilitate broader adoption and continued development of these techniques for accelerating mechanistic insights into vital biological processes and disease pathways. In addition, the code and data for this work are available at https://github.com/B1607/DIRP.
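The multi-window scanning described in this abstract can be illustrated with a dependency-free sketch. It assumes per-residue embeddings are already available as lists of floats, and it uses a uniform averaging kernel as a stand-in for learned convolution filters; the function names and window sizes are hypothetical and are not taken from the DIRP code.

```python
def window_features(embeddings, window):
    """Average each residue's embedding with its neighbours in an odd-sized window."""
    half = window // 2
    length, dim = len(embeddings), len(embeddings[0])
    out = []
    for i in range(length):
        acc = [0.0] * dim
        for j in range(i - half, i + half + 1):
            if 0 <= j < length:  # positions outside the sequence act as zero padding
                for k in range(dim):
                    acc[k] += embeddings[j][k]
        out.append([a / window for a in acc])
    return out


def multi_window_features(embeddings, windows=(3, 5, 7)):
    """Concatenate, per residue, the features obtained at every window size."""
    per_window = [window_features(embeddings, w) for w in windows]
    return [sum((feats[i] for feats in per_window), [])
            for i in range(len(embeddings))]


# Toy one-dimensional "embeddings" for a five-residue sequence.
embeddings = [[1.0], [2.0], [3.0], [4.0], [5.0]]
features = multi_window_features(embeddings, windows=(1, 3))
```

Small windows keep the signal local, whereas larger windows blend in more distant residues; a trained model would learn the kernel weights instead of averaging.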
Affiliation(s)
- Yu-Chen Liu
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li 32003, Taiwan
| | - Yi-Jing Lin
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li 32003, Taiwan
| | - Yan-Yun Chang
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li 32003, Taiwan
| | - Cheng-Che Chuang
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li 32003, Taiwan
| | - Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li 32003, Taiwan; Graduate Program in Biomedical Informatics, Yuan Ze University, Chung-Li 32003, Taiwan.
|
40
|
Utgés JS, Barton GJ. Comparative evaluation of methods for the prediction of protein-ligand binding sites. J Cheminform 2024; 16:126. [PMID: 39529176 PMCID: PMC11552181 DOI: 10.1186/s13321-024-00923-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2024] [Accepted: 10/28/2024] [Indexed: 11/16/2024] Open
Abstract
The accurate identification of protein-ligand binding sites is of critical importance in understanding and modulating protein function. Accordingly, ligand binding site prediction has remained a research focus for over three decades with over 50 methods developed and a change of paradigm from geometry-based to machine learning. In this work, we collate 13 ligand binding site predictors, spanning 30 years, focusing on the latest machine learning-based methods such as VN-EGNN, IF-SitePred, GrASP, PUResNet, and DeepPocket and compare them to the established P2Rank, PRANK and fpocket and earlier methods like PocketFinder, Ligsite and Surfnet. We benchmark the methods against the human subset of our new curated reference dataset, LIGYSIS. LIGYSIS is a comprehensive protein-ligand complex dataset comprising 30,000 proteins with bound ligands which aggregates biologically relevant unique protein-ligand interfaces across biological units of multiple structures from the same protein. LIGYSIS is an improvement for testing methods over earlier datasets like sc-PDB, PDBbind, binding MOAD, COACH420 and HOLO4K which either include 1:1 protein-ligand complexes or consider asymmetric units. Re-scoring of fpocket predictions by PRANK and DeepPocket displays the highest recall (60%) whilst IF-SitePred presents the lowest recall (39%). We demonstrate the detrimental effect that redundant prediction of binding sites has on performance as well as the beneficial impact of stronger pocket scoring schemes, with improvements up to 14% in recall (IF-SitePred) and 30% in precision (Surfnet). Finally, we propose top-N+2 recall as the universal benchmark metric for ligand binding site prediction and urge authors to share not only the source code of their methods, but also of their benchmark.
Scientific contributions
This study conducts the largest benchmark of ligand binding site prediction methods to date, comparing 13 original methods and 15 variants using 10 informative metrics. The LIGYSIS dataset is introduced, which aggregates biologically relevant protein-ligand interfaces across multiple structures of the same protein. The study highlights the detrimental effect of redundant binding site prediction and demonstrates significant improvement in recall and precision through stronger scoring schemes. Finally, top-N+2 recall is proposed as a universal benchmark metric for ligand binding site prediction, with a recommendation for open-source sharing of both methods and benchmarks.
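The top-N+2 recall metric named in this abstract can be sketched for a single protein as follows. The sketch assumes that each predicted pocket carries a score and a centroid, and that a prediction counts as a hit when its centroid lies within a distance cutoff of an observed site centroid; the 12 Å cutoff, the function name, and the data layout are illustrative assumptions rather than details stated in the abstract.

```python
import math


def top_n_plus_2_recall(true_sites, predictions, cutoff=12.0):
    """Fraction of observed sites hit by the top N+2 ranked predictions.

    true_sites:  list of (x, y, z) centroids of the N observed sites
    predictions: list of (score, (x, y, z)) predicted pockets
    cutoff:      assumed centroid-distance threshold, in angstroms
    """
    n = len(true_sites)
    if n == 0:
        return 0.0
    # Rank predictions by descending score and keep only the top N+2.
    ranked = sorted(predictions, key=lambda p: p[0], reverse=True)[: n + 2]
    hits = 0
    for site in true_sites:
        if any(math.dist(site, centre) <= cutoff for _, centre in ranked):
            hits += 1
    return hits / n


# Hypothetical protein with two observed sites and five predicted pockets.
true_sites = [(0.0, 0.0, 0.0), (50.0, 0.0, 0.0)]
predictions = [
    (0.9, (1.0, 1.0, 0.0)),    # close to the first site
    (0.8, (100.0, 0.0, 0.0)),  # false positive
    (0.7, (101.0, 0.0, 0.0)),  # redundant false positive
    (0.6, (102.0, 0.0, 0.0)),  # redundant false positive
    (0.5, (49.0, 0.0, 0.0)),   # close to the second site, but ranked fifth
]
recall = top_n_plus_2_recall(true_sites, predictions)
```

With N = 2, only the four best-scored pockets are considered, so the correct fifth pocket is ignored and half of the sites are recovered. This mirrors the abstract's point that redundant predictions and weak scoring can depress recall.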
Affiliation(s)
- Javier S Utgés
- Division of Computational Biology, School of Life Sciences, University of Dundee, Dow Street, Dundee, DD1 5EH, Scotland, UK
| | - Geoffrey J Barton
- Division of Computational Biology, School of Life Sciences, University of Dundee, Dow Street, Dundee, DD1 5EH, Scotland, UK.
|
41
|
Hu J, Chen KX, Rao B, Ni JY, Thafar MA, Albaradei S, Arif M. Protein-peptide binding residue prediction based on protein language models and cross-attention mechanism. Anal Biochem 2024; 694:115637. [PMID: 39121938 DOI: 10.1016/j.ab.2024.115637] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2024] [Revised: 07/28/2024] [Accepted: 08/06/2024] [Indexed: 08/12/2024]
Abstract
Accurate identification of protein-peptide binding residues is essential for understanding protein-peptide interactions and advancing drug discovery. To address this problem, extensive research efforts have been made to design more discriminative feature representations. However, extracting these explicit features usually depends on third-party tools, resulting in low computational efficiency and low predictive performance. In this study, we design an end-to-end deep learning-based method, E2EPep, for protein-peptide binding residue prediction using protein sequence only. E2EPep first employs and fine-tunes two state-of-the-art pre-trained protein language models that can extract two different high-latent feature representations from protein sequences relevant for protein structures and functions. A novel feature fusion module is then designed in E2EPep to fuse and optimize the above two feature representations of binding residues. In addition, we have also designed E2EPep+, which integrates the E2EPep and PepBCL models, to improve the prediction performance. Experimental results on two independent testing data sets demonstrate that E2EPep and E2EPep+ could achieve average AUC values of 0.846 and 0.842 while achieving an average Matthews correlation coefficient value that is significantly higher than that of most existing sequence-based methods and comparable to that of the state-of-the-art structure-based predictors. Detailed data analysis shows that the primary strength of E2EPep lies in the effectiveness of feature representation using a cross-attention mechanism to fuse the embeddings generated by two fine-tuned protein language models. The standalone package of E2EPep and E2EPep+ can be obtained at https://github.com/ckx259/E2EPep.git for academic use only.
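A cross-attention fusion of two per-residue embedding sets, as mentioned in this abstract, can be sketched in plain Python. This is a minimal illustration under strong assumptions: both embeddings share the same dimension, no learned query, key, or value projections are applied, and the fused vector is a simple concatenation. None of these choices is claimed to match the E2EPep implementation.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys_values):
    """Let each residue embedding in `queries` attend over `keys_values`.

    Returns, per residue, the query embedding concatenated with the
    attention-weighted average of the other model's embeddings.
    """
    d = len(queries[0])
    fused = []
    for q in queries:
        # Scaled dot-product score against every position of the other model.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys_values]
        weights = softmax(scores)
        attended = [sum(w * kv[j] for w, kv in zip(weights, keys_values))
                    for j in range(len(keys_values[0]))]
        fused.append(list(q) + attended)
    return fused


# Hypothetical embeddings of a three-residue sequence from two language models.
emb_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb_b = [[0.5, 0.1], [0.2, 0.9], [0.7, 0.7]]
fused = cross_attention(emb_a, emb_b)
```

Each fused row keeps the first model's view of the residue and appends a context vector drawn from the second model, which is the general idea behind attention-based feature fusion.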
Affiliation(s)
- Jun Hu
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China; Center for AI and Computational Biology, Suzhou Institution of Systems Medicine, Suzhou, 215123, China.
| | - Kai-Xin Chen
- College of Information Engineering, Zhejiang University of Technology, Hangzhou, 310023, China
| | - Bing Rao
- School of Information & Electrical Engineering, Hangzhou City University, Hangzhou, 310015, China
| | - Jing-Yuan Ni
- NUIST Reading Academy, Nanjing University of Information Science & Technology, Nanjing, 210044, China
| | - Maha A Thafar
- Department of Computer Science, College of Computers and Information Technology, Taif University, Taif, 21944, Saudi Arabia
| | - Somayah Albaradei
- Department of Computer Science, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
| | - Muhammad Arif
- College of Science and Engineering, Hamad Bin Khalifa University, Doha, 34110, Qatar.
|
42
|
Long Y, Donald BR. Predicting Affinity Through Homology (PATH): Interpretable Binding Affinity Prediction with Persistent Homology. bioRxiv: The Preprint Server for Biology 2024:2023.11.16.567384. [PMID: 38014181 PMCID: PMC10680814 DOI: 10.1101/2023.11.16.567384] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
Accurate binding affinity prediction is crucial to structure-based drug design. Recent work used computational topology to obtain an effective representation of protein-ligand interactions. While algorithms using algebraic topology have proven useful in predicting properties of biomolecules, previous algorithms employed uninterpretable machine learning models which failed to explain the underlying geometric and topological features that drive accurate binding affinity prediction. Moreover, they had high computational complexity which made them intractable for large proteins. We present the fastest known algorithm to compute persistent homology features for protein-ligand complexes using opposition distance, with a runtime that is independent of the protein size. Then, we exploit these features in a novel, interpretable algorithm to predict protein-ligand binding affinity. Our algorithm achieves interpretability through an effective embedding of distances across bipartite matchings of the protein and ligand atoms into real-valued functions by summing Gaussians centered at features constructed by persistent homology. We name these functions internuclear persistent contours (IPCs). Next, we introduce persistence fingerprints, a vector with 10 components that sketches the distances of different bipartite matching between protein and ligand atoms, refined from IPCs. Let the number of protein atoms in the protein-ligand complex be $n$, the number of ligand atoms be $m$, and $\omega \approx 2.4$ be the matrix multiplication exponent. We show that for any $0 < \varepsilon < 1$, after an $\mathcal{O}(mn \log(mn))$ preprocessing procedure, we can compute an $\varepsilon$-accurate approximation to the persistence fingerprint in $\mathcal{O}(m \log^{6\omega}(m/\varepsilon))$ time, independent of protein size. This is an improvement in time complexity by a factor of $\mathcal{O}((m+n)^3)$ over any previous binding affinity prediction that uses persistent homology.
We show that the representational power of persistence fingerprint generalizes to protein-ligand binding datasets beyond the training dataset. Then, we introduce PATH, Predicting Affinity Through Homology, a two-part algorithm consisting of PATH+ and PATH-. PATH+ is an interpretable, small ensemble of shallow regression trees for binding affinity prediction from persistence fingerprints. We show that despite using 1,400-fold fewer features, PATH+ has comparable performance to a previous state-of-the-art binding affinity prediction algorithm that uses persistent homology. Moreover, PATH+ has the advantage of being interpretable. We visualize the features captured by persistence fingerprint for variant HIV-1 protease complexes and show that persistence fingerprint captures binding-relevant structural mutations. PATH-, in turn, uses regression trees over IPCs to differentiate between binding and decoy complexes. Finally, we benchmarked PATH versus established binding affinity prediction algorithms spanning physics-based, knowledge-based, and deep learning methods, revealing that PATH has comparable or better performance with less overfitting, compared to these state-of-the-art methods. The source code for PATH is released open-source as part of the osprey protein design software package.
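The idea of turning a set of topological features into a real-valued function by summing Gaussians, as described for IPCs, can be sketched as below. The sketch assumes the features are already reduced to scalar distance values; the bandwidth `sigma`, the function name, and the toy feature values are hypothetical and do not reproduce the construction used in PATH.

```python
import math


def gaussian_sum_contour(centres, sigma=0.5):
    """Return f(x) = sum_i exp(-(x - c_i)^2 / (2 * sigma^2)).

    centres: scalar feature values (assumed here to be distances derived
             from persistent homology)
    sigma:   assumed Gaussian bandwidth
    """
    def contour(x):
        return sum(math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))
                   for c in centres)
    return contour


# Toy features: two close together and one isolated.
features = [2.0, 2.1, 3.5]
f = gaussian_sum_contour(features)
samples = [f(x) for x in (2.0, 3.5, 10.0)]
```

The function peaks where features cluster and decays towards zero far from all of them, which is what makes such a representation easy to inspect.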
|
43
|
Mohamed SF, Narayanan R. Enterobacter cloacae-mediated polymer biodegradation: in-silico analysis predicts broad spectrum degradation potential by Alkane monooxygenase. Biodegradation 2024; 35:969-991. [PMID: 39001975 DOI: 10.1007/s10532-024-10091-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/19/2024] [Accepted: 07/03/2024] [Indexed: 07/15/2024]
Abstract
Plastic pollution poses a significant environmental challenge. In this study, Enterobacter cloacae O5-E, a bacterial strain displaying polyethylene-degrading capabilities, was isolated. Over a span of 30 days, analytical techniques including X-ray diffractometry, scanning electron microscopy, optical profilometry, hardness testing and mass spectrometric analysis were employed to examine alterations in the polymer. Results revealed an 11.48% reduction in crystallinity, a 50% decrease in hardness, and a substantial 25-fold increase in surface roughness resulting from the pits and cracks introduced in the polymer by the isolate. Additionally, the presence of degradation by-products revealed via gas chromatography confirms the steady progression of degradation. Further, recognizing the pivotal role of alkane monooxygenase in plastic degradation, the study expanded to detect this enzyme in the isolate molecularly. Molecular docking studies were conducted to assess the enzyme's affinity with various polymers, demonstrating notable binding capability with most polymers, especially with polyurethane (-5.47 kcal/mol). These findings highlight the biodegradation potential of Enterobacter cloacae O5-E and the crucial involvement of alkane monooxygenase in the initial steps of the degradation process, offering a promising avenue to address the global plastic pollution crisis.
Affiliation(s)
- Shafana Farveen Mohamed
- Department of Genetic Engineering, School of Bioengineering and Faculty of Engineering and Technology, College of Engineering & Technology (CET), SRM Institute of Science and Technology, Kattankulathur, Kanchipuram, Chennai, Tamil Nadu, 603203, India
| | - Rajnish Narayanan
- Department of Genetic Engineering, School of Bioengineering and Faculty of Engineering and Technology, College of Engineering & Technology (CET), SRM Institute of Science and Technology, Kattankulathur, Kanchipuram, Chennai, Tamil Nadu, 603203, India.
|
44
|
Li Y, Nan X, Zhang S, Zhou Q, Lu S, Tian Z. PMSFF: Improved Protein Binding Residues Prediction through Multi-Scale Sequence-Based Feature Fusion Strategy. Biomolecules 2024; 14:1220. [PMID: 39456153 PMCID: PMC11506650 DOI: 10.3390/biom14101220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2024] [Revised: 09/22/2024] [Accepted: 09/24/2024] [Indexed: 10/28/2024] Open
Abstract
Proteins perform different biological functions by binding to various molecules. These interactions are mediated by a few key residues, and accurate prediction of such protein binding residues (PBRs) is crucial for understanding cellular processes and for designing new drugs. Many computational prediction approaches have been proposed to identify PBRs with sequence-based features. However, these approaches face two main challenges: (1) these methods only concatenate residue feature vectors with a simple sliding window strategy, and (2) it is challenging to find a uniform sliding window size suitable for learning embeddings across different types of PBRs. In this study, we propose a novel framework that can be applied to multiple types of PBR prediction tasks through a Multi-scale Sequence-based Feature Fusion (PMSFF) strategy. Firstly, PMSFF employs a pre-trained language model named ProtT5 to encode amino acid residues in protein sequences. Then, it generates multi-scale residue embeddings by applying multi-size windows to capture effective neighboring residues and multi-size kernels to learn information across different scales. Additionally, the proposed model treats protein sequences as sentences, employing a bidirectional GRU to learn global context. We also collect benchmark datasets encompassing various PBR types and evaluate our PMSFF approach on these datasets. Compared with state-of-the-art methods, PMSFF demonstrates superior performance on most PBR prediction tasks.
Affiliation(s)
- Yuguang Li
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China; (Y.L.); (X.N.); (Q.Z.)
| | - Xiaofei Nan
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China; (Y.L.); (X.N.); (Q.Z.)
| | - Shoutao Zhang
- School of Life Sciences, Zhengzhou University, Zhengzhou 450001, China;
- Longhu Laboratory of Advanced Immunology, Zhengzhou 450001, China
| | - Qinglei Zhou
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China; (Y.L.); (X.N.); (Q.Z.)
| | - Shuai Lu
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China; (Y.L.); (X.N.); (Q.Z.)
- National Supercomputing Center in Zhengzhou, Zhengzhou University, Zhengzhou 450001, China
| | - Zhen Tian
- School of Computer and Artificial Intelligence, Zhengzhou University, Zhengzhou 450001, China; (Y.L.); (X.N.); (Q.Z.)
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324003, China
|
45
|
Song Y, Yuan Q, Chen S, Zeng Y, Zhao H, Yang Y. Accurately predicting enzyme functions through geometric graph learning on ESMFold-predicted structures. Nat Commun 2024; 15:8180. [PMID: 39294165 PMCID: PMC11411130 DOI: 10.1038/s41467-024-52533-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2024] [Accepted: 09/11/2024] [Indexed: 09/20/2024] Open
Abstract
Enzymes are crucial in numerous biological processes, with the Enzyme Commission (EC) number being a commonly used method for defining enzyme function. However, current EC number prediction technologies have not fully recognized the importance of enzyme active sites and structural characteristics. Here, we propose GraphEC, a geometric graph learning-based EC number predictor using the ESMFold-predicted structures and a pre-trained protein language model. Specifically, we first construct a model to predict the enzyme active sites, which is utilized to predict the EC number. The prediction is further improved through a label diffusion algorithm by incorporating homology information. In parallel, the optimum pH of enzymes is predicted to reflect the enzyme-catalyzed reactions. Experiments demonstrate the superior performance of our model in predicting active sites, EC numbers, and optimum pH compared to other state-of-the-art methods. Additional analysis reveals that GraphEC is capable of extracting functional information from protein structures, emphasizing the effectiveness of geometric graph learning. This technology can be used to identify unannotated enzyme functions, as well as to predict their active sites and optimum pH, with the potential to advance research in synthetic biology, genomics, and other fields.
Affiliation(s)
- Yidong Song
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, China
| | - Qianmu Yuan
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, China
- High Performance Computing Department, National Supercomputing Center in Shenzhen, Shenzhen, Guangdong, China
| | - Sheng Chen
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, China
| | - Yuansong Zeng
- School of Big Data & Software Engineering, Chongqing University, Chongqing, China
| | - Huiying Zhao
- Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, Guangdong, China
| | - Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, Guangdong, China.
- Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Guangzhou, China.
|
46
|
Shafiee S, Fathi A, Taherzadeh G. DP-site: A dual deep learning-based method for protein-peptide interaction site prediction. Methods 2024; 229:17-29. [PMID: 38871095 DOI: 10.1016/j.ymeth.2024.06.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 04/22/2024] [Accepted: 06/01/2024] [Indexed: 06/15/2024] Open
Abstract
BACKGROUND: Protein-peptide interaction prediction is an important topic for several applications, including understanding various biological processes, drug discovery, protein function, abnormal cellular behaviors, and treating diseases. Over the years, studies have shown that experimental methods have improved the identification of this bio-molecular interaction. However, predicting protein-peptide interactions using these methods is laborious, time-consuming, dependent on third-party tools, and costly.
METHOD: To address these drawbacks, this study introduces a computational framework called DP-Site. The proposed framework uses a dual pipeline together with a combination predictor. A deep convolutional neural network for feature extraction and classification is embedded in pipeline 1. In addition, pipeline 2 includes a deep long short-term memory-based network and a random forest classifier for feature extraction and classification. In this investigation, the evolutionary, structure-based, sequence-based, and physicochemical information of proteins is utilized for identifying protein-peptide interaction at the residue level.
RESULTS: The proposed method is evaluated on both the ten-fold cross-validation and independent test sets. The robust and consistent results between cross-validation and independent test sets confirm the ability of the proposed method to predict peptide binding residues in proteins. Moreover, experimental findings demonstrate that DP-Site has significantly outperformed other state-of-the-art sequence-based and structure-based methods. The proposed method achieves a remarkable balance between a specificity of 0.799 and a sensitivity of 0.770, along with the best F-measure of 0.661 and the highest precision of 0.580 using an independent test set.
CONCLUSIONS: The outcome of various experiments confirms the proficiency of the proposed method, which outperforms state-of-the-art sequence-based and structure-based methods in terms of the mentioned criteria. DP-Site can be accessed at https://github.com/shafiee 95/shima.shafiee.DP-Site.
Affiliation(s)
- Shima Shafiee
- Department of Computer Engineering and Information Technology, Razi University, Kermanshah, Iran.
- Abdolhossein Fathi
- Department of Computer Engineering and Information Technology, Razi University, Kermanshah, Iran.
- Ghazaleh Taherzadeh
- Department of Math, Physics, and Computer Science, Wilkes University, Pennsylvania, USA.
47
Wang B, Li W. Advances in the Application of Protein Language Modeling for Nucleic Acid Protein Binding Site Prediction. Genes (Basel) 2024; 15:1090. [PMID: 39202449 PMCID: PMC11353971 DOI: 10.3390/genes15081090] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2024] [Revised: 08/13/2024] [Accepted: 08/14/2024] [Indexed: 09/03/2024]
Abstract
Protein and nucleic acid binding site prediction is a critical computational task that benefits a wide range of biological processes. Previous studies have shown that feature selection holds particular significance for this prediction task, making the generation of more discriminative features a key area of interest for many researchers. Recent progress has shown the power of protein language models in handling protein sequences, in leveraging the strengths of attention networks, and in successful applications to tasks such as protein structure prediction. This naturally raises the question of the applicability of protein language models in predicting protein and nucleic acid binding sites. Various approaches have explored this potential. This paper first describes the development of protein language models. Then, a systematic review of the latest methods for predicting protein and nucleic acid binding sites is conducted by covering benchmark sets, feature generation methods, performance comparisons, and feature ablation studies. These comparisons demonstrate the importance of protein language models for the prediction task. Finally, the paper discusses the challenges of protein and nucleic acid binding site prediction and proposes possible research directions and future trends. The purpose of this survey is to furnish researchers with actionable suggestions for comprehending the methodologies used in predicting protein-nucleic acid binding sites, fostering the creation of protein-centric language models, and tackling real-world obstacles encountered in this field.
Affiliation(s)
- Wenjin Li
- Institute for Advanced Study, Shenzhen University, Shenzhen 518061, China
48
Jang YJ, Qin QQ, Huang SY, Peter ATJ, Ding XM, Kornmann B. Accurate prediction of protein function using statistics-informed graph networks. Nat Commun 2024; 15:6601. [PMID: 39097570 PMCID: PMC11297950 DOI: 10.1038/s41467-024-50955-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/17/2023] [Accepted: 07/15/2024] [Indexed: 08/05/2024]
Abstract
Understanding protein function is pivotal in comprehending the intricate mechanisms that underlie many crucial biological activities, with far-reaching implications in the fields of medicine, biotechnology, and drug development. However, more than 200 million proteins remain uncharacterized, and computational efforts heavily rely on protein structural information to predict annotations of varying quality. Here, we present a method that utilizes statistics-informed graph networks to predict a protein's functions solely from its sequence. Our method, PhiGnet, inherently characterizes evolutionary signatures, allowing for a quantitative assessment of the significance of residues that carry out specific functions. PhiGnet not only demonstrates superior performance compared to alternative approaches but also narrows the sequence-function gap, even in the absence of structural information. Our findings indicate that applying deep learning to evolutionary data can highlight functional sites at the residue level, providing valuable support for interpreting both existing properties and new functionalities of proteins in research and biomedicine.
Affiliation(s)
- Yaan J Jang
- Department of Biochemistry, University of Oxford, Oxford, UK.
- AmoAi Technologies, Oxford, UK.
- Qi-Qi Qin
- AmoAi Technologies, Oxford, UK
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
- Si-Yu Huang
- AmoAi Technologies, Oxford, UK
- Oxford Martin School, University of Oxford, Oxford, UK
- School of Systems Science, Beijing Normal University, Beijing, China
- Xue-Ming Ding
- School of Optical-Electrical and Computer Engineering, University of Shanghai for Science and Technology, Shanghai, China
- Benoît Kornmann
- Department of Biochemistry, University of Oxford, Oxford, UK.
49
Chen L, Li Q, Nasif KFA, Xie Y, Deng B, Niu S, Pouriyeh S, Dai Z, Chen J, Xie CY. AI-Driven Deep Learning Techniques in Protein Structure Prediction. Int J Mol Sci 2024; 25:8426. [PMID: 39125995 PMCID: PMC11313475 DOI: 10.3390/ijms25158426] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2024] [Revised: 07/29/2024] [Accepted: 07/29/2024] [Indexed: 08/12/2024]
Abstract
Protein structure prediction is important for understanding protein function and behavior. This review presents a comprehensive survey of the computational models used to predict protein structure, covering the progression from established protein modeling to state-of-the-art artificial intelligence (AI) frameworks. The paper starts with a brief introduction to protein structures, protein modeling, and AI. The section on established protein modeling discusses homology modeling, ab initio modeling, and threading. The next section, on deep learning-based models, introduces state-of-the-art AI models such as AlphaFold (AlphaFold, AlphaFold2, AlphaFold3), RoseTTAFold, and ProteinBERT. This section also discusses how AI techniques have been integrated into established frameworks like Swiss-Model, Rosetta, and I-TASSER. Model performance is compared using the rankings of CASP14 (Critical Assessment of Structure Prediction) and CASP15. CASP16 is ongoing, and its results are not included in this review. Continuous Automated Model EvaluatiOn (CAMEO) complements the biennial CASP experiment. The template modeling score (TM-score), global distance test total score (GDT_TS), and Local Distance Difference Test (lDDT) score are also discussed. The paper then acknowledges the ongoing difficulties in predicting protein structure and emphasizes the need for further research into areas such as dynamic protein behavior, conformational changes, and protein-protein interactions. The application section introduces applications in fields such as drug design, industry, education, and novel protein development. In summary, this paper provides a comprehensive overview of the latest advancements in established protein modeling and deep learning-based models for protein structure prediction. It emphasizes the significant advancements achieved by AI and identifies potential areas for further investigation.
Affiliation(s)
- Lingtao Chen
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Qiaomu Li
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Kazi Fahim Ahmad Nasif
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Ying Xie
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Bobin Deng
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Shuteng Niu
- Department of Computer Science, Bowling Green State University, Bowling Green, OH 43403, USA
- Seyedamin Pouriyeh
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
- Zhiyu Dai
- Division of Pulmonary and Critical Care Medicine, John T. Milliken Department of Medicine, Washington University School of Medicine in St. Louis, St. Louis, MO 63110, USA
- Jiawei Chen
- College of Computing, Data Science and Society, University of California, Berkeley, CA 94720, USA
- Chloe Yixin Xie
- College of Computing and Software Engineering, Kennesaw State University, Marietta, GA 30060, USA
50
Mustafov D, Siddiqui SS, Kukol A, Lambrou GI, Shagufta, Ahmad I, Braoudaki M. MicroRNA-Dependent Mechanisms Underlying the Function of a β-Amino Carbonyl Compound in Glioblastoma Cells. ACS Omega 2024; 9:31789-31802. [PMID: 39072119 PMCID: PMC11270567 DOI: 10.1021/acsomega.4c02991] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2024] [Revised: 06/10/2024] [Accepted: 06/18/2024] [Indexed: 07/30/2024]
Abstract
Glioblastoma (GB) is an aggressive brain malignancy characterized by its invasive nature. Current treatment has limited effectiveness, resulting in poor patient prognoses. β-Amino carbonyl (β-AC) compounds have gained attention due to their potential anticancer properties. In vitro assays were performed to evaluate the effects of an in-house synthesized β-AC compound, named SHG-8, upon GB cells. Small RNA sequencing (sRNA-seq) and biocomputational analyses investigated the effects of SHG-8 upon the miRNome and its bioavailability within the human body. SHG-8 exhibited significant cytotoxicity and inhibition of cell migration and proliferation in U87MG and U251MG GB cells. GB cells treated with the compound released significant amounts of reactive oxygen species (ROS). Annexin V and acridine orange/ethidium bromide staining also demonstrated that the compound led to apoptosis. sRNA-seq revealed a shift in microRNA (miRNA) expression profiles upon SHG-8 treatment and significant upregulation of miR-3648 and downregulation of miR-7973. Real-time polymerase chain reaction (RT-qPCR) demonstrated a significant downregulation of CORO1C, an oncogene and a player in the Wnt/β-catenin pathway. In silico analysis indicated SHG-8's potential to cross the blood-brain barrier. We concluded that SHG-8's inhibitory effects on GB cells may involve the deregulation of various miRNAs and the inhibition of CORO1C.
Affiliation(s)
- Denis Mustafov
- School of Life and Medical Sciences, University of Hertfordshire, Hatfield, AL10 9AB, United Kingdom
- College of Health, Medicine and Life Sciences, Brunel University London, Uxbridge UB8 3PH, United Kingdom
- Shoib S. Siddiqui
- School of Life and Medical Sciences, University of Hertfordshire, Hatfield, AL10 9AB, United Kingdom
- Andreas Kukol
- School of Life and Medical Sciences, University of Hertfordshire, Hatfield, AL10 9AB, United Kingdom
- George I. Lambrou
- Choremeio Research Laboratory, First Department of Pediatrics, School of Medicine, National and Kapodistrian University of Athens, Athens, Greece, Thivon and Levadeias 8, Goudi, 11527 Athens, Greece
- University Research Institute of Maternal and Child Health and Precision Medicine, National and Kapodistrian University of Athens, Thivon and Levadeias 8, 11527 Athens, Greece
- Shagufta
- Department of Biotechnology, School of Arts and Sciences, American University of Ras Al Khaimah, Ras Al Khaimah, United Arab Emirates
- Irshad Ahmad
- Department of Biotechnology, School of Arts and Sciences, American University of Ras Al Khaimah, Ras Al Khaimah, United Arab Emirates
- Maria Braoudaki
- School of Life and Medical Sciences, University of Hertfordshire, Hatfield, AL10 9AB, United Kingdom
- University Research Institute of Maternal and Child Health and Precision Medicine, National and Kapodistrian University of Athens, Thivon and Levadeias 8, 11527 Athens, Greece