1. Liu Z, Qiu WR, Liu Y, Yan H, Pei W, Zhu YH, Qiu J. A comprehensive review of computational methods for Protein-DNA binding site prediction. Anal Biochem 2025; 703:115862. [PMID: 40209920 DOI: 10.1016/j.ab.2025.115862] [Received: 12/11/2024] [Revised: 03/20/2025] [Accepted: 04/06/2025] [Indexed: 04/12/2025]
Abstract
Accurately identifying protein-DNA binding sites is essential for understanding the molecular mechanisms underlying biological processes, which in turn facilitates advances in drug discovery and design. While biochemical experiments provide the most accurate way to locate DNA-binding sites, they are generally time-consuming, resource-intensive, and expensive. There is a pressing need for computational methods that are both efficient and accurate for DNA-binding site prediction. This study reviews and categorizes the major computational approaches for predicting DNA-binding sites, including template detection, statistical machine learning, and deep learning-based methods. Fourteen state-of-the-art DNA-binding site prediction models are benchmarked on 136 non-redundant proteins; the deep learning-based methods, especially those built on pre-trained large language models, outperform the other two categories. Applications of these DNA-binding site prediction methods are also discussed.
Affiliation(s)
- Zi Liu
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Wang-Ren Qiu
  - School of Information Engineering, Jingdezhen Ceramic University, Jingdezhen, 333403, China
- Yan Liu
  - Department of Computer Science, Yangzhou University, 196 Huayang West Road, Yangzhou, 225100, China
- He Yan
  - College of Information Science and Technology & Artificial Intelligence, Nanjing Forestry University, 159 Longpanlu Road, Nanjing, 210037, China
- Wenyi Pei
  - Geriatric Department, Shanghai Baoshan District Wusong Central Hospital, 101 Tongtai North Road, Shanghai, 200940, China
- Yi-Heng Zhu
  - College of Artificial Intelligence, Nanjing Agricultural University, 1 Weigang Road, Nanjing, 210095, China
- Jing Qiu
  - Information Department, The First Affiliated Hospital of Naval Medical University, 168 Changhai Road, Shanghai, 200433, China
2. Xia R, Li W, Cheng Y, Xie L, Xu X. Molecular surfaces modeling: Advancements in deep learning for molecular interactions and predictions. Biochem Biophys Res Commun 2025; 763:151799. [PMID: 40239539 DOI: 10.1016/j.bbrc.2025.151799] [Received: 01/29/2025] [Revised: 03/20/2025] [Accepted: 04/10/2025] [Indexed: 04/18/2025]
Abstract
Molecular surface analysis can provide a high-dimensional, rich representation of molecular properties and interactions, which is crucial for enabling powerful predictive modeling and rational molecular design across diverse scientific and technological domains. With remarkable successes achieved by artificial intelligence (AI) in different fields such as computer vision and natural language processing, there is a growing imperative to harness AI's potential in accelerating molecular discovery and innovation. The integration of AI techniques with molecular surface analysis has opened up new frontiers, allowing researchers to uncover hidden patterns, relationships, and design principles that were previously elusive. By leveraging the complementary strengths of molecular surface representations and advanced AI algorithms, scientists can now explore chemical space more efficiently, optimize molecular properties with greater precision, and drive transformative advancements in areas like drug development, materials engineering, and catalysis. In this review, we aim to provide an overview of recent advancements in the field of molecular surface analysis and its integration with AI techniques. These AI-driven approaches have led to significant advancements in various downstream tasks, including interface site prediction, protein-protein interaction prediction, surface-centric molecular generation and design.
Affiliation(s)
- Renjie Xia
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, 213001, China
- Wei Li
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, 213001, China
- Yi Cheng
  - College of Engineering, Lishui University, Lishui, 323000, China
- Liangxu Xie
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, 213001, China
- Xiaojun Xu
  - Institute of Bioinformatics and Medical Engineering, Jiangsu University of Technology, Changzhou, 213001, China
3. Chen G, Hou L, Li Z, Xie B, Liu Y. A new strategy for Cas protein recognition based on graph neural networks and SMILES encoding. Sci Rep 2025; 15:15236. [PMID: 40307455 PMCID: PMC12043993 DOI: 10.1038/s41598-025-99999-2] [Received: 11/27/2024] [Accepted: 04/24/2025] [Indexed: 05/02/2025] Open
Abstract
The CRISPR-Cas system, an adaptive immune mechanism found in bacteria and archaea, has evolved into a promising genomic editing tool, with various types of Cas proteins playing a crucial role. In this study, we developed a set of strategies for mining and identifying Cas1 proteins. Firstly, we analyzed the characteristic differences of 14 types of Cas proteins in the protein large language model embedding space in detail; then converted proteins into the Simplified Molecular Input Line Entry System (SMILES) format, thereby constructing graph data representing atom and bond features. Next, based on the characteristic differences of different Cas proteins, we designed and trained an ensemble model composed of two Directed Message Passing Neural Network (DMPNN) models for high-precision identification of Cas1 proteins. This ensemble model performed excellently on both training data and newly designed datasets. The comparison of this method with other methods, such as CRISPRCasFinder, has demonstrated its effectiveness. Finally, the ensemble model was successfully employed to identify potential Cas1 proteins in the Ensemble database, further highlighting its robustness and practicality. The strategies and models from this research may potentially be extended to other types of Cas proteins, though this would require further investigation and validation. Moreover, our work highlights SMILES encoding as a versatile tool for studying biological macromolecules, enabling efficient structural representation and advanced computational applications in protein research and beyond.
Affiliation(s)
- Gaoxiang Chen
  - Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China
- Liya Hou
  - Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China
- Zhanwei Li
  - Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China
- Bin Xie
  - Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China
- Yongqiang Liu
  - Zhejiang Laboratory, Research Center for Life Sciences Computing, Hangzhou, 311100, China
4. Tytarenko A, Singh A, Ambati VK, Copeland MM, Kundrotas PJ, Halfmann R, Kasyanov PO, Feinberg EA, Vakser IA. Highly Optimized Simulation of Atomic Resolution Cell-Like Protein Environment. J Phys Chem B 2025; 129:3183-3190. [PMID: 40077832 PMCID: PMC11956777 DOI: 10.1021/acs.jpcb.4c07769] [Received: 11/17/2024] [Revised: 03/03/2025] [Accepted: 03/06/2025] [Indexed: 03/14/2025]
Abstract
Computational approaches can provide details of molecular mechanisms in the crowded environment inside cells. Protein docking predicts stable configurations of molecular complexes, which correspond to deep energy minima. Systematic docking approaches, such as those based on fast Fourier transform (FFT), also map the entire intermolecular energy landscape by determining the position and depth of the full spectrum of the energy minima. Such mapping allows speeding up simulations by precalculating the intermolecular energy values. Our earlier study combined FFT docking with the Monte Carlo protocol, enabling simulation of cell-size, crowded protein systems at atomic resolution with trajectories of seconds and longer, several orders of magnitude longer than those achievable by alternative approaches. In this study, we present a further drastic extension of the modeling capabilities through a parallelized implementation of the simulation protocol. The procedure was applied to a panel of Death Fold Domains that form nucleated polymers in human innate immune signaling, recapitulating their homooligomerization tendencies and providing insights into the molecular mechanisms of polymer nucleation. The parallelized protocol allows extension of the simulation trajectories by orders of magnitude beyond the previously reported implementation, reaching into the uncharted territory of atomic resolution simulation of cell-sized systems.
Affiliation(s)
- Andrii M. Tytarenko
  - Institute for Applied System Analysis at the Igor Sikorsky Kyiv Polytechnic Institute, Kyiv 03056, Ukraine
- Amar Singh
  - Computational Biology Program, The University of Kansas, Lawrence, Kansas 66045, United States
- Vineeth Kumar Ambati
  - Computational Biology Program, The University of Kansas, Lawrence, Kansas 66045, United States
- Matthew M. Copeland
  - Computational Biology Program, The University of Kansas, Lawrence, Kansas 66045, United States
- Petras J. Kundrotas
  - Computational Biology Program, The University of Kansas, Lawrence, Kansas 66045, United States
- Randal Halfmann
  - Stowers Institute for Medical Research, Kansas City, Missouri 64110, United States
  - Department of Biochemistry and Molecular Biology, University of Kansas Medical Center, Kansas City, Kansas 66160, United States
- Pavlo O. Kasyanov
  - Institute for Applied System Analysis at the Igor Sikorsky Kyiv Polytechnic Institute, Kyiv 03056, Ukraine
- Eugene A. Feinberg
  - Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, New York 11794, United States
- Ilya A. Vakser
  - Computational Biology Program, The University of Kansas, Lawrence, Kansas 66045, United States
  - Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas 66045, United States
5. Butt W, Lai B, Chiu TP, Bhattarai M, Qian S, Bishop AR, Duan J, Alexandrov BS, Rohs R, He X. Contribution of DNA breathing to physical interactions with transcription factors. bioRxiv 2025:2025.01.20.633840. [PMID: 39896490 PMCID: PMC11785057 DOI: 10.1101/2025.01.20.633840] [Indexed: 02/04/2025]
Abstract
Interactions between transcription factors (TFs) and DNA play a key role in regulating gene expression. It is generally believed that these interactions are controlled through recognition of DNA core motifs by TFs. Nevertheless, several studies have pointed out the limitations of this view; in particular, DNA sequence variants influencing TF binding are often located outside of core motifs. One possible explanation is that the physical properties of DNA play a role in TF-DNA interactions. Recent studies have supported the importance of DNA shape features, especially in the flanking regions of core motifs. Another important physical property of DNA is DNA breathing, the spontaneous opening of double-stranded DNA through thermal motions. However, there have been few genomic studies of the role of DNA breathing in TF-DNA interactions. In this work, we analyzed in vitro TF-DNA binding data of three TFs and found that DNA breathing features inside or near core motifs are correlated with binding affinity. This suggests that these TFs may prefer locally and temporally melted DNA formed through breathing. We extended the analysis to 44 TFs with in vivo ChIP-seq binding data. We found that for a large proportion of TFs, DNA breathing features in or near core motifs are associated with binding, but the sign and magnitude of these associations vary substantially across TF families. Altogether, our study supports the hypothesis that DNA breathing features near binding motifs contribute to TF-DNA interactions.
Affiliation(s)
- Waqaas Butt
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Ben Lai
  - Toyota Technology Institute of Chicago, Chicago, Illinois, United States of America
- Tsu-Pei Chiu
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, United States of America
- Manish Bhattarai
  - Theoretical Division, Los Alamos National Lab, Los Alamos, New Mexico, United States of America
- Sheng Qian
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Alan R. Bishop
  - Theoretical Division, Los Alamos National Lab, Los Alamos, New Mexico, United States of America
- Jubao Duan
  - Center for Psychiatric Genetics, NorthShore University HealthSystem Research Institute, Chicago, Illinois, United States of America
- Boian S. Alexandrov
  - Theoretical Division, Los Alamos National Lab, Los Alamos, New Mexico, United States of America
- Remo Rohs
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California, United States of America
  - Departments of Chemistry, Physics & Astronomy, and Computer Science, University of Southern California, Los Angeles, California, United States of America
- Xin He
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
6. Wu S, Xu J, Guo JT. Accurate prediction of nucleic acid binding proteins using protein language model. Bioinformatics Advances 2025; 5:vbaf008. [PMID: 39990254 PMCID: PMC11845279 DOI: 10.1093/bioadv/vbaf008] [Received: 09/22/2024] [Revised: 12/20/2024] [Accepted: 01/15/2025] [Indexed: 02/25/2025]
Abstract
Motivation: Nucleic acid binding proteins (NABPs) play critical roles in a variety of essential biological processes. Many machine learning-based methods have been developed to predict different types of NABPs. However, most of these methods have limited applicability when predicting the NABP type of an arbitrary protein of unknown function, owing to factors such as dataset construction, prediction scope, and the features used for training and testing. In addition, the identification of novel single-stranded DNA-binding proteins (SSBs) among proteins of unknown function has not been extensively investigated.
Results: To improve the prediction accuracy of different types of NABPs for any given protein, we developed hierarchical and multi-class models using machine learning-based methods and a feature extracted from the protein language model ESM2. Our results show that by combining the ESM2 feature with machine learning methods, we can achieve prediction accuracy of up to 95% for each stage in the hierarchical approach, and 85% overall prediction accuracy with the multi-class approach. More importantly, besides the much improved prediction of other types of NABPs, the models can be used to accurately predict SSBs, a class that remains underexplored.
Availability and implementation: The datasets and code can be found at https://figshare.com/projects/Prediction_of_nucleic_acid_binding_proteins_using_protein_language_model/211555.
Affiliation(s)
- Siwen Wu
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, United States
- Jinbo Xu
  - Toyota Technological Institute at Chicago, Chicago, IL 60637, United States
- Jun-tao Guo
  - Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, United States
7. Mitra R, Cohen AS, Sagendorf JM, Berman HM, Rohs R. DNAproDB: an updated database for the automated and interactive analysis of protein-DNA complexes. Nucleic Acids Res 2025; 53:D396-D402. [PMID: 39494533 PMCID: PMC11701736 DOI: 10.1093/nar/gkae970] [Received: 09/08/2024] [Revised: 10/07/2024] [Accepted: 10/11/2024] [Indexed: 11/05/2024] Open
Abstract
DNAproDB (https://dnaprodb.usc.edu/) is a database, visualization tool, and processing pipeline for analyzing structural features of protein-DNA interactions. Here, we present a substantially updated version of the database through additional structural annotations, search, and user interface functionalities. The update expands the number of pre-analyzed protein-DNA structures, which are automatically updated weekly. The analysis pipeline identifies water-mediated hydrogen bonds that are incorporated into the visualizations of protein-DNA complexes. Tertiary structure-aware nucleotide layouts are now available. New file formats and external database annotations are supported. The website has been redesigned, and interacting with graphs and data is more intuitive. We also present a statistical analysis on the updated collection of structures revealing salient patterns in protein-DNA interactions.
Affiliation(s)
- Raktim Mitra
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Ari S Cohen
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Jared M Sagendorf
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Helen M Berman
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
  - Department of Chemistry and Chemical Biology, Rutgers, The State University of New Jersey, 174 Frelinghuysen Road, Piscataway, NJ 08854, USA
- Remo Rohs
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
  - Department of Chemistry, University of Southern California, Los Angeles, CA 90089, USA
  - Department of Physics & Astronomy, University of Southern California, Los Angeles, CA 90089, USA
  - Thomas Lord Department of Computer Science, University of Southern California, Los Angeles, CA 90089, USA
8. Basu S, Yu J, Kihara D, Kurgan L. Twenty years of advances in prediction of nucleic acid-binding residues in protein sequences. Brief Bioinform 2024; 26:bbaf016. [PMID: 39833102 PMCID: PMC11745544 DOI: 10.1093/bib/bbaf016] [Received: 10/03/2024] [Revised: 12/24/2024] [Accepted: 01/06/2025] [Indexed: 01/22/2025] Open
Abstract
Computational prediction of nucleic acid-binding residues in protein sequences is an active field of research, with over 80 methods that were released in the past 2 decades. We identify and discuss 87 sequence-based predictors that include dozens of recently published methods that are surveyed for the first time. We overview historical progress and examine multiple practical issues that include availability and impact of predictors, key features of their predictive models, and important aspects related to their training and assessment. We observe that the past decade has brought increased use of deep neural networks and protein language models, which contributed to substantial gains in the predictive performance. We also highlight advancements in vital and challenging issues that include cross-predictions between deoxyribonucleic acid (DNA)-binding and ribonucleic acid (RNA)-binding residues and targeting the two distinct sources of binding annotations, structure-based versus intrinsic disorder-based. The methods trained on the structure-annotated interactions tend to perform poorly on the disorder-annotated binding and vice versa, with only a few methods that target and perform well across both annotation types. The cross-predictions are a significant problem, with some predictors of DNA-binding or RNA-binding residues indiscriminately predicting interactions with both nucleic acid types. Moreover, we show that methods with web servers are cited substantially more than tools without implementation or with no longer working implementations, motivating the development and long-term maintenance of the web servers. We close by discussing future research directions that aim to drive further progress in this area.
Affiliation(s)
- Sushmita Basu
  - Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
- Jing Yu
  - Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
- Daisuke Kihara
  - Department of Biological Sciences, Purdue University, 915 Mitch Daniels Boulevard, West Lafayette, IN 47907, United States
  - Department of Computer Science, Purdue University, 305 N. University Street, West Lafayette, IN 47907, United States
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, 401 West Main Street, Richmond, VA 23284, United States
9. Maddocks JH, Dans PD, Cheatham TH, Harris S, Laughton C, Orozco M, Pollack L, Olson WK. Special issue: Multiscale simulations of DNA from electrons to nucleosomes. Biophys Rev 2024; 16:259-262. [PMID: 39099838 PMCID: PMC11296990 DOI: 10.1007/s12551-024-01204-7] [Indexed: 08/06/2024] Open
Abstract
This editorial for Volume 16, Issue 3 of Biophysical Reviews highlights the three-dimensional structural and dynamic information encoded in DNA sequences and introduces the topics covered in this special issue of the journal on Multiscale Simulations of DNA from Electrons to Nucleosomes. Biophysical Reviews is the official journal of the International Union for Pure and Applied Biophysics (IUPAB 2024). The international scope of the articles in the issue exemplifies the goals of IUPAB to organize worldwide advancements, co-operation, communication, and education in biophysics.
Affiliation(s)
- Wilma K. Olson
  - Rutgers, the State University of New Jersey, Piscataway, NJ USA