1
|
Wu S, Xu J, Guo JT. Accurate prediction of nucleic acid binding proteins using protein language model. BIOINFORMATICS ADVANCES 2025; 5:vbaf008. [PMID: 39990254 PMCID: PMC11845279 DOI: 10.1093/bioadv/vbaf008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 09/22/2024] [Revised: 12/20/2024] [Accepted: 01/15/2025] [Indexed: 02/25/2025]
Abstract
Motivation Nucleic acid binding proteins (NABPs) play critical roles in various and essential biological processes. Many machine learning-based methods have been developed to predict different types of NABPs. However, most of these studies have limited applications in predicting the types of NABPs for any given protein with unknown functions, due to several factors such as dataset construction, prediction scope and features used for training and testing. In addition, single-stranded DNA binding proteins (DBP) (SSBs) have not been extensively investigated for identifying novel SSBs from proteins with unknown functions. Results To improve prediction accuracy of different types of NABPs for any given protein, we developed hierarchical and multi-class models with machine learning-based methods and a feature extracted from protein language model ESM2. Our results show that by combining the feature from ESM2 and machine learning methods, we can achieve high prediction accuracy up to 95% for each stage in the hierarchical approach, and 85% for overall prediction accuracy from the multi-class approach. More importantly, besides the much improved prediction of other types of NABPs, the models can be used to accurately predict single-stranded DBPs, which is underexplored. Availability and implementation The datasets and code can be found at https://figshare.com/projects/Prediction_of_nucleic_acid_binding_proteins_using_protein_language_model/211555.
Collapse
Affiliation(s)
- Siwen Wu
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, United States
| | - Jinbo Xu
- Toyota Technological Institute at Chicago, Chicago, IL 60637, United States
| | - Jun-tao Guo
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, United States
| |
Collapse
|
2
|
Pradhan UK, Naha S, Das R, Gupta A, Parsad R, Meher PK. RBProkCNN: Deep learning on appropriate contextual evolutionary information for RNA binding protein discovery in prokaryotes. Comput Struct Biotechnol J 2024; 23:1631-1640. [PMID: 38660008 PMCID: PMC11039349 DOI: 10.1016/j.csbj.2024.04.034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2024] [Revised: 04/12/2024] [Accepted: 04/12/2024] [Indexed: 04/26/2024] Open
Abstract
RNA-binding proteins (RBPs) are central to key functions such as post-transcriptional regulation, mRNA stability, and adaptation to varied environmental conditions in prokaryotes. While the majority of research has concentrated on eukaryotic RBPs, recent developments underscore the crucial involvement of prokaryotic RBPs. Although computational methods have emerged in recent years to identify RBPs, they have fallen short in accurately identifying prokaryotic RBPs due to their generic nature. To bridge this gap, we introduce RBProkCNN, a novel machine learning-driven computational model meticulously designed for the accurate prediction of prokaryotic RBPs. The prediction process involves the utilization of eight shallow learning algorithms and four deep learning models, incorporating PSSM-based evolutionary features. By leveraging a convolutional neural network (CNN) and evolutionarily significant features selected through extreme gradient boosting variable importance measure, RBProkCNN achieved the highest accuracy in five-fold cross-validation, yielding 98.04% auROC and 98.19% auPRC. Furthermore, RBProkCNN demonstrated robust performance with an independent dataset, showcasing a commendable 95.77% auROC and 95.78% auPRC. Noteworthy is its superior predictive accuracy when compared to several state-of-the-art existing models. RBProkCNN is available as an online prediction tool (https://iasri-sg.icar.gov.in/rbprokcnn/), offering free access to interested users. This tool represents a substantial contribution, enriching the array of resources available for the accurate and efficient prediction of prokaryotic RBPs.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Ritwika Das
- Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| |
Collapse
|
3
|
Li X, Wei Z, Hu Y, Zhu X. GraphNABP: Identifying nucleic acid-binding proteins with protein graphs and protein language models. Int J Biol Macromol 2024; 280:135599. [PMID: 39276905 DOI: 10.1016/j.ijbiomac.2024.135599] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2024] [Revised: 09/11/2024] [Accepted: 09/11/2024] [Indexed: 09/17/2024]
Abstract
The computational identification of nucleic acid-binding proteins (NABP) is of great significance for understanding the mechanisms of these biological activities and drug discovery. Although a bunch of sequence-based methods have been proposed to predict NABP and achieved promising performance, the structure information is often overlooked. On the other hand, the power of popular protein language models (pLM) has seldom been harnessed for predicting NABPs. In this study, we propose a novel framework called GraphNABP, to predict NABP by integrating sequence and predicted 3D structure information. Specifically, sequence embeddings and protein molecular graphs were first obtained from ProtT5 protein language model and predicted 3D structures, respectively. Then, graph attention (GAT) and bidirectional long short-term memory (BiLSTM) neural networks were used to enhance feature representations. Finally, a fully connected layer is used to predict NABPs. To the best of our knowledge, this is the first time to integrate AlphaFold and protein language models for the prediction of NABPs. The performances on multiple independent test sets indicate that GraphNABP outperforms other state-of-the-art methods. Our results demonstrate the effectiveness of pLM embeddings and structural information for NABP prediction. The codes and data used in this study are available at https://github.com/lixiangli01/GraphNABP.
Collapse
Affiliation(s)
- Xiang Li
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Zhuoyu Wei
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Yueran Hu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China
| | - Xiaolei Zhu
- School of Information and Artificial Intelligence, Anhui Agricultural University, Hefei, Anhui 230036, China.
| |
Collapse
|
4
|
Pradhan UK, Meher PK, Naha S, Sharma NK, Agarwal A, Gupta A, Parsad R. DBPMod: a supervised learning model for computational recognition of DNA-binding proteins in model organisms. Brief Funct Genomics 2024; 23:363-372. [PMID: 37651627 DOI: 10.1093/bfgp/elad039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2023] [Revised: 08/09/2023] [Accepted: 08/15/2023] [Indexed: 09/02/2023] Open
Abstract
DNA-binding proteins (DBPs) play critical roles in many biological processes, including gene expression, DNA replication, recombination and repair. Understanding the molecular mechanisms underlying these processes depends on the precise identification of DBPs. In recent times, several computational methods have been developed to identify DBPs. However, because of the generic nature of the models, these models are unable to identify species-specific DBPs with higher accuracy. Therefore, a species-specific computational model is needed to predict species-specific DBPs. In this paper, we introduce the computational DBPMod method, which makes use of a machine learning approach to identify species-specific DBPs. For prediction, both shallow learning algorithms and deep learning models were used, with shallow learning models achieving higher accuracy. Additionally, the evolutionary features outperformed sequence-derived features in terms of accuracy. Five model organisms, including Caenorhabditis elegans, Drosophila melanogaster, Escherichia coli, Homo sapiens and Mus musculus, were used to assess the performance of DBPMod. Five-fold cross-validation and independent test set analyses were used to evaluate the prediction accuracy in terms of area under receiver operating characteristic curve (auROC) and area under precision-recall curve (auPRC), which was found to be ~89-92% and ~89-95%, respectively. The comparative results demonstrate that the DBPMod outperforms 12 current state-of-the-art computational approaches in identifying the DBPs for all five model organisms. We further developed the web server of DBPMod to make it easier for researchers to detect DBPs and is publicly available at https://iasri-sg.icar.gov.in/dbpmod/. DBPMod is expected to be an invaluable tool for discovering DBPs, supplementing the current experimental and computational methods.
Collapse
Affiliation(s)
- Upendra K Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Nitesh K Sharma
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, 1540 Alcazar Street, Los Angeles, CA 90033, USA
| | - Aarushi Agarwal
- Amity Institute of Biotechnology, Amity University, Noida, Uttar Pradesh 201313, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| |
Collapse
|
5
|
Pradhan UK, Meher PK, Naha S, Das R, Gupta A, Parsad R. ProkDBP: Toward more precise identification of prokaryotic DNA binding proteins. Protein Sci 2024; 33:e5015. [PMID: 38747369 PMCID: PMC11094783 DOI: 10.1002/pro.5015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2023] [Revised: 04/18/2024] [Accepted: 04/21/2024] [Indexed: 05/19/2024]
Abstract
Prokaryotic DNA binding proteins (DBPs) play pivotal roles in governing gene regulation, DNA replication, and various cellular functions. Accurate computational models for predicting prokaryotic DBPs hold immense promise in accelerating the discovery of novel proteins, fostering a deeper understanding of prokaryotic biology, and facilitating the development of therapeutics targeting for potential disease interventions. However, existing generic prediction models often exhibit lower accuracy in predicting prokaryotic DBPs. To address this gap, we introduce ProkDBP, a novel machine learning-driven computational model for prediction of prokaryotic DBPs. For prediction, a total of nine shallow learning algorithms and five deep learning models were utilized, with the shallow learning models demonstrating higher performance metrics compared to their deep learning counterparts. The light gradient boosting machine (LGBM), coupled with evolutionarily significant features selected via random forest variable importance measure (RF-VIM) yielded the highest five-fold cross-validation accuracy. The model achieved the highest auROC (0.9534) and auPRC (0.9575) among the 14 machine learning models evaluated. Additionally, ProkDBP demonstrated substantial performance with an independent dataset, exhibiting higher values of auROC (0.9332) and auPRC (0.9371). Notably, when benchmarked against several cutting-edge existing models, ProkDBP showcased superior predictive accuracy. Furthermore, to promote accessibility and usability, ProkDBP (https://iasri-sg.icar.gov.in/prokdbp/) is available as an online prediction tool, enabling free access to interested users. This tool stands as a significant contribution, enhancing the repertoire of resources for accurate and efficient prediction of prokaryotic DBPs.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical GeneticsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Prabina Kumar Meher
- Division of Statistical GeneticsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Sanchita Naha
- Division of Computer ApplicationsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Ritwika Das
- Division of Agricultural BioinformaticsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Ajit Gupta
- Division of Statistical GeneticsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Rajender Parsad
- ICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| |
Collapse
|
6
|
Sun C, Feng Y. EPDRNA: A Model for Identifying DNA-RNA Binding Sites in Disease-Related Proteins. Protein J 2024; 43:513-521. [PMID: 38491248 DOI: 10.1007/s10930-024-10183-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 02/02/2024] [Indexed: 03/18/2024]
Abstract
Protein-DNA and protein-RNA interactions are involved in many biological processes and regulate many cellular functions. Moreover, they are related to many human diseases. To understand the molecular mechanism of protein-DNA binding and protein-RNA binding, it is important to identify which residues in the protein sequence bind to DNA and RNA. At present, there are few methods for specifically identifying the binding sites of disease-related protein-DNA and protein-RNA. In this study, so we combined four machine learning algorithms into an ensemble classifier (EPDRNA) to predict DNA and RNA binding sites in disease-related proteins. The dataset used in model was collated from UniProt and PDB database, and PSSM, physicochemical properties and amino acid type were used as features. The EPDRNA adopted soft voting and achieved the best AUC value of 0.73 at the DNA binding sites, and the best AUC value of 0.71 at the RNA binding sites in 10-fold cross validation in the training sets. In order to further verify the performance of the model, we assessed EPDRNA for the prediction of DNA-binding sites and the prediction of RNA-binding sites on the independent test dataset. The EPDRNA achieved 85% recall rate and 25% precision on the protein-DNA interaction independent test set, and achieved 82% recall rate and 27% precision on the protein-RNA interaction independent test set. The online EPDRNA webserver is freely available at http://www.s-bioinformatics.cn/epdrna .
Collapse
Affiliation(s)
- CanZhuang Sun
- College of Science, Inner Mongolia Agriculture University, Hohhot, 010018, People's Republic of China
| | - YongE Feng
- College of Science, Inner Mongolia Agriculture University, Hohhot, 010018, People's Republic of China.
| |
Collapse
|
7
|
Wu S, Guo JT. Improved prediction of DNA and RNA binding proteins with deep learning models. Brief Bioinform 2024; 25:bbae285. [PMID: 38856168 PMCID: PMC11163377 DOI: 10.1093/bib/bbae285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2024] [Revised: 05/20/2024] [Accepted: 05/31/2024] [Indexed: 06/11/2024] Open
Abstract
Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play important roles in essential biological processes. To facilitate functional annotation and accurate prediction of different types of NABPs, many machine learning-based computational approaches have been developed. However, the datasets used for training and testing as well as the prediction scopes in these studies have limited their applications. In this paper, we developed new strategies to overcome these limitations by generating more accurate and robust datasets and developing deep learning-based methods including both hierarchical and multi-class approaches to predict the types of NABPs for any given protein. The deep learning models employ two layers of convolutional neural network and one layer of long short-term memory. Our approaches outperform existing DBP and RBP predictors with a balanced prediction between DBPs and RBPs, and are more practically useful in identifying novel NABPs. The multi-class approach greatly improves the prediction accuracy of DBPs and RBPs, especially for the DBPs with ~12% improvement. Moreover, we explored the prediction accuracy of single-stranded DNA binding proteins and their effect on the overall prediction accuracy of NABP predictions.
Collapse
Affiliation(s)
- Siwen Wu
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, United States
| | - Jun-tao Guo
- Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC 28223, United States
| |
Collapse
|
8
|
Jia P, Zhang F, Wu C, Li M. A comprehensive review of protein-centric predictors for biomolecular interactions: from proteins to nucleic acids and beyond. Brief Bioinform 2024; 25:bbae162. [PMID: 38739759 PMCID: PMC11089422 DOI: 10.1093/bib/bbae162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/01/2024] [Revised: 02/17/2024] [Accepted: 03/31/2024] [Indexed: 05/16/2024] Open
Abstract
Proteins interact with diverse ligands to perform a large number of biological functions, such as gene expression and signal transduction. Accurate identification of these protein-ligand interactions is crucial to the understanding of molecular mechanisms and the development of new drugs. However, traditional biological experiments are time-consuming and expensive. With the development of high-throughput technologies, an increasing amount of protein data is available. In the past decades, many computational methods have been developed to predict protein-ligand interactions. Here, we review a comprehensive set of over 160 protein-ligand interaction predictors, which cover protein-protein, protein-nucleic acid, protein-peptide and protein-other ligands (nucleotide, heme, ion) interactions. We have carried out a comprehensive analysis of the above four types of predictors from several significant perspectives, including their inputs, feature profiles, models, availability, etc. The current methods primarily rely on protein sequences, especially utilizing evolutionary information. The significant improvement in predictions is attributed to deep learning methods. Additionally, sequence-based pretrained models and structure-based approaches are emerging as new trends.
Collapse
Affiliation(s)
- Pengzhen Jia
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
| | - Fuhao Zhang
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
- College of Information Engineering, Northwest A&F University, No. 3 Taicheng Road, Yangling, Shaanxi 712100, China
| | - Chaojin Wu
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
| | - Min Li
- School of Computer Science and Engineering, Central South University, 932 Lushan Road(S), Changsha 410083, China
| |
Collapse
|
9
|
Pradhan UK, Meher PK, Naha S, Pal S, Gupta S, Gupta A, Parsad R. RBPLight: a computational tool for discovery of plant-specific RNA-binding proteins using light gradient boosting machine and ensemble of evolutionary features. Brief Funct Genomics 2023; 22:401-410. [PMID: 37158175 DOI: 10.1093/bfgp/elad016] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/05/2022] [Revised: 04/12/2023] [Accepted: 04/21/2023] [Indexed: 05/10/2023] Open
Abstract
RNA-binding proteins (RBPs) are essential for post-transcriptional gene regulation in eukaryotes, including splicing control, mRNA transport and decay. Thus, accurate identification of RBPs is important to understand gene expression and regulation of cell state. In order to detect RBPs, a number of computational models have been developed. These methods made use of datasets from several eukaryotic species, specifically from mice and humans. Although some models have been tested on Arabidopsis, these techniques fall short of correctly identifying RBPs for other plant species. Therefore, the development of a powerful computational model for identifying plant-specific RBPs is needed. In this study, we presented a novel computational model for locating RBPs in plants. Five deep learning models and ten shallow learning algorithms were utilized for prediction with 20 sequence-derived and 20 evolutionary feature sets. The highest repeated five-fold cross-validation accuracy, 91.24% AU-ROC and 91.91% AU-PRC, was achieved by light gradient boosting machine. While evaluated using an independent dataset, the developed approach achieved 94.00% AU-ROC and 94.50% AU-PRC. The proposed model achieved significantly higher accuracy for predicting plant-specific RBPs as compared to the currently available state-of-art RBP prediction models. Despite the fact that certain models have already been trained and assessed on the model organism Arabidopsis, this is the first comprehensive computer model for the discovery of plant-specific RBPs. The web server RBPLight was also developed, which is publicly accessible at https://iasri-sg.icar.gov.in/rbplight/, for the convenience of researchers to identify RBPs in plants.
Collapse
Affiliation(s)
- Upendra K Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Soumen Pal
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sagar Gupta
- CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur (HP) 176061, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| |
Collapse
|
10
|
Akbari Rokn Abadi S, Tabatabaei S, Koohi S. KDeep: a new memory-efficient data extraction method for accurately predicting DNA/RNA transcription factor binding sites. J Transl Med 2023; 21:727. [PMID: 37845681 PMCID: PMC10580661 DOI: 10.1186/s12967-023-04593-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2023] [Accepted: 10/04/2023] [Indexed: 10/18/2023] Open
Abstract
This paper addresses the crucial task of identifying DNA/RNA binding sites, which has implications in drug/vaccine design, protein engineering, and cancer research. Existing methods utilize complex neural network structures, diverse input types, and machine learning techniques for feature extraction. However, the growing volume of sequences poses processing challenges. This study introduces KDeep, employing a CNN-LSTM architecture with a novel encoding method called 2Lk. 2Lk enhances prediction accuracy, reduces memory consumption by up to 84%, reduces trainable parameters, and improves interpretability by approximately 79% compared to state-of-the-art approaches. KDeep offers a promising solution for accurate and efficient binding site prediction.
Collapse
Affiliation(s)
| | | | - Somayyeh Koohi
- Department of Computer Engineering, Sharif University of Technology, Tehran, Iran.
| |
Collapse
|
11
|
Yan K, Feng J, Huang J, Wu H. iDRPro-SC: identifying DNA-binding proteins and RNA-binding proteins based on subfunction classifiers. Brief Bioinform 2023:bbad251. [PMID: 37405873 DOI: 10.1093/bib/bbad251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2023] [Revised: 06/10/2023] [Accepted: 06/12/2023] [Indexed: 07/07/2023] Open
Abstract
Nucleic acid-binding proteins are proteins that interact with DNA and RNA to regulate gene expression and transcriptional control. The pathogenesis of many human diseases is related to abnormal gene expression. Therefore, recognizing nucleic acid-binding proteins accurately and efficiently has important implications for disease research. To address this question, some scientists have proposed the method of using sequence information to identify nucleic acid-binding proteins. However, different types of nucleic acid-binding proteins have different subfunctions, and these methods ignore their internal differences, so the performance of the predictor can be further improved. In this study, we proposed a new method, called iDRPro-SC, to predict the type of nucleic acid-binding proteins based on the sequence information. iDRPro-SC considers the internal differences of nucleic acid-binding proteins and combines their subfunctions to build a complete dataset. Additionally, we used an ensemble learning to characterize and predict nucleic acid-binding proteins. The results of the test dataset showed that iDRPro-SC achieved the best prediction performance and was superior to the other existing nucleic acid-binding protein prediction methods. We have established a web server that can be accessed online: http://bliulab.net/iDRPro-SC.
Collapse
Affiliation(s)
- Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Jiawei Feng
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Jing Huang
- Huajian Yutong Technology (Beijing) Co., Ltd
- State Key Laboratory of Media Convergence Production Technology and Systems, Beijing China,100803
- Xinhua New Media Culture Communication Co., Ltd
| | - Hao Wu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| |
Collapse
|
12
|
Pradhan UK, Meher PK, Naha S, Pal S, Gupta A, Parsad R. PlDBPred: a novel computational model for discovery of DNA binding proteins in plants. Brief Bioinform 2023; 24:6840070. [PMID: 36416116 DOI: 10.1093/bib/bbac483] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2022] [Revised: 10/10/2022] [Accepted: 10/11/2022] [Indexed: 11/24/2022] Open
Abstract
DNA-binding proteins (DBPs) play crucial roles in numerous cellular processes including nucleotide recognition, transcriptional control and the regulation of gene expression. Majority of the existing computational techniques for identifying DBPs are mainly applicable to human and mouse datasets. Even though some models have been tested on Arabidopsis, they produce poor accuracy when applied to other plant species. Therefore, it is imperative to develop an effective computational model for predicting plant DBPs. In this study, we developed a comprehensive computational model for plant specific DBPs identification. Five shallow learning and six deep learning models were initially used for prediction, where shallow learning methods outperformed deep learning algorithms. In particular, support vector machine achieved highest repeated 5-fold cross-validation accuracy of 94.0% area under receiver operating characteristic curve (AUC-ROC) and 93.5% area under precision recall curve (AUC-PR). With an independent dataset, the developed approach secured 93.8% AUC-ROC and 94.6% AUC-PR. While compared with the state-of-art existing tools by using an independent dataset, the proposed model achieved much higher accuracy. Overall results suggest that the developed computational model is more efficient and reliable as compared to the existing models for the prediction of DBPs in plants. For the convenience of the majority of experimental scientists, the developed prediction server PlDBPred is publicly accessible at https://iasri-sg.icar.gov.in/pldbpred/.The source code is also provided at https://iasri-sg.icar.gov.in/pldbpred/source_code.php for prediction using a large-size dataset.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi-110012, India
| | - Soumen Pal
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi-110012, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi-110012, India
| |
Collapse
|
13
|
Du X, Hu J. Deep Multi-Label Joint Learning for RNA and DNA-Binding Proteins Prediction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:307-320. [PMID: 35148267 DOI: 10.1109/tcbb.2022.3150280] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
The recognition of DNA- (DBPs) and RNA-binding proteins (RBPs) is not only conducive to understanding cell function, but also a challenging task. Previous studies have shown that these proteins are usually considered separately due to different binding domains. In addition, due to the high similarity between DBPs and RBPs, it is possible for DBPs predictor to predict RBPs as DBPs, and vice versa, which leads to high cross-prediction rate. In this study, we creatively propose a novel deep multi-label joint learning framework to leverage the relationship between multiple labels and binding proteins. First, a multi-label variant network is designed to explore multi-scale context hidden information. Then, multi-label Long Short-Term Memory (multiLSTM) is used to mine the potential relationship between labels. Finally, the calibrated hidden features from variant network are considered for different levels of joint learning so that multiLSTM can better explore the correlation between them. Extensive experiments are also carried out to compare the proposed method with other existing methods. Furthermore, we also provide further insights into the importance of the relevant bioanalysis of proteins obtained from our model and summarize these binding proteins that are significantly related to a disease. Our method is freely available at http://39.108.90.186/dmlj.
Collapse
|
14
|
Wu Z, Basu S, Wu X, Kurgan L. qNABpredict: Quick, accurate, and taxonomy-aware sequence-based prediction of content of nucleic acid binding amino acids. Protein Sci 2023; 32:e4544. [PMID: 36519304 PMCID: PMC9798252 DOI: 10.1002/pro.4544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/16/2022] [Revised: 12/07/2022] [Accepted: 12/08/2022] [Indexed: 12/23/2022]
Abstract
Protein sequence-based predictors of nucleic acid (NA)-binding include methods that predict NA-binding proteins and NA-binding residues. The residue-level tools produce more details but suffer high computational cost since they must predict every amino acid in the input sequence and rely on multiple sequence alignments. We propose an alternative approach that predicts content (fraction) of the NA-binding residues, offering more information than the protein-level prediction and much shorter runtime than the residue-level tools. Our first-of-its-kind content predictor, qNABpredict, relies on a small, rationally designed and fast-to-compute feature set that represents relevant characteristics extracted from the input sequence and a well-parametrized support vector regression model. We provide two versions of qNABpredict, a taxonomy-agnostic model that can be used for proteins of unknown taxonomic origin and more accurate taxonomy-aware models that are tailored to specific taxonomic kingdoms: archaea, bacteria, eukaryota, and viruses. Empirical tests on a low-similarity test dataset show that qNABpredict is 100 times faster and generates statistically more accurate content predictions when compared to the content extracted from results produced by the residue-level predictors. We also show that qNABpredict's content predictions can be used to improve results generated by the residue-level predictors. We release qNABpredict as a convenient webserver and source code at http://biomine.cs.vcu.edu/servers/qNABpredict/. This new tool should be particularly useful to predict details of protein-NA interactions for large protein families and proteomes.
Collapse
Affiliation(s)
- Zhonghua Wu
- School of Mathematical Sciences and LPMCNankai UniversityTianjinChina
| | - Sushmita Basu
- Department of Computer ScienceVirginia Commonwealth UniversityRichmondVirginiaUSA
| | - Xuantai Wu
- School of Mathematical Sciences and LPMCNankai UniversityTianjinChina
| | - Lukasz Kurgan
- Department of Computer ScienceVirginia Commonwealth UniversityRichmondVirginiaUSA
| |
Collapse
|
15
|
Wang N, Zhang J, Liu B. iDRBP-EL: Identifying DNA- and RNA- Binding Proteins Based on Hierarchical Ensemble Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:432-441. [PMID: 34932484 DOI: 10.1109/tcbb.2021.3136905] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Identification of DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) from the primary sequences is essential for further exploring protein-nucleic acid interactions. Previous studies have shown that machine-learning-based methods can efficiently identify DBPs or RBPs. However, the information used in these methods is slightly unitary, and most of them only can predict DBPs or RBPs. In this study, we proposed a computational predictor iDRBP-EL to identify DNA- and RNA- binding proteins, and introduced hierarchical ensemble learning to integrate three level information. The method can integrate the information of different features, machine learning algorithms and data into one multi-label model. The ablation experiment showed that the fusion of different information can improve the prediction performance and overcome the cross-prediction problem. Experimental results on the independent datasets showed that iDRBP-EL outperformed all the other competing methods. Moreover, we established a user-friendly webserver iDRBP-EL (http://bliulab.net/iDRBP-EL), which can predict both DBPs and RBPs only based on protein sequences.
Collapse
|
16
|
Durge AR, Shrimankar DD, Sawarkar AD. Heuristic Analysis of Genomic Sequence Processing Models for High Efficiency Prediction: A Statistical Perspective. Curr Genomics 2022; 23:299-317. [PMID: 36778194 PMCID: PMC9878859 DOI: 10.2174/1389202923666220927105311] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/25/2022] [Revised: 08/29/2022] [Accepted: 09/01/2022] [Indexed: 11/22/2022] Open
Abstract
Genome sequences indicate a wide variety of characteristics, which include species and sub-species type, genotype, diseases, growth indicators, yield quality, etc. To analyze and study the characteristics of the genome sequences across different species, various deep learning models have been proposed by researchers, such as Convolutional Neural Networks (CNNs), Deep Belief Networks (DBNs), Multilayer Perceptrons (MLPs), etc., which vary in terms of evaluation performance, area of application and species that are processed. Due to a wide differentiation between the algorithmic implementations, it becomes difficult for research programmers to select the best possible genome processing model for their application. In order to facilitate this selection, the paper reviews a wide variety of such models and compares their performance in terms of accuracy, area of application, computational complexity, processing delay, precision and recall. Thus, in the present review, various deep learning and machine learning models have been presented that possess different accuracies for different applications. For multiple genomic data, Repeated Incremental Pruning to Produce Error Reduction with Support Vector Machine (Ripper SVM) outputs 99.7% of accuracy, and for cancer genomic data, it exhibits 99.27% of accuracy using the CNN Bayesian method. Whereas for Covid genome analysis, Bidirectional Long Short-Term Memory with CNN (BiLSTM CNN) exhibits the highest accuracy of 99.95%. A similar analysis of precision and recall of different models has been reviewed. Finally, this paper concludes with some interesting observations related to the genomic processing models and recommends applications for their efficient use.
Collapse
Affiliation(s)
- Aditi R. Durge
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India
| | - Deepti D. Shrimankar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India,Address correspondence to this author at the Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India; Tel: 9860606477; E-mail:
| | - Ankush D. Sawarkar
- Department of Computer Science and Engineering, Visvesvaraya National Institute of Technology (VNIT), Nagpur, India
| |
Collapse
|
17
|
Wang N, Zhang J, Liu B. IDRBP-PPCT: Identifying Nucleic Acid-Binding Proteins Based on Position-Specific Score Matrix and Position-Specific Frequency Matrix Cross Transformation. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:2284-2293. [PMID: 33780341 DOI: 10.1109/tcbb.2021.3069263] [Citation(s) in RCA: 20] [Impact Index Per Article: 6.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs) are two important nucleic acid-binding proteins (NABPs), which play important roles in biological processes such as replication, translation and transcription of genetic material. Some proteins (DRBPs) bind to both DNA and RNA, also play a key role in gene expression. Identification of DBPs, RBPs and DRBPs is important to study protein-nucleic acid interactions. Computational methods are increasingly being proposed to automatically identify DNA- or RNA-binding proteins based only on protein sequences. One challenge is to design an effective protein representation method to convert protein sequences into fixed-dimension feature vectors. In this study, we proposed a novel protein representation method called Position-Specific Scoring Matrix (PSSM) and Position-Specific Frequency Matrix (PSFM) Cross Transformation (PPCT) to represent protein sequences. This method contains the evolutionary information in PSSM and PSFM, and their correlations. A new computational predictor called IDRBP-PPCT was proposed by combining PPCT and the two-layer framework based on the random forest algorithm to identify DBPs, RBPs and DRBPs. The experimental results on the independent dataset and the tomato genome proved the effectiveness of the proposed method. A user-friendly web-server of IDRBP-PPCT was constructed, which is freely available at http://bliulab.net/IDRBP-PPCT.
Collapse
|
18
|
Nie W, Deng L. TSNAPred: predicting type-specific nucleic acid binding residues via an ensemble approach. Brief Bioinform 2022; 23:6618235. [PMID: 35753699 DOI: 10.1093/bib/bbac244] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/21/2022] [Revised: 05/22/2022] [Accepted: 05/24/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION The interplay between protein and nucleic acid participates in diverse biological activities. Accurately identifying the interaction between protein and nucleic acid can strengthen the understanding of protein function. However, conventional methods are too time-consuming, and computational methods are type-agnostic predictions. We proposed an ensemble predictor termed TSNAPred and first used it to identify residues that bind to A-DNA, B-DNA, ssDNA, mRNA, tRNA and rRNA. TSNAPred combines LightGBM and capsule network, both learned on the feature derived from protein sequence. TSNAPred utilizes the sliding window technique to extract long-distance dependencies between residues and a weighted ensemble strategy to enhance the prediction performance. The results show that TSNAPred can effectively identify type-specific nucleic acid binding residues in our test set. What is more, it also can discriminate DNA-binding and RNA-binding residues, which has improved 5% to 10% on the AUC value compared with other state-of-the-art methods. The dataset and code of TSNAPred are available at: https://github.com/niewenjuan-csu/TSNAPred.
Collapse
Affiliation(s)
- Wenjuan Nie
- School of Computer Science and Engineering, Central South University, 410075, Changsha, China
| | - Lei Deng
- School of Computer Science and Engineering, Central South University, 410075, Changsha, China
| |
Collapse
|
19
|
An Effective Deep Learning-Based Architecture for Prediction of N7-Methylguanosine Sites in Health Systems. ELECTRONICS 2022. [DOI: 10.3390/electronics11121917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
N7-methylguanosine (m7G) is one of the most important epigenetic modifications found in rRNA, mRNA, and tRNA, and performs a promising role in gene expression regulation. Owing to its significance, well-equipped traditional laboratory-based techniques have been performed for the identification of N7-methylguanosine (m7G). Consequently, these approaches were found to be time-consuming and cost-ineffective. To move on from these traditional approaches to predict N7-methylguanosine sites with high precision, the concept of artificial intelligence has been adopted. In this study, an intelligent computational model called N7-methylguanosine-Long short-term memory (m7G-LSTM) is introduced for the prediction of N7-methylguanosine sites. One-hot encoding and word2vec feature schemes are used to express the biological sequences while the LSTM and CNN algorithms have been employed for classification. The proposed “m7G-LSTM” model obtained an accuracy value of 95.95%, a specificity value of 95.94%, a sensitivity value of 95.97%, and Matthew’s correlation coefficient (MCC) value of 0.919. The proposed predictive m7G-LSTM model has significantly achieved better outcomes than previous models in terms of all evaluation parameters. The proposed m7G-LSTM computational system aims to support the drug industry and help researchers in the fields of bioinformatics to enhance innovation for the prediction of the behavior of N7-methylguanosine sites.
Collapse
|
20
|
Yan J, Jiang T, Liu J, Lu Y, Guan S, Li H, Wu H, Ding Y. DNA-binding protein prediction based on deep transfer learning. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2022; 19:7719-7736. [PMID: 35801442 DOI: 10.3934/mbe.2022362] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
The study of DNA binding proteins (DBPs) is of great importance in the biomedical field and plays a key role in this field. At present, many researchers are working on the prediction and detection of DBPs. Traditional DBP prediction mainly uses machine learning methods. Although these methods can obtain relatively high pre-diction accuracy, they consume large quantities of human effort and material resources. Transfer learning has certain advantages in dealing with such prediction problems. Therefore, in the present study, two features were extracted from a protein sequence, a transfer learning method was used, and two classical transfer learning algorithms were compared to transfer samples and construct data sets. In the final step, DBPs are detected by building a deep learning neural network model in a way that uses attention mechanisms.
Collapse
Affiliation(s)
- Jun Yan
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Tengsheng Jiang
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Junkai Liu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Yaoyao Lu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Shixuan Guan
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Haiou Li
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Hongjie Wu
- College of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
- Suzhou Smart City Research Institute, Suzhou University of Science and Technology, Suzhou, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, China
| |
Collapse
|
21
|
Zhang J, Yan K, Chen Q, Liu B. PreRBP-TL: prediction of species-specific RNA-binding proteins based on transfer learning. Bioinformatics 2022; 38:2135-2143. [PMID: 35176130 DOI: 10.1093/bioinformatics/btac106] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/22/2021] [Revised: 11/18/2021] [Accepted: 02/15/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION RNA-binding proteins (RBPs) play crucial roles in post-transcriptional regulation. Accurate identification of RBPs helps to understand gene expression, regulation, etc. In recent years, some computational methods were proposed to identify RBPs. However, these methods fail to accurately identify RBPs from some specific species with limited data, such as bacteria. RESULTS In this study, we introduce a computational method called PreRBP-TL for identifying species-specific RBPs based on transfer learning. The weights of the prediction model were initialized by pretraining with the large general RBP dataset and then fine-tuned with the small species-specific RPB dataset by using transfer learning. The experimental results show that the PreRBP-TL achieves better performance for identifying the species-specific RBPs from Human, Arabidopsis, Escherichia coli and Salmonella, outperforming eight state-of-the-art computational methods. It is anticipated PreRBP-TL will become a useful method for identifying RBPs. AVAILABILITY AND IMPLEMENTATION For the convenience of researchers to identify RBPs, the web server of PreRBP-TL was established, freely available at http://bliulab.net/PreRBP-TL. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jun Zhang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Ke Yan
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China
| | - Qingcai Chen
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China.,School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China.,Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
| |
Collapse
|
22
|
Cui F, Li S, Zhang Z, Sui M, Cao C, El-Latif Hesham A, Zou Q. DeepMC-iNABP: Deep learning for multiclass identification and classification of nucleic acid-binding proteins. Comput Struct Biotechnol J 2022; 20:2020-2028. [PMID: 35521556 PMCID: PMC9065708 DOI: 10.1016/j.csbj.2022.04.029] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2022] [Revised: 04/06/2022] [Accepted: 04/20/2022] [Indexed: 11/29/2022] Open
Abstract
Nucleic acid-binding proteins (NABPs), including DNA-binding proteins (DBPs) and RNA-binding proteins (RBPs), play vital roles in gene expression. Accurate identification of these proteins is crucial. However, there are two existing challenges: one is the problem of ignoring DNA- and RNA-binding proteins (DRBPs), and the other is a cross-predicting problem referring to DBP predictors predicting DBPs as RBPs, and vice versa. In this study, we proposed a computational predictor, called DeepMC-iNABP, with the goal of solving these difficulties by utilizing a multiclass classification strategy and deep learning approaches. DBPs, RBPs, DRBPs and non-NABPs as separate classes of data were used for training the DeepMC-iNABP model. The results on test data collected in this study and two independent test datasets showed that DeepMC-iNABP has a strong advantage in identifying the DRBPs and has the ability to alleviate the cross-prediction problem to a certain extent. The web-server of DeepMC-iNABP is freely available at http://www.deepmc-inabp.net/. The datasets used in this research can also be downloaded from the website.
Collapse
Affiliation(s)
- Feifei Cui
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Shuang Li
- Beidahuang Industry Group General Hospital, Harbin 150001, China
| | - Zilong Zhang
- School of Computer Science and Technology, Hainan University, Haikou 570228, China
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
| | - Miaomiao Sui
- Graduate School Agricultural and Life Science, The University of Tokyo, Tokyo 1138657, Japan
| | - Chen Cao
- School of Biomedical Engineering and Informatics, Nanjing Medical University, Nanjing 211166, China
| | - Abd El-Latif Hesham
- Genetics Department, Faculty of Agriculture, Beni-Suef University, Beni-Suef 62511, Egypt
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou 324000, China
- Corresponding author at: Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China.
| |
Collapse
|
23
|
Niu M, Wu J, Zou Q, Liu Z, Xu L. rBPDL:Predicting RNA-Binding Proteins Using Deep Learning. IEEE J Biomed Health Inform 2021; 25:3668-3676. [PMID: 33780344 DOI: 10.1109/jbhi.2021.3069259] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
Abstract
RNA-binding protein (RBP) is a powerful and wide-ranging regulator that plays an important role in cell development, differentiation, metabolism, health and disease. The prediction of RBPs provides valuable guidance for biologists. Although experimental methods have made great progress in predicting RBP, they are time-consuming and not flexible. Therefore, we developed a network model, rBPDL, by combining a convolutional neural network and long short-term memory for multilabel classification of RBPs. Moreover, to achieve better prediction results, we used a voting algorithm for ensemble learning of the model. We compared rBPDL with state-of-the-art methods and found that rBPDL significantly improved identification performance for the RBP68 dataset, with a macro-Area Under Curve (AUC), micro-AUC, and weighted AUC of 0.936, 0.962, and 0.946, respectively. Furthermore, through AUC statistical analysis of the RBP domain, we analyzed the performance of rBPDL and found that the RBP identification performance in the same domain was similar. In addition, we analyzed the performance preferences and physicochemical properties of the binding protein amino acids and explored the characteristics that affect the binding by using the RBP86 dataset.
Collapse
|
24
|
Wei H, Liao Q, Liu B. iLncRNAdis-FB: Identify lncRNA-Disease Associations by Fusing Biological Feature Blocks Through Deep Neural Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1946-1957. [PMID: 31905146 DOI: 10.1109/tcbb.2020.2964221] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
Identification of lncRNA-disease associations is not only important for exploring the disease mechanism, but will also facilitate the molecular targeting drug discovery. Fusing multiple biological information is able to generate a more comprehensive view of lncRNA-disease association feature. However, the existing fusion strategies in this field fail to remove the noisy and irrelevant information from each data source. As a result, their predictive performance is still too low to be applied to real world applications. In this regard, a novel computational predictor called iLncRNAdis-FB is proposed based on the Convolution Neural Network (CNN) to integrate different data sources by using the feature blocks in a supervised manner. The lncRNA similarity matrix and disease similarity matrix are constructed, based on which the three-dimensional feature blocks are generated. These feature blocks are then fed into CNN to train the model so as to predict unknown lncRNA-disease associations. Experimental results show that iLncRNAdis-FB achieves better performance compared with other state-of-the-art predictors. Furthermore, a web server of iLncRNAdis-FB has been established at http://bliulab.net/iLncRNAdis-FB/, by which users can submit lncRNA sequences to detect their potential associated diseases.
Collapse
|
25
|
Min X, Lu F, Li C. Sequence-Based Deep Learning Frameworks on Enhancer-Promoter Interactions Prediction. Curr Pharm Des 2021; 27:1847-1855. [PMID: 33234095 DOI: 10.2174/1381612826666201124112710] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/15/2020] [Revised: 07/29/2020] [Accepted: 08/06/2020] [Indexed: 11/22/2022]
Abstract
Enhancer-promoter interactions (EPIs) in the human genome are of great significance to transcriptional regulation, which tightly controls gene expression. Identification of EPIs can help us better decipher gene regulation and understand disease mechanisms. However, experimental methods to identify EPIs are constrained by funds, time, and manpower, while computational methods using DNA sequences and genomic features are viable alternatives. Deep learning methods have shown promising prospects in classification and efforts that have been utilized to identify EPIs. In this survey, we specifically focus on sequence-based deep learning methods and conduct a comprehensive review of the literature. First, we briefly introduce existing sequence- based frameworks on EPIs prediction and their technique details. After that, we elaborate on the dataset, pre-processing means, and evaluation strategies. Finally, we concluded with the challenges these methods are confronted with and suggest several future opportunities. We hope this review will provide a useful reference for further studies on enhancer-promoter interactions.
Collapse
Affiliation(s)
- Xiaoping Min
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Fengqing Lu
- School of Informatics, Xiamen University, Xiamen 361005, China
| | - Chunyan Li
- Graduate School, Yunnan Minzu University, Kunming 650504, China
| |
Collapse
|
26
|
Abstract
Background:
Bioluminescence is a unique and significant phenomenon in nature.
Bioluminescence is important for the lifecycle of some organisms and is valuable in biomedical
research, including for gene expression analysis and bioluminescence imaging technology. In recent
years, researchers have identified a number of methods for predicting bioluminescent proteins
(BLPs), which have increased in accuracy, but could be further improved.
Method:
In this study, a new bioluminescent proteins prediction method, based on a voting
algorithm, is proposed. Four methods of feature extraction based on the amino acid sequence were
used. 314 dimensional features in total were extracted from amino acid composition,
physicochemical properties and k-spacer amino acid pair composition. In order to obtain the highest
MCC value to establish the optimal prediction model, a voting algorithm was then used to build the
model. To create the best performing model, the selection of base classifiers and vote counting rules
are discussed.
Results:
The proposed model achieved 93.4% accuracy, 93.4% sensitivity and
91.7% specificity in the test set, which was better than any other method. A previous prediction of
bioluminescent proteins in three lineages was also improved using the model building method,
resulting in greatly improved accuracy.
Collapse
Affiliation(s)
- Shulin Zhao
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
| | - Ying Ju
- School of Informatics, Xiamen University, Xiamen, China
| | - Xiucai Ye
- Department of Computer Science, University of Tsukuba, Tsukuba Science City, Japan
| | - Jun Zhang
- Rehabilitation Department, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Shuguang Han
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
Collapse
|
27
|
Wei H, Ding Y, Liu B. iPiDA-sHN: Identification of Piwi-interacting RNA-disease associations by selecting high quality negative samples. Comput Biol Chem 2020; 88:107361. [PMID: 32916452 DOI: 10.1016/j.compbiolchem.2020.107361] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2020] [Revised: 07/31/2020] [Accepted: 08/15/2020] [Indexed: 12/31/2022]
Abstract
As a large group of small non-coding RNAs (ncRNAs), Piwi-interacting RNAs (piRNAs) have been detected to be associated with various diseases. Identifying disease associated piRNAs can provide promising candidate molecular targets to promote the drug design. Although, a few computational ensemble methods have been developed for identifying piRNA-disease associations, the low-quality negative associations even with positive associations used during the training process prevent the predictive performance improvement. In this study, we proposed a new computational predictor named iPiDA-sHN to predict potential piRNA-disease associations. iPiDA-sHN presented the piRNA-disease pairs by incorporating piRNA sequence information, the known piRNA-disease association network, and the disease semantic graph. High-level features of piRNA-disease associations were extracted by the Convolutional Neural Network (CNN). Two-step positive-unlabeled learning strategy based on Support Vector Machine (SVM) was employed to select the high quality negative samples from the unknown piRNA-disease pairs. Finally, the SVM predictor trained with the known piRNA-disease associations and the high quality negative associations was used to predict new piRNA-disease associations. The experimental results showed that iPiDA-sHN achieved superior predictive ability compared with other state-of-the-art predictors.
Collapse
Affiliation(s)
- Hang Wei
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China.
| | - Yuxin Ding
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China.
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China; School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China; Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China.
| |
Collapse
|
28
|
Wang C, Zhang Y, Han S. Its2vec: Fungal Species Identification Using Sequence Embedding and Random Forest Classification. BIOMED RESEARCH INTERNATIONAL 2020; 2020:2468789. [PMID: 32566672 PMCID: PMC7275950 DOI: 10.1155/2020/2468789] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/03/2020] [Revised: 03/20/2020] [Accepted: 03/25/2020] [Indexed: 12/19/2022]
Abstract
Fungi play essential roles in many ecological processes, and taxonomic classification is fundamental for microbial community characterization and vital for the study and preservation of fungal biodiversity. To cope with massive fungal barcode data, tools that can implement extensive volumes of barcode sequences, especially the internal transcribed spacer (ITS) region, are necessary. However, high variation in the ITS region and computational requirements for processing high-dimensional features remain challenging for existing predictors. In this study, we developed Its2vec, a bioinformatics tool for the classification of fungal ITS barcodes to the species level. An ITS database covering more than 25,000 species in a broad range of fungal taxa was assembled. For dimensionality reduction, a word embedding algorithm was used to represent an ITS sequence as a dense low-dimensional vector. A random forest-based classifier was built for species identification. Benchmarking results showed that our model achieved an accuracy comparable to that of several state-of-the-art predictors, and more importantly, it could implement large datasets and greatly reduce dimensionality. We expect the Its2vec model to be helpful for fungal species identification and, thus, for revealing microbial community structures and in deepening our understanding of their functional mechanisms.
Collapse
Affiliation(s)
- Chao Wang
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu 610054, China
| | - Ying Zhang
- Department of Pharmacy, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin 150088, China
| | - Shuguang Han
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu 60054, China
| |
Collapse
|
29
|
Zeng R, Liao M. Developing a Multi-Layer Deep Learning Based Predictive Model to Identify DNA N4-Methylcytosine Modifications. Front Bioeng Biotechnol 2020; 8:274. [PMID: 32373597 PMCID: PMC7186498 DOI: 10.3389/fbioe.2020.00274] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/15/2020] [Accepted: 03/16/2020] [Indexed: 12/21/2022] Open
Abstract
DNA N4-methylcytosine modification (4mC) plays an essential role in a variety of biological processes. Therefore, accurate identification the 4mC distribution in genome-scale is important for systematically understanding its biological functions. In this study, we present Deep4mcPred, a multi-layer deep learning based predictive model to identify DNA N4-methylcytosine modifications. In this predictor, we for the first time integrate residual network and recurrent neural network to build a multi-layer deep learning predictive system. As compared to existing predictors using traditional machine learning, our proposed method has two advantages. First, our deep learning framework does not need to specify the features when training the predictive model. It can automatically learn the high-level features and capture the characteristic specificity of 4mC sites, benefiting to distinguish true 4mC sites from non-4mC sites. On the other hand, our deep learning method outperforms the traditional machine learning predictors in performance by benchmarking comparison, demonstrating that the proposed Deep4mcPred is more effective in the DNA 4mC site prediction. Moreover, via experimental comparison, we found that attention mechanism introduced into the deep learning framework is useful to capture the critical features. Additionally, we develop a webserver implementing the proposed method for the academic use of research community, which is now available at http://server.malab.cn/Deep4mcPred.
Collapse
Affiliation(s)
- Rao Zeng
- Department of Software Engineering, School of Informatics, Xiamen University, Xiamen, China
| | - Minghong Liao
- Department of Software Engineering, School of Informatics, Xiamen University, Xiamen, China
| |
Collapse
|
30
|
Meng C, Hu Y, Zhang Y, Guo F. PSBP-SVM: A Machine Learning-Based Computational Identifier for Predicting Polystyrene Binding Peptides. Front Bioeng Biotechnol 2020; 8:245. [PMID: 32296690 PMCID: PMC7137786 DOI: 10.3389/fbioe.2020.00245] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2020] [Accepted: 03/09/2020] [Indexed: 12/11/2022] Open
Abstract
Polystyrene binding peptides (PSBPs) play a key role in the immobilization process. The correct identification of PSBPs is the first step of all related works. In this paper, we proposed a novel support vector machine-based bioinformatic identification model. This model contains four machine learning steps, including feature extraction, feature selection, model training and optimization. In a five-fold cross validation test, this model achieves 90.38, 84.62, 87.50, and 0.90% SN, SP, ACC, and AUC, respectively. The performance of this model outperforms the state-of-the-art identifier in terms of the SN and ACC with a smaller feature set. Furthermore, we constructed a web server that includes the proposed model, which is freely accessible at http://server.malab.cn/PSBP-SVM/index.jsp.
Collapse
Affiliation(s)
- Chaolu Meng
- College of Intelligence and Computing, Tianjin University, Tianjin, China.,College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, China
| | - Yang Hu
- School of Life Sciences and Technology, Harbin Institute of Technology, Harbin, China
| | - Ying Zhang
- Department of Pharmacy, Heilongjiang Province Land Reclamation Headquarters General Hospital, Harbin, China
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| |
Collapse
|
31
|
Liu Y, Wang X, Liu B. RFPR-IDP: reduce the false positive rates for intrinsically disordered protein and region prediction by incorporating both fully ordered proteins and disordered proteins. Brief Bioinform 2020; 22:2000-2011. [PMID: 32112084 PMCID: PMC7986600 DOI: 10.1093/bib/bbaa018] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
As an important type of proteins, intrinsically disordered proteins/regions (IDPs/IDRs) are related to many crucial biological functions. Accurate prediction of IDPs/IDRs is beneficial to the prediction of protein structures and functions. Most of the existing methods ignore the fully ordered proteins without IDRs during training and test processes. As a result, the corresponding predictors prefer to predict the fully ordered proteins as disordered proteins. Unfortunately, these methods were only evaluated on datasets consisting of disordered proteins without or with only a few fully ordered proteins, and therefore, this problem escapes the attention of the researchers. However, most of the newly sequenced proteins are fully ordered proteins in nature. These predictors fail to accurately predict the ordered and disordered proteins in real-world applications. In this regard, we propose a new method called RFPR-IDP trained with both fully ordered proteins and disordered proteins, which is constructed based on the combination of convolution neural network (CNN) and bidirectional long short-term memory (BiLSTM). The experimental results show that although the existing predictors perform well for predicting the disordered proteins, they tend to predict the fully ordered proteins as disordered proteins. In contrast, the RFPR-IDP predictor can correctly predict the fully ordered proteins and outperform the other 10 state-of-the-art methods when evaluated on a test dataset with both fully ordered proteins and disordered proteins. The web server and datasets of RFPR-IDP are freely available at http://bliulab.net/RFPR-IDP/server.
Collapse
Affiliation(s)
- Yumeng Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Xiaolong Wang
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China
| | - Bin Liu
- School of Computer Science and Technology, Harbin Institute of Technology, Shenzhen, Guangdong 518055, China.,School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081, China.,Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing 100081, China
| |
Collapse
|