1
|
Pradhan UK, Naha S, Das R, Gupta A, Parsad R, Meher PK. RBProkCNN: Deep learning on appropriate contextual evolutionary information for RNA binding protein discovery in prokaryotes. Comput Struct Biotechnol J 2024; 23:1631-1640. [PMID: 38660008 PMCID: PMC11039349 DOI: 10.1016/j.csbj.2024.04.034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2024] [Revised: 04/12/2024] [Accepted: 04/12/2024] [Indexed: 04/26/2024] Open
Abstract
RNA-binding proteins (RBPs) are central to key functions such as post-transcriptional regulation, mRNA stability, and adaptation to varied environmental conditions in prokaryotes. While the majority of research has concentrated on eukaryotic RBPs, recent developments underscore the crucial involvement of prokaryotic RBPs. Although computational methods have emerged in recent years to identify RBPs, they have fallen short in accurately identifying prokaryotic RBPs due to their generic nature. To bridge this gap, we introduce RBProkCNN, a novel machine learning-driven computational model meticulously designed for the accurate prediction of prokaryotic RBPs. The prediction process involves the utilization of eight shallow learning algorithms and four deep learning models, incorporating PSSM-based evolutionary features. By leveraging a convolutional neural network (CNN) and evolutionarily significant features selected through extreme gradient boosting variable importance measure, RBProkCNN achieved the highest accuracy in five-fold cross-validation, yielding 98.04% auROC and 98.19% auPRC. Furthermore, RBProkCNN demonstrated robust performance with an independent dataset, showcasing a commendable 95.77% auROC and 95.78% auPRC. Noteworthy is its superior predictive accuracy when compared to several state-of-the-art existing models. RBProkCNN is available as an online prediction tool (https://iasri-sg.icar.gov.in/rbprokcnn/), offering free access to interested users. This tool represents a substantial contribution, enriching the array of resources available for the accurate and efficient prediction of prokaryotic RBPs.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Ritwika Das
- Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| | - Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India
| |
Collapse
|
2
|
Pradhan UK, Behera P, Das R, Naha S, Gupta A, Parsad R, Pradhan SK, Meher PK. AScirRNA: A novel computational approach to discover abiotic stress-responsive circular RNAs in plant genome. Comput Biol Chem 2024; 113:108205. [PMID: 39265460 DOI: 10.1016/j.compbiolchem.2024.108205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 07/12/2024] [Accepted: 09/04/2024] [Indexed: 09/14/2024]
Abstract
In the realm of plant biology, understanding the intricate regulatory mechanisms governing stress responses stands as a pivotal pursuit. Circular RNAs (circRNAs), emerging as critical players in gene regulation, have garnered attention in recent days for their potential roles in abiotic stress adaptation. A comprehensive grasp of circRNAs' functions in stress response offers avenues for breeders to manipulating plants to develop abiotic stress resistant crop cultivars to thrive in challenging climates. This study pioneers a machine learning-based model for predicting abiotic stress-responsive circRNAs. The K-tuple nucleotide composition (KNC) and Pseudo KNC (PKNC) features were utilized to numerically represent circRNAs. Three different feature selection strategies were employed to select relevant and non-redundant features. Eight shallow and four deep learning algorithms were evaluated to build the final predictive model. Following five-fold cross-validation process, XGBoost learning algorithm demonstrated superior performance with LightGBM-chosen 260 KNC features (Accuracy: 74.55 %, auROC: 81.23 %, auPRC: 76.52 %) and 160 PKNC features (Accuracy: 74.32 %, auROC: 81.04 %, auPRC: 76.43 %), over other combinations of learning algorithms and feature selection techniques. Further, the robustness of the developed models were evaluated using an independent test dataset, where the overall accuracy, auROC and auPRC were found to be 73.13 %, 72.34 % and 72.68 % for KNC feature set and 73.52 %, 79.53 % and 73.09 % for PKNC feature set, respectively. This computational approach was also integrated into an online prediction tool, AScirRNA (https://iasri-sg.icar.gov.in/ascirna/) for easy prediction by the users. Both the proposed model and the developed tool are poised to augment ongoing efforts in identifying stress-responsive circRNAs in plants.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India.
| | - Prasanjit Behera
- Department of Bioinformatics, Odisha University of Agriculture & Technology, Bhubaneswar, Odisha 751003, India.
| | - Ritwika Das
- Division of Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India.
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India.
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India.
| | - Rajender Parsad
- ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India.
| | - Sukanta Kumar Pradhan
- Department of Bioinformatics, Odisha University of Agriculture & Technology, Bhubaneswar, Odisha 751003, India.
| | - Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi 110012, India.
| |
Collapse
|
3
|
Jyoti, Ritu, Gupta S, Shankar R. Comprehensive analysis of computational approaches in plant transcription factors binding regions discovery. Heliyon 2024; 10:e39140. [PMID: 39640721 PMCID: PMC11620080 DOI: 10.1016/j.heliyon.2024.e39140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2024] [Revised: 08/23/2024] [Accepted: 10/08/2024] [Indexed: 12/07/2024] Open
Abstract
Transcription factors (TFs) are regulatory proteins which bind to a specific DNA region known as the transcription factor binding regions (TFBRs) to regulate the rate of transcription process. The identification of TFBRs has been made possible by a number of experimental and computational techniques established during the past few years. The process of TFBR identification involves peak identification in the binding data, followed by the identification of motif characteristics. Using the same binding data attempts have been made to raise computational models to identify such binding regions which could save time and resources spent for binding experiments. These computational approaches depend a lot on what way they learn and how. These existing computational approaches are skewed heavily around human TFBRs discovery, while plants have drastically different genomic setup for regulation which these approaches have grossly ignored. Here, we provide a comprehensive study of the current state of the matters in plant specific TF discovery algorithms. While doing so, we encountered several software tools' issues rendering the tools not useable to researches. We fixed them and have also provided the corrected scripts for such tools. We expect this study to serve as a guide for better understanding of software tools' approaches for plant specific TFBRs discovery and the care to be taken while applying them, especially during cross-species applications. The corrected scripts of these software tools are made available at https://github.com/SCBB-LAB/Comparative-analysis-of-plant-TFBS-software.
Collapse
Affiliation(s)
- Jyoti
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC Supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, (HP), 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, 201002, India
| | - Ritu
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC Supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, (HP), 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, 201002, India
| | - Sagar Gupta
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC Supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, (HP), 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, 201002, India
| | - Ravi Shankar
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC Supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, (HP), 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh, 201002, India
| |
Collapse
|
4
|
Pradhan UK, Meher PK, Naha S, Das R, Gupta A, Parsad R. ProkDBP: Toward more precise identification of prokaryotic DNA binding proteins. Protein Sci 2024; 33:e5015. [PMID: 38747369 PMCID: PMC11094783 DOI: 10.1002/pro.5015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/08/2023] [Revised: 04/18/2024] [Accepted: 04/21/2024] [Indexed: 05/19/2024]
Abstract
Prokaryotic DNA binding proteins (DBPs) play pivotal roles in governing gene regulation, DNA replication, and various cellular functions. Accurate computational models for predicting prokaryotic DBPs hold immense promise in accelerating the discovery of novel proteins, fostering a deeper understanding of prokaryotic biology, and facilitating the development of therapeutics targeting for potential disease interventions. However, existing generic prediction models often exhibit lower accuracy in predicting prokaryotic DBPs. To address this gap, we introduce ProkDBP, a novel machine learning-driven computational model for prediction of prokaryotic DBPs. For prediction, a total of nine shallow learning algorithms and five deep learning models were utilized, with the shallow learning models demonstrating higher performance metrics compared to their deep learning counterparts. The light gradient boosting machine (LGBM), coupled with evolutionarily significant features selected via random forest variable importance measure (RF-VIM) yielded the highest five-fold cross-validation accuracy. The model achieved the highest auROC (0.9534) and auPRC (0.9575) among the 14 machine learning models evaluated. Additionally, ProkDBP demonstrated substantial performance with an independent dataset, exhibiting higher values of auROC (0.9332) and auPRC (0.9371). Notably, when benchmarked against several cutting-edge existing models, ProkDBP showcased superior predictive accuracy. Furthermore, to promote accessibility and usability, ProkDBP (https://iasri-sg.icar.gov.in/prokdbp/) is available as an online prediction tool, enabling free access to interested users. This tool stands as a significant contribution, enhancing the repertoire of resources for accurate and efficient prediction of prokaryotic DBPs.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical GeneticsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Prabina Kumar Meher
- Division of Statistical GeneticsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Sanchita Naha
- Division of Computer ApplicationsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Ritwika Das
- Division of Agricultural BioinformaticsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Ajit Gupta
- Division of Statistical GeneticsICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| | - Rajender Parsad
- ICAR‐Indian Agricultural Statistics Research Institute, PUSANew DelhiIndia
| |
Collapse
|
5
|
Gupta S, Kesarwani V, Bhati U, Jyoti, Shankar R. PTFSpot: deep co-learning on transcription factors and their binding regions attains impeccable universality in plants. Brief Bioinform 2024; 25:bbae324. [PMID: 39013383 PMCID: PMC11250369 DOI: 10.1093/bib/bbae324] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 06/07/2024] [Accepted: 06/19/2024] [Indexed: 07/18/2024] Open
Abstract
Unlike animals, variability in transcription factors (TFs) and their binding regions (TFBRs) across the plants species is a major problem that most of the existing TFBR finding software fail to tackle, rendering them hardly of any use. This limitation has resulted into underdevelopment of plant regulatory research and rampant use of Arabidopsis-like model species, generating misleading results. Here, we report a revolutionary transformers-based deep-learning approach, PTFSpot, which learns from TF structures and their binding regions' co-variability to bring a universal TF-DNA interaction model to detect TFBR with complete freedom from TF and species-specific models' limitations. During a series of extensive benchmarking studies over multiple experimentally validated data, it not only outperformed the existing software by >30% lead but also delivered consistently >90% accuracy even for those species and TF families that were never encountered during the model-building process. PTFSpot makes it possible now to accurately annotate TFBRs across any plant genome even in the total lack of any TF information, completely free from the bottlenecks of species and TF-specific models.
Collapse
Affiliation(s)
- Sagar Gupta
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh 201002, India
| | - Veerbhan Kesarwani
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh 201002, India
| | - Umesh Bhati
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh 201002, India
| | - Jyoti
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh 201002, India
| | - Ravi Shankar
- Studio of Computational Biology & Bioinformatics, The Himalayan Centre for High-throughput Computational Biology, (HiCHiCoB, A BIC supported by DBT, India), Biotechnology Division, CSIR-Institute of Himalayan Bioresource Technology (CSIR-IHBT), Palampur, Himachal Pradesh 176061, India
- Academy of Scientific and Innovative Research (AcSIR), Ghaziabad, Uttar Pradesh 201002, India
| |
Collapse
|
6
|
Horlacher M, Cantini G, Hesse J, Schinke P, Goedert N, Londhe S, Moyon L, Marsico A. A systematic benchmark of machine learning methods for protein-RNA interaction prediction. Brief Bioinform 2023; 24:bbad307. [PMID: 37635383 PMCID: PMC10516373 DOI: 10.1093/bib/bbad307] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2023] [Revised: 06/15/2023] [Accepted: 07/18/2023] [Indexed: 08/29/2023] Open
Abstract
RNA-binding proteins (RBPs) are central actors of RNA post-transcriptional regulation. Experiments to profile-binding sites of RBPs in vivo are limited to transcripts expressed in the experimental cell type, creating the need for computational methods to infer missing binding information. While numerous machine-learning based methods have been developed for this task, their use of heterogeneous training and evaluation datasets across different sets of RBPs and CLIP-seq protocols makes a direct comparison of their performance difficult. Here, we compile a set of 37 machine learning (primarily deep learning) methods for in vivo RBP-RNA interaction prediction and systematically benchmark a subset of 11 representative methods across hundreds of CLIP-seq datasets and RBPs. Using homogenized sample pre-processing and two negative-class sample generation strategies, we evaluate methods in terms of predictive performance and assess the impact of neural network architectures and input modalities on model performance. We believe that this study will not only enable researchers to choose the optimal prediction method for their tasks at hand, but also aid method developers in developing novel, high-performing methods by introducing a standardized framework for their evaluation.
Collapse
Affiliation(s)
- Marc Horlacher
- Computational Health Center, Helmholtz Center Munich, Germany
- School of Computation, Information and Technology, Technical University Munich (TUM), Germany
| | - Giulia Cantini
- Computational Health Center, Helmholtz Center Munich, Germany
| | - Julian Hesse
- Computational Health Center, Helmholtz Center Munich, Germany
| | - Patrick Schinke
- Computational Health Center, Helmholtz Center Munich, Germany
| | - Nicolas Goedert
- Computational Health Center, Helmholtz Center Munich, Germany
| | | | - Lambert Moyon
- Computational Health Center, Helmholtz Center Munich, Germany
| | | |
Collapse
|