1
|
Khan S, Uddin I, Noor S, AlQahtani SA, Ahmad N. N6-methyladenine identification using deep learning and discriminative feature integration. BMC Med Genomics 2025; 18:58. [PMID: 40158097 PMCID: PMC11955129 DOI: 10.1186/s12920-025-02131-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2024] [Accepted: 03/20/2025] [Indexed: 04/01/2025] Open
Abstract
N6-methyladenine (6 mA) is a pivotal DNA modification that plays a crucial role in epigenetic regulation, gene expression, and various biological processes. With advancements in sequencing technologies and computational biology, there is an increasing focus on developing accurate methods for 6 mA site identification to enhance early detection and understand its biological significance. Despite the rapid progress of machine learning in bioinformatics, accurately detecting 6 mA sites remains a challenge due to the limited generalizability and efficiency of existing approaches. In this study, we present Deep-N6mA, a novel Deep Neural Network (DNN) model incorporating optimal hybrid features for precise 6 mA site identification. The proposed framework captures complex patterns from DNA sequences through a comprehensive feature extraction process, leveraging k-mer, Dinucleotide-based Cross Covariance (DCC), Trinucleotide-based Auto Covariance (TAC), Pseudo Single Nucleotide Composition (PseSNC), Pseudo Dinucleotide Composition (PseDNC), and Pseudo Trinucleotide Composition (PseTNC). To optimize computational efficiency and eliminate irrelevant or noisy features, an unsupervised Principal Component Analysis (PCA) algorithm is employed, ensuring the selection of the most informative features. A multilayer DNN serves as the classification algorithm to identify N6-methyladenine sites accurately. The robustness and generalizability of Deep-N6mA were rigorously validated using fivefold cross-validation on two benchmark datasets. Experimental results reveal that Deep-N6mA achieves an average accuracy of 97.70% on the F. vesca dataset and 95.75% on the R. chinensis dataset, outperforming existing methods by 4.12% and 4.55%, respectively. These findings underscore the effectiveness of Deep-N6mA as a reliable tool for early 6 mA site detection, contributing to epigenetic research and advancing the field of computational biology.
Collapse
Affiliation(s)
- Salman Khan
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
| | - Islam Uddin
- Department of Computer Science, Abdul Wali Khan University, Mardan, Pakistan
| | - Sumaiya Noor
- Business and Management Sciences Department, Purdue University, West Lafayette, IN, USA
| | - Salman A AlQahtani
- Department of Computer Engineering, New Emerging Technologies and 5g Network and Beyond Research Chair, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
| | - Nijad Ahmad
- Department of Computer Science, Khurasan University, Jalalabad, Afghanistan.
| |
Collapse
|
2
|
Liu L, Tan Z, Wei Y, Sun Q. A multi-perspective deep learning framework for enhancer characterization and identification. Comput Biol Chem 2025; 114:108284. [PMID: 39577030 DOI: 10.1016/j.compbiolchem.2024.108284] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/02/2024] [Revised: 11/02/2024] [Accepted: 11/13/2024] [Indexed: 11/24/2024]
Abstract
Enhancers are vital elements in the genome that boost the transcriptional activity of neighboring genes and are essential in regulating cell-specific gene expression. Therefore, accurately identifying and characterizing enhancers is essential for comprehending gene regulatory networks and the development of related diseases. This study introduces MPDL-Enhancer, a novel multi-perspective deep learning framework aimed at enhancer characterization and identification. In this study, enhancer sequences are encoded using the dna2vec model along with features derived from the structural properties of DNA sequences. Subsequently, these representations are processed through a novel dual-scale deep neural network designed to discern subtle correlations and extended interactions embedded within the semantic content of DNA. The predictive phase of our methodology employs a Support Vector Machine classifier to render the final classification. To rigorously assess the efficacy of our approach, a comprehensive evaluation was executed utilizing an independent test dataset, thereby substantiating the robustness and accuracy of our model. Our methodology demonstrated superior performance over existing computational techniques, with an accuracy (ACC) of 81.00 %, a sensitivity (SN) of 79.00 %, and specificity (SP) of 83.00 %. The innovative dual-scale deep neural network and the unique feature representation strategy contributed to this performance improvement. MPDL-Enhancer has effectively characterized enhancer sequences and achieved excellent predictive performance. Building upon this foundation, we conducted an interpretability analysis of the model, which can assist researchers in identifying key features and patterns that affect the functionality of enhancers, thereby promoting a deeper understanding of gene regulatory networks.
Collapse
Affiliation(s)
- Liwei Liu
- College of Science, Dalian Jiaotong University, Dalian 116028, China.
| | - Zhebin Tan
- College of Software, Dalian Jiaotong University, Dalian 116028, China
| | - Yuxiao Wei
- College of Software, Dalian Jiaotong University, Dalian 116028, China
| | - Qianhui Sun
- College of Software, Dalian Jiaotong University, Dalian 116028, China
| |
Collapse
|
3
|
Huang G, Xue T, Chen W, Huang L, Dai Q, Jiang J. SVM-LncRNAPro: An SVM-Based Method for Predicting Long Noncoding RNA Promoters. IET Syst Biol 2025; 19:e70013. [PMID: 40188358 PMCID: PMC11972283 DOI: 10.1049/syb2.70013] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2024] [Revised: 01/05/2025] [Accepted: 02/12/2025] [Indexed: 04/08/2025] Open
Abstract
Long non-coding RNAs (lncRNAs) are closely associated with the regulation of gene expression, whose promoters play a crucial role in comprehensively understanding lncRNA regulatory mechanisms, functions and their roles in diseases. Due to limitations of the current techniques, accurately identifying lncRNA promoters remains a challenge. To address this challenge, we propose a support vector machine (SVM)-based method for predicting lncRNA promoters, called SVM-LncRNAPro. This method uses position-specific trinucleotide propensity based on single-strand (PSTNPss) to encode the DNA sequences and employs an SVM as the learning algorithm. The SVM-LncRNAPro achieves state-of-the-art performance with reduced complexity. Additionally, experiments demonstrate that this method exhibits a strong generalisation ability. For the convenience of academic research, we have made the source code of SVM-LncRNAPro publicly available. Researchers can download the code and perform the prediction of the lncRNA promoter via the following link: https://github.com/TG0F7/Prom/tree/master.
Collapse
Affiliation(s)
- Guohua Huang
- College of Information Science and EngineeringShaoyang UniversityShaoyangChina
- Hunan Provincial Key Laboratory of Finance & Economics Big Data Science and TechnologyHunan University of Finance and EconomicsChangshaChina
| | - Taigan Xue
- College of Information Science and EngineeringShaoyang UniversityShaoyangChina
| | - Weihong Chen
- Hunan Provincial Key Laboratory of Finance & Economics Big Data Science and TechnologyHunan University of Finance and EconomicsChangshaChina
| | - Liangliang Huang
- Hunan Provincial Key Laboratory of Finance & Economics Big Data Science and TechnologyHunan University of Finance and EconomicsChangshaChina
| | - Qi Dai
- College of Life Science and MedicineZhejiang Sci‐Tech UniversityHangzhouChina
| | - JinYun Jiang
- College of Information Science and EngineeringShaoyang UniversityShaoyangChina
| |
Collapse
|
4
|
Noor S, Naseem A, Awan HH, Aslam W, Khan S, AlQahtani SA, Ahmad N. Deep-m5U: a deep learning-based approach for RNA 5-methyluridine modification prediction using optimized feature integration. BMC Bioinformatics 2024; 25:360. [PMID: 39563239 DOI: 10.1186/s12859-024-05978-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2024] [Accepted: 11/06/2024] [Indexed: 11/21/2024] Open
Abstract
BACKGROUND RNA 5-methyluridine (m5U) modifications play a crucial role in biological processes, making their accurate identification a key focus in computational biology. This paper introduces Deep-m5U, a robust predictor designed to enhance the prediction of m5U modifications. The proposed method, named Deep-m5U, utilizes a hybrid pseudo-K-tuple nucleotide composition (PseKNC) for sequence formulation, a Shapley Additive exPlanations (SHAP) algorithm for discriminant feature selection, and a deep neural network (DNN) as the classifier. RESULTS The model was evaluated using two benchmark datasets, i.e., Full Transcript and Mature mRNA. Deep-m5U achieved overall accuracies of 91.47% and 95.86% for the Full Transcript and Mature mRNA datasets with 10-fold cross-validation, and for independent samples, the model attained 92.94% and 95.17% accuracy. CONCLUSION Compared to existing models, Deep-m5U showed approximately 5.23% and 3.73% higher accuracy on the training data and 3.95% and 3.26% higher accuracy on independent samples for the Full Transcript and Mature mRNA datasets, respectively. The reliability and effectiveness of Deep-m5U make it a valuable tool for scientists and a potential asset in pharmaceutical design and research.
Collapse
Affiliation(s)
- Sumaiya Noor
- Business and Management Sciences Department, Purdue University, West Lafayette, IN, USA
| | - Afshan Naseem
- Institute of Oceanography and Environment (INOS), Universiti Malaysia Terengganu, 21030, Kuala Nerus, Terengganu, Malaysia
| | - Hamid Hussain Awan
- Department of Computer Science, Muslim Youth University, Islamabad, Pakistan
| | - Wasiq Aslam
- Department of Computer Science, Muslim Youth University, Islamabad, Pakistan
| | - Salman Khan
- New Emerging Technologies and 5G Network and Beyond Research Chair, Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
| | - Salman A AlQahtani
- New Emerging Technologies and 5G Network and Beyond Research Chair, Department of Computer Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, Saudi Arabia
| | - Nijad Ahmad
- Department of Computer Science, Khurasan University, Jalalabad, Afghanistan.
| |
Collapse
|
5
|
Wang X, Yang L, Wang R. mRCat: A Novel CatBoost Predictor for the Binary Classification of mRNA Subcellular Localization by Fusing Large Language Model Representation and Sequence Features. Biomolecules 2024; 14:767. [PMID: 39062481 PMCID: PMC11274395 DOI: 10.3390/biom14070767] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2024] [Revised: 06/23/2024] [Accepted: 06/25/2024] [Indexed: 07/28/2024] Open
Abstract
The subcellular localization of messenger RNAs (mRNAs) is a pivotal aspect of biomolecules, tightly linked to gene regulation and protein synthesis, and offers innovative insights into disease diagnosis and drug development in the field of biomedicine. Several computational methods have been proposed to predict the subcellular localization of mRNAs within cells. However, there remains a deficiency in the accuracy of these predictions. In this study, we propose an mRCat predictor based on the gradient boosting tree algorithm specifically to predict whether mRNAs are localized in the nucleus or in the cytoplasm. This predictor firstly uses large language models to thoroughly explore hidden information within sequences and then integrates traditional sequence features to collectively characterize mRNA gene sequences. Finally, it employs CatBoost as the base classifier for predicting the subcellular localization of mRNAs. The experimental validation on an independent test set demonstrates that mRCat obtained accuracy of 0.761, F1 score of 0.710, MCC of 0.511, and AUROC of 0.751. The results indicate that our method has higher accuracy and robustness compared to other state-of-the-art methods. It is anticipated to offer deep insights for biomolecular research.
Collapse
Affiliation(s)
- Xiao Wang
- School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450002, China;
- Henan Provincial Key Laboratory of Data Intelligence for Food Safety, Zhengzhou University of Light Industry, Zhengzhou 450002, China
| | - Lixiang Yang
- School of Computer Science and Technology, Zhengzhou University of Light Industry, Zhengzhou 450002, China;
| | - Rong Wang
- School of Electronic Information, Zhengzhou University of Light Industry, Zhengzhou 450002, China;
| |
Collapse
|
6
|
Okay S. Fine-Tuning Gene Expression in Bacteria by Synthetic Promoters. Methods Mol Biol 2024; 2844:179-195. [PMID: 39068340 DOI: 10.1007/978-1-0716-4063-0_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/30/2024]
Abstract
Promoters are key genetic elements in the initiation and regulation of gene expression. A limited number of natural promoters has been described for the control of gene expression in synthetic biology applications. Therefore, synthetic promoters have been developed to fine-tune the transcription for the desired amount of gene product. Mostly, synthetic promoters are characterized using promoter libraries that are constructed via mutagenesis of promoter sequences. The strength of promoters in the library is determined according to the expression of a reporter gene such as gfp encoding green fluorescent protein. Gene expression can be controlled using inducers. The majority of the studies on gram-negative bacteria are conducted using the expression system of the model organism Escherichia coli while that of the model organism Bacillus subtilis is mostly used in the studies on gram-positive bacteria. Additionally, synthetic promoters for the cyanobacteria, which are phototrophic microorganisms, are evaluated, especially using the model cyanobacterium Synechocystis sp. PCC 6803. Moreover, a variety of algorithms based on machine learning methods were developed to characterize the features of promoter elements. Some of these in silico models were verified using in vitro or in vivo experiments. Identification of novel synthetic promoters with improved features compared to natural ones contributes much to the synthetic biology approaches in terms of fine-tuning gene expression.
Collapse
Affiliation(s)
- Sezer Okay
- Department of Vaccine Technology, Vaccine Institute, Hacettepe University, Ankara, Türkiye
| |
Collapse
|
7
|
Chen XG, Yang X, Li C, Lin X, Zhang W. Non-coding RNA identification with pseudo RNA sequences and feature representation learning. Comput Biol Med 2023; 165:107355. [PMID: 37639767 DOI: 10.1016/j.compbiomed.2023.107355] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2023] [Revised: 07/16/2023] [Accepted: 08/12/2023] [Indexed: 08/31/2023]
Abstract
Distinguishing non-coding RNAs (ncRNAs) from coding RNAs is very important in bioinformatics. Although many methods have been proposed for solving this task, it remains highly challenging to further improve the accuracy of ncRNA identification. In this paper, we propose a coding potential predictor using feature representation learning based on pseudo RNA sequences named CPPFLPS. In this method, we use the pseudo RNA sequences generated by simulating RNA sequence mutations as new samples for data augmentation, and six string operations simulating RNA sequence mutations are considered: base replacement, base insertion, base deletion, subsequence reversion, subsequence repetition and subsequence deletion. In the feature representation learning framework, different types of pseudo RNA sequences are added to the training set to form new training sets that can be used to train baseline classifiers, thus obtaining baseline models. The resulting labels of these baseline models are used as feature vectors to represent RNA sequences, and the resulting feature vectors acquired after feature selection are used to train a predictive model for distinguishing ncRNAs from coding RNAs. Our method achieves better performance compared with that of existing state-of-the-art methods. The implementation of the proposed method is available at https://github.com/chenxgscuec/CPPFLPS.
Collapse
Affiliation(s)
- Xian-Gan Chen
- School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science(South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China.
| | - Xiaofei Yang
- School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science(South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China.
| | - Chenhong Li
- School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science(South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China.
| | - Xianguang Lin
- School of Biomedical Engineering, South-Central Minzu University, Wuhan, 430074, China; Hubei Key Laboratory of Medical Information Analysis and Tumor Diagnosis & Treatment, South-Central Minzu University, Wuhan, 430074, China; Key Laboratory of Cognitive Science(South-Central Minzu University), State Ethnic Affairs Commission, Wuhan, 430074, China.
| | - Wen Zhang
- College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China.
| |
Collapse
|
8
|
Fan Y, Xiong H, Sun G. DeepASDPred: a CNN-LSTM-based deep learning method for Autism spectrum disorders risk RNA identification. BMC Bioinformatics 2023; 24:261. [PMID: 37349705 DOI: 10.1186/s12859-023-05378-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2023] [Accepted: 06/06/2023] [Indexed: 06/24/2023] Open
Abstract
BACKGROUND Autism spectrum disorders (ASD) are a group of neurodevelopmental disorders characterized by difficulty communicating with society and others, behavioral difficulties, and a brain that processes information differently than normal. Genetics has a strong impact on ASD associated with early onset and distinctive signs. Currently, all known ASD risk genes are able to encode proteins, and some de novo mutations disrupting protein-coding genes have been demonstrated to cause ASD. Next-generation sequencing technology enables high-throughput identification of ASD risk RNAs. However, these efforts are time-consuming and expensive, so an efficient computational model for ASD risk gene prediction is necessary. RESULTS In this study, we propose DeepASDPerd, a predictor for ASD risk RNA based on deep learning. Firstly, we use K-mer to feature encode the RNA transcript sequences, and then fuse them with corresponding gene expression values to construct a feature matrix. After combining chi-square test and logistic regression to select the best feature subset, we input them into a binary classification prediction model constructed by convolutional neural network and long short-term memory for training and classification. The results of the tenfold cross-validation proved our method outperformed the state-of-the-art methods. Dataset and source code are available at https://github.com/Onebear-X/DeepASDPred is freely available. CONCLUSIONS Our experimental results show that DeepASDPred has outstanding performance in identifying ASD risk RNA genes.
Collapse
Affiliation(s)
- Yongxian Fan
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
| | - Hui Xiong
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
| | - Guicong Sun
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China.
| |
Collapse
|
9
|
Li H, Liu B. BioSeq-Diabolo: Biological sequence similarity analysis using Diabolo. PLoS Comput Biol 2023; 19:e1011214. [PMID: 37339155 DOI: 10.1371/journal.pcbi.1011214] [Citation(s) in RCA: 57] [Impact Index Per Article: 28.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2022] [Accepted: 05/24/2023] [Indexed: 06/22/2023] Open
Abstract
As the key for biological sequence structure and function prediction, disease diagnosis and treatment, biological sequence similarity analysis has attracted more and more attentions. However, the exiting computational methods failed to accurately analyse the biological sequence similarities because of the various data types (DNA, RNA, protein, disease, etc) and their low sequence similarities (remote homology). Therefore, new concepts and techniques are desired to solve this challenging problem. Biological sequences (DNA, RNA and protein sequences) can be considered as the sentences of "the book of life", and their similarities can be considered as the biological language semantics (BLS). In this study, we are seeking the semantics analysis techniques derived from the natural language processing (NLP) to comprehensively and accurately analyse the biological sequence similarities. 27 semantics analysis methods derived from NLP were introduced to analyse biological sequence similarities, bringing new concepts and techniques to biological sequence similarity analysis. Experimental results show that these semantics analysis methods are able to facilitate the development of protein remote homology detection, circRNA-disease associations identification and protein function annotation, achieving better performance than the other state-of-the-art predictors in the related fields. Based on these semantics analysis methods, a platform called BioSeq-Diabolo has been constructed, which is named after a popular traditional sport in China. The users only need to input the embeddings of the biological sequence data. BioSeq-Diabolo will intelligently identify the task, and then accurately analyse the biological sequence similarities based on biological language semantics. BioSeq-Diabolo will integrate different biological sequence similarities in a supervised manner by using Learning to Rank (LTR), and the performance of the constructed methods will be evaluated and analysed so as to recommend the best methods for the users. The web server and stand-alone package of BioSeq-Diabolo can be accessed at http://bliulab.net/BioSeq-Diabolo/server/.
Collapse
Affiliation(s)
- Hongliang Li
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
| | - Bin Liu
- School of Computer Science and Technology, Beijing Institute of Technology, Beijing, China
- Advanced Research Institute of Multidisciplinary Science, Beijing Institute of Technology, Beijing, China
| |
Collapse
|
10
|
Pradhan UK, Meher PK, Naha S, Rao AR, Kumar U, Pal S, Gupta A. ASmiR: a machine learning framework for prediction of abiotic stress-specific miRNAs in plants. Funct Integr Genomics 2023; 23:92. [PMID: 36939943 DOI: 10.1007/s10142-023-01014-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2022] [Revised: 01/18/2023] [Accepted: 03/06/2023] [Indexed: 03/21/2023]
Abstract
Abiotic stresses have become a major challenge in recent years due to their pervasive nature and shocking impacts on plant growth, development, and quality. MicroRNAs (miRNAs) play a significant role in plant response to different abiotic stresses. Thus, identification of specific abiotic stress-responsive miRNAs holds immense importance in crop breeding programmes to develop cultivars resistant to abiotic stresses. In this study, we developed a machine learning-based computational model for prediction of miRNAs associated with four specific abiotic stresses such as cold, drought, heat and salt. The pseudo K-tuple nucleotide compositional features of Kmer size 1 to 5 were used to represent miRNAs in numeric form. Feature selection strategy was employed to select important features. With the selected feature sets, support vector machine (SVM) achieved the highest cross-validation accuracy in all four abiotic stress conditions. The highest cross-validated prediction accuracies in terms of area under precision-recall curve were found to be 90.15, 90.09, 87.71, and 89.25% for cold, drought, heat and salt respectively. Overall prediction accuracies for the independent dataset were respectively observed 84.57, 80.62, 80.38 and 82.78%, for the abiotic stresses. The SVM was also seen to outperform different deep learning models for prediction of abiotic stress-responsive miRNAs. To implement our method with ease, an online prediction server "ASmiR" has been established at https://iasri-sg.icar.gov.in/asmir/ . The proposed computational model and the developed prediction tool are believed to supplement the existing effort for identification of specific abiotic stress-responsive miRNAs in plants.
Collapse
Affiliation(s)
- Upendra Kumar Pradhan
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
| | - Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India.
| | - Sanchita Naha
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
| | | | - Upendra Kumar
- Department of Molecular Biology, Biotechnology and Bioinformatics, College of Basic Sciences and Humanities, CCS Haryana Agricultural University, Hisar, 125004, India
| | - Soumen Pal
- Division of Computer Applications, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
| | - Ajit Gupta
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, PUSA, New Delhi, 110012, India
| |
Collapse
|
11
|
Tang X, Zheng P, Liu Y, Yao Y, Huang G. LangMoDHS: A deep learning language model for predicting DNase I hypersensitive sites in mouse genome. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:1037-1057. [PMID: 36650801 DOI: 10.3934/mbe.2023048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/17/2023]
Abstract
DNase I hypersensitive sites (DHSs) are a specific genomic region, which is critical to detect or understand cis-regulatory elements. Although there are many methods developed to detect DHSs, there is a big gap in practice. We presented a deep learning-based language model for predicting DHSs, named LangMoDHS. The LangMoDHS mainly comprised the convolutional neural network (CNN), the bi-directional long short-term memory (Bi-LSTM) and the feed-forward attention. The CNN and the Bi-LSTM were stacked in a parallel manner, which was helpful to accumulate multiple-view representations from primary DNA sequences. We conducted 5-fold cross-validations and independent tests over 14 tissues and 4 developmental stages. The empirical experiments showed that the LangMoDHS is competitive with or slightly better than the iDHS-Deep, which is the latest method for predicting DHSs. The empirical experiments also implied substantial contribution of the CNN, Bi-LSTM, and attention to DHSs prediction. We implemented the LangMoDHS as a user-friendly web server which is accessible at http:/www.biolscience.cn/LangMoDHS/. We used indices related to information entropy to explore the sequence motif of DHSs. The analysis provided a certain insight into the DHSs.
Collapse
Affiliation(s)
- Xingyu Tang
- School of Electrical Engineering, Shaoyang University, Shaoyang 422000, China
| | - Peijie Zheng
- School of Electrical Engineering, Shaoyang University, Shaoyang 422000, China
| | - Yuewu Liu
- College of Information and Intelligence, Hunan Agricultural University, Changsha 410128, China
| | - Yuhua Yao
- School of Mathematics and Statistics, Hainan Normal University, Haikou 571158, China
| | - Guohua Huang
- School of Electrical Engineering, Shaoyang University, Shaoyang 422000, China
| |
Collapse
|
12
|
Ding Y, He W, Tang J, Zou Q, Guo F. Laplacian Regularized Sparse Representation Based Classifier for Identifying DNA N4-Methylcytosine Sites via L 2,1/2-Matrix Norm. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:500-511. [PMID: 34882559 DOI: 10.1109/tcbb.2021.3133309] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
N4-methylcytosine (4mC) is one of important epigenetic modifications in DNA sequences. Detecting 4mC sites is time-consuming. The computational method based on machine learning has provided effective help for identifying 4mC. To further improve the performance of prediction, we propose a Laplacian Regularized Sparse Representation based Classifier with L2,1/2-matrix norm (LapRSRC). We also utilize kernel trick to derive the kernel LapRSRC for nonlinear modeling. Matrix factorization technology is employed to solve the sparse representation coefficients of all test samples in the training set. And an efficient iterative algorithm is proposed to solve the objective function. We implement our model on six benchmark datasets of 4mC and eight UCI datasets to evaluate performance. The results show that the performance of our method is better or comparable.
Collapse
|
13
|
Shomali A, Vafaei Sadi MS, Bakhtiarizadeh MR, Aliniaeifard S, Trewavas A, Calvo P. Identification of intelligence-related proteins through a robust two-layer predictor. Commun Integr Biol 2022; 15:253-264. [DOI: 10.1080/19420889.2022.2143101] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/16/2022] Open
Affiliation(s)
- Aida Shomali
- Department of Horticulture, College of Aburaihan, University of Tehran, Tehran, Iran
| | | | | | - Sasan Aliniaeifard
- Department of Horticulture, College of Aburaihan, University of Tehran, Tehran, Iran
| | - Anthony Trewavas
- School of Biological Sciences, Institute of Molecular Plant Science, University of Edinburgh, UK
| | - Paco Calvo
- Minimal Intelligence Lab, University of Murcia, Spain
| |
Collapse
|
14
|
Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework. PLoS Comput Biol 2022; 18:e1010779. [PMID: 36520922 PMCID: PMC9836277 DOI: 10.1371/journal.pcbi.1010779] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2022] [Revised: 01/12/2023] [Accepted: 11/29/2022] [Indexed: 12/23/2022] Open
Abstract
Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene's transcriptional activity, which has a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have a specific sequence motifs or structures, and their scattered distribution in the genome makes the identification of enhancers from human cell lines particularly challenging. Here we present a novel, stacked multivariate fusion framework called SMFM, which enables a comprehensive identification and analysis of enhancers from regulatory DNA sequences as well as their interpretation. Specifically, to characterize the hierarchical relationships of enhancer sequences, multi-source biological information and dynamic semantic information are fused to represent regulatory DNA enhancer sequences. Then, we implement a deep learning-based sequence network to learn the feature representation of the enhancer sequences comprehensively and to extract the implicit relationships in the dynamic semantic information. Ultimately, an ensemble machine learning classifier is trained based on the refined multi-source features and dynamic implicit relations obtained from the deep learning-based sequence network. Benchmarking experiments demonstrated that SMFM significantly outperforms other existing methods using several evaluation metrics. In addition, an independent test set was used to validate the generalization performance of SMFM by comparing it to other state-of-the-art enhancer identification methods. Moreover, we performed motif analysis based on the contribution scores of different bases of enhancer sequences to the final identification results. Besides, we conducted interpretability analysis of the identified enhancer sequences based on attention weights of EnhancerBERT, a fine-tuned BERT model that provides new insights into exploring the gene semantic information likely to underlie the discovered enhancers in an interpretable manner. Finally, in a human placenta study with 4,562 active distal gene regulatory enhancers, SMFM successfully exposed tissue-related placental development and the differential mechanism, demonstrating the generalizability and stability of our proposed framework.
Collapse
|
15
|
Suleman MT, Alkhalifah T, Alturise F, Khan YD. DHU-Pred: accurate prediction of dihydrouridine sites using position and composition variant features on diverse classifiers. PeerJ 2022; 10:e14104. [PMID: 36320563 PMCID: PMC9618264 DOI: 10.7717/peerj.14104] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2022] [Accepted: 09/01/2022] [Indexed: 01/21/2023] Open
Abstract
Background Dihydrouridine (D) is a modified transfer RNA post-transcriptional modification (PTM) that occurs abundantly in bacteria, eukaryotes, and archaea. The D modification assists in the stability and conformational flexibility of tRNA. The D modification is also responsible for pulmonary carcinogenesis in humans. Objective For the detection of D sites, mass spectrometry and site-directed mutagenesis have been developed. However, both are labor-intensive and time-consuming methods. The availability of sequence data has provided the opportunity to build computational models for enhancing the identification of D sites. Based on the sequence data, the DHU-Pred model was proposed in this study to find possible D sites. Methodology The model was built by employing comprehensive machine learning and feature extraction approaches. It was then validated using in-demand evaluation metrics and rigorous experimentation and testing approaches. Results The DHU-Pred revealed an accuracy score of 96.9%, which was considerably higher compared to the existing D site predictors. Availability and Implementation A user-friendly web server for the proposed model was also developed and is freely available for the researchers.
Collapse
Affiliation(s)
- Muhammad Taseer Suleman
- Department of Computer Science, School of Systems and Technology, University of Management & Technology, Lahore, Pakistan
| | - Tamim Alkhalifah
- Department of Computer, College of Science and Arts in Ar Rass Qassim University, Ar Rass, Qassim, Saudi Arabia
| | - Fahad Alturise
- Department of Computer, College of Science and Arts in Ar Rass Qassim University, Ar Rass, Qassim, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management & Technology, Lahore, Pakistan
| |
Collapse
|
16
|
Butt AH, Alkhalifah T, Alturise F, Khan YD. A machine learning technique for identifying DNA enhancer regions utilizing CIS-regulatory element patterns. Sci Rep 2022; 12:15183. [PMID: 36071071 PMCID: PMC9452539 DOI: 10.1038/s41598-022-19099-3] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/28/2022] [Accepted: 08/24/2022] [Indexed: 11/26/2022] Open
Abstract
Enhancers regulate gene expression, by playing a crucial role in the synthesis of RNAs and proteins. They do not directly encode proteins or RNA molecules. In order to control gene expression, it is important to predict enhancers and their potency. Given their distance from the target gene, lack of common motifs, and tissue/cell specificity, enhancer regions are thought to be difficult to predict in DNA sequences. Recently, a number of bioinformatics tools were created to distinguish enhancers from other regulatory components and to pinpoint their advantages. However, because the quality of its prediction method needs to be improved, its practical application value must also be improved. Based on nucleotide composition and statistical moment-based features, the current study suggests a novel method for identifying enhancers and non-enhancers and evaluating their strength. The proposed study outperformed state-of-the-art techniques using fivefold and tenfold cross-validation in terms of accuracy. The accuracy from the current study results in 86.5% and 72.3% in enhancer site and its strength prediction respectively. The results of the suggested methodology point to the potential for more efficient and successful outcomes when statistical moment-based features are used. The current study's source code is available to the research community at https://github.com/csbioinfopk/enpred.
Collapse
Affiliation(s)
- Ahmad Hassan Butt
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| | - Tamim Alkhalifah
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass, Saudi Arabia.
| | - Fahad Alturise
- Department of Computer, College of Science and Arts in Ar Rass, Qassim University, Ar Rass, Saudi Arabia
| | - Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, Lahore, Pakistan
| |
Collapse
|
17
|
Wang M, Li F, Wu H, Liu Q, Li S. PredPromoter-MF(2L): A Novel Approach of Promoter Prediction Based on Multi-source Feature Fusion and Deep Forest. Interdiscip Sci 2022; 14:697-711. [PMID: 35488998 DOI: 10.1007/s12539-022-00520-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2021] [Revised: 04/05/2022] [Accepted: 04/05/2022] [Indexed: 12/12/2022]
Abstract
Promoters short DNA sequences play vital roles in initiating gene transcription. However, it remains a challenge to identify promoters using conventional experiment techniques in a high-throughput manner. To this end, several computational predictors based on machine learning models have been developed, while their performance is unsatisfactory. In this study, we proposed a novel two-layer predictor, called PredPromoter-MF(2L), based on multi-source feature fusion and ensemble learning. PredPromoter-MF(2L) was developed based on various deep features learned by a pre-trained deep learning network model and sequence-derived features. Feature selection based on XGBoost was applied to reduce fused features dimensions, and a cascade deep forest model was trained on the selected feature subset for promoter prediction. The results both fivefold cross-validation and independent test demonstrated that PredPromoter-MF(2L) outperformed state-of-the-art methods.
Collapse
Affiliation(s)
- Miao Wang
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shanxi, China
| | - Fuyi Li
- Department of Microbiology and Immunology, The Peter Doherty Institute for Infection and Immunity, The University of Melbourne, 792 Elizabeth Street, Melbourne, VIC, 3000, Australia
| | - Hao Wu
- School of Software, Shandong University, Jinan, 250100, Shandong, China
| | - Quanzhong Liu
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shanxi, China.
| | - Shuqin Li
- College of Information Engineering, Northwest A&F University, Yangling, 712100, Shanxi, China.
| |
Collapse
|
18
|
Zeng L, Liu Y, Yu ZG, Liu Y. iEnhancer-DLRA: identification of enhancers and their strengths by a self-attention fusion strategy for local and global features. Brief Funct Genomics 2022; 21:399-407. [PMID: 35942693 DOI: 10.1093/bfgp/elac023] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2022] [Revised: 06/30/2022] [Accepted: 07/12/2022] [Indexed: 11/14/2022] Open
Abstract
Identification and classification of enhancers are highly significant because they play crucial roles in controlling gene transcription. Recently, several deep learning-based methods for identifying enhancers and their strengths have been developed. However, existing methods are usually limited because they use only local or only global features. The combination of local and global features is critical to further improve the prediction performance. In this work, we propose a novel deep learning-based method, called iEnhancer-DLRA, to identify enhancers and their strengths. iEnhancer-DLRA extracts local and multi-scale global features of sequences by using a residual convolutional network and two bidirectional long short-term memory networks. Then, a self-attention fusion strategy is proposed to deeply integrate these local and global features. The experimental results on the independent test dataset indicate that iEnhancer-DLRA performs better than nine existing state-of-the-art methods in both identification and classification of enhancers in almost all metrics. iEnhancer-DLRA achieves 13.8% (for identifying enhancers) and 12.6% (for classifying strengths) improvement in accuracy compared with the best existing state-of-the-art method. This is the first time that the accuracy of an enhancer identifier exceeds 0.9 and the accuracy of the enhancer classifier exceeds 0.8 on the independent test set. Moreover, iEnhancer-DLRA achieves superior predictive performance on the rice dataset compared with the state-of-the-art method RiceENN.
Collapse
Affiliation(s)
- Li Zeng
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, 411105, Xiangtan, China
| | - Yang Liu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, 411105, Xiangtan, China
| | - Zu-Guo Yu
- Key Laboratory of Intelligent Computing and Information Processing of Ministry of Education, Xiangtan University, 411105, Xiangtan, China
| | - Yuansheng Liu
- College of Computer Science and Electronic Engineering, Hunan University, 2 Lushan S Rd, Yuelu District, 410086, Changsha, China
| |
Collapse
|
19
|
Hasan MM, Tsukiyama S, Cho JY, Kurata H, Alam MA, Liu X, Manavalan B, Deng HW. Deepm5C: A deep-learning-based hybrid framework for identifying human RNA N5-methylcytosine sites using a stacking strategy. Mol Ther 2022; 30:2856-2867. [PMID: 35526094 PMCID: PMC9372321 DOI: 10.1016/j.ymthe.2022.05.001] [Citation(s) in RCA: 54] [Impact Index Per Article: 18.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2021] [Revised: 04/25/2022] [Accepted: 05/03/2022] [Indexed: 11/30/2022] Open
Abstract
As one of the most prevalent post-transcriptional epigenetic modifications, N5-methylcytosine (m5C) plays an essential role in various cellular processes and disease pathogenesis. Therefore, it is important accurately identify m5C modifications in order to gain a deeper understanding of cellular processes and other possible functional mechanisms. Although a few computational methods have been proposed, their respective models have been developed using small training datasets. Hence, their practical application is quite limited in genome-wide detection. To overcome the existing limitations, we propose Deepm5C, a bioinformatics method for identifying RNA m5C sites throughout the human genome. To develop Deepm5C, we constructed a novel benchmarking dataset and investigated a mixture of three conventional feature-encoding algorithms and a feature derived from word-embedding approaches. Afterward, four variants of deep-learning classifiers and four commonly used conventional classifiers were employed and trained with the four encodings, ultimately obtaining 32 baseline models. A stacking strategy is effectively utilized by integrating the predicted output of the optimal baseline models and trained with a one-dimensional (1D) convolutional neural network. As a result, the Deepm5C predictor achieved excellent performance during cross-validation with a Matthews correlation coefficient and an accuracy of 0.697 and 0.855, respectively. The corresponding metrics during the independent test were 0.691 and 0.852, respectively. Overall, Deepm5C achieved a more accurate and stable performance than the baseline models and significantly outperformed the existing predictors, demonstrating the effectiveness of our proposed hybrid framework. Furthermore, Deepm5C is expected to assist community-wide efforts in identifying putative m5Cs and to formulate the novel testable biological hypothesis.
Collapse
Affiliation(s)
- Md Mehedi Hasan
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA.
| | - Sho Tsukiyama
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Jae Youl Cho
- Molecular Immunology Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Korea
| | - Hiroyuki Kurata
- Department of Bioscience and Bioinformatics, Kyushu Institute of Technology, 680-4 Kawazu, Iizuka, Fukuoka 820-8502, Japan
| | - Md Ashad Alam
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Xiaowen Liu
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA
| | - Balachandran Manavalan
- Computational Biology and Bioinformatics Laboratory, Department of Integrative Biotechnology, College of Biotechnology and Bioengineering, Sungkyunkwan University, Suwon 16419, Gyeonggi-do, Korea.
| | - Hong-Wen Deng
- Tulane Center for Biomedical Informatics and Genomics, Division of Biomedical Informatics and Genomics, John W. Deming Department of Medicine, School of Medicine, Tulane University, New Orleans, LA 70112, USA.
| |
Collapse
|
20
|
Fan Y, Peng B. StackEPI: identification of cell line-specific enhancer-promoter interactions based on stacking ensemble learning. BMC Bioinformatics 2022; 23:272. [PMID: 35820811 PMCID: PMC9277947 DOI: 10.1186/s12859-022-04821-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2022] [Accepted: 07/01/2022] [Indexed: 11/10/2022] Open
Abstract
Background Understanding the regulatory role of enhancer–promoter interactions (EPIs) on specific gene expression in cells contributes to the understanding of gene regulation, cell differentiation, etc., and its identification has been a challenging task. On the one hand, using traditional wet experimental methods to identify EPIs often means a lot of human labor and time costs. On the other hand, although the currently proposed computational methods have good recognition effects, they generally require a long training time. Results In this study, we studied the EPIs of six human cell lines and designed a cell line-specific EPIs prediction method based on a stacking ensemble learning strategy, which has better prediction performance and faster training speed, called StackEPI. Specifically, by combining different encoding schemes and machine learning methods, our prediction method can extract the cell line-specific effective information of enhancer and promoter gene sequences comprehensively and in many directions, and make accurate recognition of cell line-specific EPIs. Ultimately, the source code to implement StackEPI and experimental data involved in the experiment are available at https://github.com/20032303092/StackEPI.git. Conclusions The comparison results show that our model can deliver better performance on the problem of identifying cell line-specific EPIs and outperform other state-of-the-art models. In addition, our model also has a more efficient computation speed. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04821-9.
Collapse
Affiliation(s)
- Yongxian Fan
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China.
| | - Binchao Peng
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, 541004, China
| |
Collapse
|
21
|
CNNLSTMac4CPred: A Hybrid Model for N4-Acetylcytidine Prediction. Interdiscip Sci 2022; 14:439-451. [PMID: 35106702 DOI: 10.1007/s12539-021-00500-0] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2021] [Revised: 12/04/2021] [Accepted: 12/13/2021] [Indexed: 12/23/2022]
Abstract
N4-Acetylcytidine (ac4C) is a highly conserved post-transcriptional and an extensively existing RNA modification, playing versatile roles in the cellular processes. Due to the limitation of techniques and knowledge, large-scale identification of ac4C is still a challenging task. RNA sequences are like sentences containing semantics in the natural language. Inspired by the semantics of language, we proposed a hybrid model for ac4C prediction. The model used long short-term memory and convolution neural network to extract the semantic features hidden in the sequences. The semantic and the two traditional features (k-nucleotide frequencies and pseudo tri-tuple nucleotide composition) were combined to represent ac4C or non-ac4C sequences. The eXtreme Gradient Boosting was used as the learning algorithm. Five-fold cross-validation over the training set consisting of 1160 ac4C and 10,855 non-ac4C sequences obtained the area under the receiver operating characteristic curve (AUROC) of 0.9004, and the independent test over 469 ac4C and 4343 non-ac4C sequences reached an AUROC of 0.8825. The model obtained a sensitivity of 0.6474 in the five-fold cross-validation and 0.6290 in the independent test, outperforming two state-of-the-art methods. The performance of semantic features alone was better than those of k-nucleotide frequencies and pseudo tri-tuple nucleotide composition, implying that ac4C sequences are of semantics. The proposed hybrid model was implemented into a user-friendly web-server which is freely available to scientific communities: http://47.113.117.61/ac4c/ . The presented model and tool are beneficial to identify ac4C on large scale.
Collapse
|
22
|
Liu P, Chang H, Xu Q, Wang D, Tang Y, Hu X, Lin M, Liu Z. Peptide Aptamer PA3 Attenuates the Viability of Aeromonas veronii by Hindering of Small Protein B-Outer Membrane Protein A Signal Pathway. Front Microbiol 2022; 13:900234. [PMID: 35663889 PMCID: PMC9159911 DOI: 10.3389/fmicb.2022.900234] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/20/2022] [Accepted: 04/12/2022] [Indexed: 11/15/2022] Open
Abstract
The small protein B (SmpB), previously acting as a ribosome rescue factor for translation quality control, is required for cell viability in bacteria. Here, our study reveals that SmpB possesses new function which regulates the expression of outer membrane protein A (ompA) gene as a transcription factor in Aeromonas veronii. The deletion of SmpB caused the lower transcription expression of ompA by Quantitative Real-Time PCR (qPCR). Electrophoretic mobility shift assay (EMSA) and DNase I Footprinting verified that the SmpB bound at the regions of −46 to −28 bp, −18 to +4 bp, +21 to +31 bp, and +48 to +59 bp of the predicted ompA promoter (PompA). The key sites C52AT was further identified to interact with SmpB when PompA was fused with enhanced green fluorescent protein (EGFP) and co-transformed with SmpB expression vector for the fluorescence detection, and the result was further confirmed in microscale thermophoresis (MST) assays. Besides, the amino acid sites G11S, F26I, and K152 in SmpB were the key sites for binding to PompA. In order to further develop peptide antimicrobial agents, the peptide aptamer PA3 was screened from the peptide aptamer (PA) library by bacterial two-hybrid method. The drug sensitivity test showed that PA3 effectively inhibited the growth of A. veronii. In summary, these results demonstrated that OmpA was a good drug target for A. veronii, which was regulated by the SmpB protein and the selected peptide aptamer PA3 interacted with OmpA protein to disable SmpB-OmpA signal pathway and inhibited A. veronii, suggesting that it could be used as an antimicrobial agent for the prevention and treatment of pathogens.
Collapse
Affiliation(s)
- Peng Liu
- School of Life Sciences, Hainan University, Haikou, China
- Center for Medical Innovation, School of Basic Medical Science, Guangxi University of Chinese Medicine, Nanning, China
| | - Huimin Chang
- School of Life Sciences, Hainan University, Haikou, China
| | - Qi Xu
- School of Life Sciences, Hainan University, Haikou, China
| | - Dan Wang
- School of Life Sciences, Hainan University, Haikou, China
| | - Yanqiong Tang
- School of Life Sciences, Hainan University, Haikou, China
- One Health Institute, Hainan University, Haikou, China
| | - Xinwen Hu
- School of Life Sciences, Hainan University, Haikou, China
| | - Min Lin
- Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing, China
| | - Zhu Liu
- School of Life Sciences, Hainan University, Haikou, China
- One Health Institute, Hainan University, Haikou, China
- *Correspondence: Zhu Liu,
| |
Collapse
|
23
|
Nguyen TTD, Ho QT, Le NQK, Phan VD, Ou YY. Use Chou's 5-Steps Rule With Different Word Embedding Types to Boost Performance of Electron Transport Protein Prediction Model. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1235-1244. [PMID: 32750894 DOI: 10.1109/tcbb.2020.3010975] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Living organisms receive necessary energy substances directly from cellular respiration. The completion of electron storage and transportation requires the process of cellular respiration with the aid of electron transport chains. Therefore, the work of deciphering electron transport proteins is inevitably needed. The identification of these proteins with high performance has a prompt dependence on the choice of methods for feature extraction and machine learning algorithm. In this study, protein sequences served as natural language sentences comprising words. The nominated word embedding-based feature sets, hinged on the word embedding modulation and protein motif frequencies, were useful for feature choosing. Five word embedding types and a variety of conjoint features were examined for such feature selection. The support vector machine algorithm consequentially was employed to perform classification. The performance statistics within the 5-fold cross-validation including average accuracy, specificity, sensitivity, as well as MCC rates surpass 0.95. Such metrics in the independent test are 96.82, 97.16, 95.76 percent, and 0.9, respectively. Compared to state-of-the-art predictors, the proposed method can generate more preferable performance above all metrics indicating the effectiveness of the proposed method in determining electron transport proteins. Furthermore, this study reveals insights about the applicability of various word embeddings for understanding surveyed sequences.
Collapse
|
24
|
Guo Y, Zhou D, Cao J, Nie R, Ruan X, Liu Y. Gated residual neural networks with self-normalization for translation initiation site recognition. Knowl Based Syst 2022. [DOI: 10.1016/j.knosys.2021.107783] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/22/2022]
|
25
|
Zhang Z, Wang L. Using Chou's 5-steps rule to identify N 6-methyladenine sites by ensemble learning combined with multiple feature extraction methods. J Biomol Struct Dyn 2022; 40:796-806. [PMID: 32948102 DOI: 10.1080/07391102.2020.1821778] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/23/2022]
Abstract
N6-methyladenine (m6A), a type of modification mostly affecting the downstream biological functions and determining the levels of gene expression, is mediated by the methylation of adenine in nucleic acids. It is also a key factor for influencing biological processes and has attracted attention as a target for treating diseases. Here, an ensemble predictor named as TL-Methy, was developed to identify m6A sites across the genome. TL-Methy is a 2-level machine learning method developed by combining the support vector machine model and multiple features extraction methods, including nucleic acid composition, di-nucleotide composition, tri-nucleotide composition, position-specific trinucleotide propensity, Bi-profile Bayes, binary encoding, and accumulated nucleotide frequency. For Homo sapiens, TL-Methy method reached the accuracy of 91.68% on jackknife test and of 92.23% on 10-fold cross validation test; For Mus musculus, TL-Methy method achieved the accuracy of 93.66% on jackknife test and of 97.07% on 10-fold cross validation test; For Saccharomyces cerevisiae, TL-Methy method obtained the accuracy of 81.57% on jackknife test and of 82.54% on 10-fold cross validation test; For rice genome, TL-Methy method achieved the accuracy of 91.87% on jackknife test and of 93.04% on 10-fold cross validation test. The results via these two test approaches demonstrated the robustness and practicality of our TL-Methy model. The TL-Methy model may be as a potential method for m6A site identification.Communicated by Ramaswamy H. Sarma.
Collapse
Affiliation(s)
- Zhongwang Zhang
- College of Science, Dalian Maritime University, Dalian, P.R. China
| | - Lidong Wang
- College of Science, Dalian Maritime University, Dalian, P.R. China
| |
Collapse
|
26
|
ASRmiRNA: Abiotic Stress-Responsive miRNA Prediction in Plants by Using Machine Learning Algorithms with Pseudo K-Tuple Nucleotide Compositional Features. Int J Mol Sci 2022; 23:ijms23031612. [PMID: 35163534 PMCID: PMC8835813 DOI: 10.3390/ijms23031612] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2021] [Revised: 01/23/2022] [Accepted: 01/26/2022] [Indexed: 02/04/2023] Open
Abstract
MicroRNAs (miRNAs) play a significant role in plant response to different abiotic stresses. Thus, identification of abiotic stress-responsive miRNAs holds immense importance in crop breeding programmes to develop cultivars resistant to abiotic stresses. In this study, we developed a machine learning-based computational method for prediction of miRNAs associated with abiotic stresses. Three types of datasets were used for prediction, i.e., miRNA, Pre-miRNA, and Pre-miRNA + miRNA. The pseudo K-tuple nucleotide compositional features were generated for each sequence to transform the sequence data into numeric feature vectors. Support vector machine (SVM) was employed for prediction. The area under receiver operating characteristics curve (auROC) of 70.21, 69.71, 77.94 and area under precision-recall curve (auPRC) of 69.96, 65.64, 77.32 percentages were obtained for miRNA, Pre-miRNA, and Pre-miRNA + miRNA datasets, respectively. Overall prediction accuracies for the independent test set were 62.33, 64.85, 69.21 percentages, respectively, for the three datasets. The SVM also achieved higher accuracy than other learning methods such as random forest, extreme gradient boosting, and adaptive boosting. To implement our method with ease, an online prediction server “ASRmiRNA” has been developed. The proposed approach is believed to supplement the existing effort for identification of abiotic stress-responsive miRNAs and Pre-miRNAs.
Collapse
|
27
|
Shahraki S, Samareh Delarami H, Poorsargol M, Sori Nezami Z. Structural and functional changes of catalase through interaction with Erlotinib hydrochloride. Use of Chou's 5-steps rule to study mechanisms. SPECTROCHIMICA ACTA. PART A, MOLECULAR AND BIOMOLECULAR SPECTROSCOPY 2021; 260:119940. [PMID: 34038867 DOI: 10.1016/j.saa.2021.119940] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 01/04/2021] [Revised: 05/06/2021] [Accepted: 05/07/2021] [Indexed: 06/12/2023]
Abstract
Erlotinib hydrochloride (Erlo) is used in the treatment of non-small cell lung cancer, pancreatic cancer and other types of cancer. Interaction of small molecules with bio-macromolecules can lead to changes in the structure and function of them which is one of the possible side effects of the drugs. In this study, the interaction of Erlo with bovine liver catalase (BLC) using spectroscopic and computational methods is presented in detail. The enzymatic function of BLC decreased to 58.7% when the concentration of the Erlo was 0.5 × 10-7 M. Fluorescence results revealed that the combination of BLC with Erlo undergoes static quenching mechanism (Kb = 1.15 × 104 M-1 at 300 K). The interaction process was spontaneous, exothermic and enthalpy-driven and Van der Waals and hydrogen bonds forces played major roles in the this process. UV-Vis, CD, 3D, and synchronous fluorescence measurements indicated the changes in the microenvironment residues and α-helix contents of BLC in the presence of Erlo. Docking and molecular dynamics presented a stable binding configuration and their results were perfectly consistent with the spectroscopic results. Theoretical calculations and experimental analysis help to fully understand of drug interaction with important biological molecules such as enzymes.
Collapse
|
28
|
Lyu Y, Zhang Z, Li J, He W, Ding Y, Guo F. iEnhancer-KL: A Novel Two-Layer Predictor for Identifying Enhancers by Position Specific of Nucleotide Composition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2809-2815. [PMID: 33481715 DOI: 10.1109/tcbb.2021.3053608] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
An enhancer is a short region of DNA with the ability to recruit transcription factors and their complexes, increasing the likelihood of the transcription of a particular gene. Considering the importance of enhancers, enhancer identification is a prevailing problem in computational biology. In this paper, we propose a novel two-layer enhancer predictor called iEnhancer-KL, using computational biology algorithms to identify enhancers and then classify these enhancers into strong or weak types. Kullback-Leibler (KL) divergence is creatively taken into consideration to improve the feature extraction method PSTNPss. Then, LASSO is used to reduce the dimension of features and finally helps to get better prediction performance. Furthermore, the selected features are tested on several machine learning models, and the SVM algorithm achieves the best performance. The rigorous cross-validation indicates that our predictor is remarkably superior to the existing state-of-the-art methods with an Acc of 84.23 percent and the MCC of 0.6849 for identifying enhancers. Our code and results can be freely downloaded from https://github.com/Not-so-middle/iEnhancer-KL.git.
Collapse
|
29
|
Zheng Y, Wang H, Ding Y, Guo F. CEPZ: A Novel Predictor for Identification of DNase I Hypersensitive Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2768-2774. [PMID: 33481716 DOI: 10.1109/tcbb.2021.3053661] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
DNase I hypersensitive sites (DHSs) have proven to be tightly associated with cis-regulatory elements, commonly indicating specific function on the chromatin structure. Thus, identifying DHSs plays a fundamental role in decoding gene regulatory behavior. While traditional experimental methods turn to be time-consuming and expensive, computational techniques promise to be practical to discovering and analyzing regulatory factors. In this study, we applied an efficient model that considered composition information and physicochemical properties and effectively selected features with a boosting algorithm. CEPZ, our predictor, greatly improved a Matthews correlation coefficient and accuracy of 0.7740 and 0.9113 respectively, more competitive than any predictor before. This result suggests that it may become a useful tool for DHSs research in the human and other complex genomes. Our research was anchored on the properties of dinucleotides and we identified several dinucleotides with significant differences in the distribution of DHS and non-DHS samples, which are likely to have a special meaning in the chromatin structure. The datasets, feature sets and the relevant algorithm are available at https://github.com/YanZheng-16/CEPZ_DHS/.
Collapse
|
30
|
Wang X, Chen J, Ni H, Mustafa G, Yang Y, Wang Q, Fu H, Zhang L, Yang B. Use Chou's 5-steps rule to identify protein post-translational modification and its linkage to secondary metabolism during the floral development of Lonicera japonica Thunb. PLANT PHYSIOLOGY AND BIOCHEMISTRY : PPB 2021; 167:1035-1048. [PMID: 34600181 DOI: 10.1016/j.plaphy.2021.09.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/16/2021] [Revised: 08/01/2021] [Accepted: 09/07/2021] [Indexed: 06/13/2023]
Abstract
Lonicera japonica Thunb. is widely used in traditional medicine systems of East Asian and attracts a large amount of studies on the biosynthesis of its active components. Currently, there is little understanding regarding the regulatory mechanisms behind the accumulation of secondary metabolites during its developmental stages. In this study, published transcriptomic and proteomic data were mined to evaluate potential linkage between protein modification and secondary metabolism during the floral development. Stronger correlations were observed between differentially expressed genes (DEGs) and their corresponding differentially abundant proteins (DAPs) in the comparison of juvenile bud stage (JBS)/third green stage (TGS) vs. silver flowering stage (SFS). Seventy-five and 76 cor-rDEGs and cor-rDAPs (CDDs) showed opposite trends at both transcriptional and translational levels when comparing their levels at JBS and TGS relative to those at SFS. CDDs were mainly involved in elements belonging to the protein metabolism and the TCA cycle. Protein-protein interaction analysis indicated that the interacting proteins in the major cluster were primarily involved in TCA cycle and protein metabolism. In the simple phenylpropanoids biosynthetic pathway of SFS, both phospho-2-dehydro-3-deoxyheptonate aldolase (PDA) and glutamate/aspartate-prephenate aminotransferase (AAT) were decreased at the protein level, but increased at the gene level. A confirmatory experiment indicated that protein ubiquitination and succinylation were more prominent during the early floral developmental stages, in correlation with simple phenylpropanoids accumulation. Taken together, those data indicates that phenylpropanoids metabolism and floral development are putatively regulated through the ubiquitination and succinylation modifications of PDA, AAT, and TCA cycle proteins in L. japonica.
Collapse
Affiliation(s)
- Xueqin Wang
- College of Life Sciences and Medicine, Zhejiang Sci-Tech University, Hangzhou, 310018, China
| | - Jiaqi Chen
- College of Life Sciences and Medicine, Zhejiang Sci-Tech University, Hangzhou, 310018, China
| | - Haofu Ni
- College of Life Sciences and Medicine, Zhejiang Sci-Tech University, Hangzhou, 310018, China
| | - Ghazala Mustafa
- Department of Plant Sciences, Faculty of Biological Sciences, Quaid-i-Azam University, Islamabad, 45320, Pakistan
| | - Yuling Yang
- Wenshan Academy of Agricultural Sciences, Wenshan, 663000, China
| | - Qi Wang
- Key Laboratory of Xinjiang Phytomedicine Resource and Utilization, Ministry of Education, School of Pharmacy, Shihezi University, Shihezi, 832002, China
| | - Hongwei Fu
- College of Life Sciences and Medicine, Zhejiang Sci-Tech University, Hangzhou, 310018, China
| | - Lin Zhang
- College of Life Sciences and Medicine, Zhejiang Sci-Tech University, Hangzhou, 310018, China.
| | - Bingxian Yang
- College of Life Sciences and Medicine, Zhejiang Sci-Tech University, Hangzhou, 310018, China.
| |
Collapse
|
31
|
Alghamdi W, Alzahrani E, Ullah MZ, Khan YD. 4mC-RF: Improving the prediction of 4mC sites using composition and position relative features and statistical moment. Anal Biochem 2021; 633:114385. [PMID: 34571005 DOI: 10.1016/j.ab.2021.114385] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2020] [Revised: 09/09/2021] [Accepted: 09/13/2021] [Indexed: 01/28/2023]
Abstract
N4-methylcytosine (4 mC) is an important epigenetic modification that occurs enzymatically by the action of DNA methyltransferases. 4 mC sites exist in prokaryotes and eukaryotes while playing a vital role in regulating gene expression, DNA replication, and cell cycle. The efficient and accurate prediction of 4 mC sites has a significant role in the insight of 4 mC biological properties and functions. Therefore, a sequence-based predictor is proposed, namely 4 mC-RF, for identifying 4 mC sites through the integration of statistical moments along with position, and composition-dependent features. Relative and absolute position-based features are computed to extract optimal features. A popular machine learning classifier Random Forest was used for training the model. Validation results were obtained through rigorous processes of self-consistency, 10-fold cross-validation, Independent set testing, and Jackknife yielding 95.1%, 95.2%, 97.0%, and 94.7% accuracies, respectively. Our proposed model depicts the highest prediction accuracies as compared to existing models. Subsequently, the developed 4 mC-RF model was constructed into a web server. A significant and more accurate predictor of 4 mC Methylcytosine sites helps experimental scientists to gather faster, efficient, and cost-effective results.
Collapse
Affiliation(s)
- Wajdi Alghamdi
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P. O. Box 80221, Jeddah 21589, Saudi Arabia.
| | - Ebraheem Alzahrani
- Department of Mathematics, Faculty of Science, King Abdulaziz University, P. O. Box 80203, Jeddah 21589, Saudi Arabia.
| | - Malik Zaka Ullah
- Department of Mathematics, Faculty of Science, King Abdulaziz University, P. O. Box 80203, Jeddah 21589, Saudi Arabia.
| | - Yaser Daanial Khan
- Department of Computer Science, University of Management and Technology, Lahore 54770, Pakistan.
| |
Collapse
|
32
|
Restrepo G. Semiotic thoughts on biological sequence representations. Comb Chem High Throughput Screen 2021; 25:349-353. [PMID: 34225612 DOI: 10.2174/1386207324666210705112232] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2020] [Revised: 04/08/2021] [Accepted: 06/12/2021] [Indexed: 11/22/2022]
Abstract
The deluge of biological sequences ranging from those of proteins, DNA and RNA to genomes has increased the models for their representation, which are further used to contrast those sequences. Here we present a brief bibliometric description of the research area devoted to representation of biological sequences and highlight the semiotic reaches of this process. Finally, we argue that this research area needs further research according to the evolution of mathematical chemistry and its drawbacks are required to be overcome.
Collapse
Affiliation(s)
- Guillermo Restrepo
- Max Planck Institute for Mathematics in the Sciences, Leipzig, Germany; Interdisciplinary Center for Bioinformatics, Universität Leipzig, Leipzig, Germany
| |
Collapse
|
33
|
Feng P, Feng L, Tang C. Comparison and Analysis of Computational Methods for Identifying N6-Methyladenosine Sites in Saccharomyces cerevisiae. Curr Pharm Des 2021; 27:1219-1229. [PMID: 33167827 DOI: 10.2174/1381612826666201109110703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2020] [Accepted: 07/20/2020] [Indexed: 11/22/2022]
Abstract
BACKGROUND N6-methyladenosine (m6A) plays critical roles in a broad range of biological processes. Knowledge about the precise location of m6A site in the transcriptome is vital for deciphering its biological functions. Although experimental techniques have made substantial contributions to identify m6A, they are still labor intensive and time consuming. As complement to experimental methods, in the past few years, a series of computational approaches have been proposed to identify m6A sites. METHODS In order to facilitate researchers to select appropriate methods for identifying m6A sites, it is necessary to conduct a comprehensive review and comparison of existing methods. RESULTS Since research works on m6A in Saccharomyces cerevisiae are relatively clear, in this review, we summarized recent progress of computational prediction of m6A sites in S. cerevisiae and assessed the performance of existing computational methods. Finally, future directions of computationally identifying m6A sites are presented. CONCLUSION Taken together, we anticipate that this review will serve as an important guide for computational analysis of m6A modifications.
Collapse
Affiliation(s)
- Pengmian Feng
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
| | - Lijing Feng
- School of Sciences, North China University of Science and Technology, Tangshan 063000, China
| | - Chaohui Tang
- School of Basic Medical Sciences, Chengdu University of Traditional Chinese Medicine, Chengdu 611730, China
| |
Collapse
|
34
|
Cai L, Ren X, Fu X, Peng L, Gao M, Zeng X. iEnhancer-XG: interpretable sequence-based enhancers and their strength predictor. Bioinformatics 2021; 37:1060-1067. [PMID: 33119044 DOI: 10.1093/bioinformatics/btaa914] [Citation(s) in RCA: 56] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/20/2020] [Revised: 09/30/2020] [Accepted: 10/15/2020] [Indexed: 01/10/2023] Open
Abstract
MOTIVATION Enhancers are non-coding DNA fragments with high position variability and free scattering. They play an important role in controlling gene expression. As machine learning has become more widely used in identifying enhancers, a number of bioinformatic tools have been developed. Although several models for identifying enhancers and their strengths have been proposed, their accuracy and efficiency have yet to be improved. RESULTS We propose a two-layer predictor called 'iEnhancer-XG.' It comprises a one-layer predictor (for identifying enhancers) and a second classifier (for their strength) and uses 'XGBoost' as a base classifier and five feature extraction methods, namely, k-Spectrum Profile, Mismatch k-tuple, Subsequence Profile, Position-specific scoring matrix (PSSM) and Pseudo dinucleotide composition (PseDNC). Each method has an independent output. We place the feature vector matrix into the ensemble learning for fusion. This experiment involves the method of 'SHapley Additive explanations' to provide interpretability for the previous black box machine learning methods and improve their credibility. The accuracies of the ensemble learning method are 0.811 (first layer) and 0.657 (second layer). The rigorous 10-fold cross-validation confirms that the proposed method is significantly better than existing technologies. AVAILABILITY AND IMPLEMENTATION The source code and dataset for the enhancer predictions have been uploaded to https://github.com/jimmyrate/ienhancer-xg. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Lijun Cai
- College of Computer Science and Electronic Engineering, Hunan University, 410082 Changsha, Hunan, China
| | - Xuanbai Ren
- College of Computer Science and Electronic Engineering, Hunan University, 410082 Changsha, Hunan, China
| | - Xiangzheng Fu
- College of Computer Science and Electronic Engineering, Hunan University, 410082 Changsha, Hunan, China
| | - Li Peng
- College of Computer Science and Engineering, Hunan University of Science and Technology, 411103 XiangTan, China
| | - Mingyu Gao
- College of Computer Science and Electronic Engineering, Hunan University, 410082 Changsha, Hunan, China
| | - Xiangxiang Zeng
- College of Computer Science and Electronic Engineering, Hunan University, 410082 Changsha, Hunan, China
| |
Collapse
|
35
|
Rahman CR, Amin R, Shatabda S, Toaha MSI. A convolution based computational approach towards DNA N6-methyladenine site identification and motif extraction in rice genome. Sci Rep 2021; 11:10357. [PMID: 33990665 PMCID: PMC8121938 DOI: 10.1038/s41598-021-89850-9] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/07/2020] [Accepted: 05/04/2021] [Indexed: 12/23/2022] Open
Abstract
DNA N6-methylation (6mA) in Adenine nucleotide is a post replication modification responsible for many biological functions. Automated and accurate computational methods can help to identify 6mA sites in long genomes saving significant time and money. Our study develops a convolutional neural network (CNN) based tool i6mA-CNN capable of identifying 6mA sites in the rice genome. Our model coordinates among multiple types of features such as PseAAC (Pseudo Amino Acid Composition) inspired customized feature vector, multiple one hot representations and dinucleotide physicochemical properties. It achieves auROC (area under Receiver Operating Characteristic curve) score of 0.98 with an overall accuracy of 93.97% using fivefold cross validation on benchmark dataset. Finally, we evaluate our model on three other plant genome 6mA site identification test datasets. Results suggest that our proposed tool is able to generalize its ability of 6mA site identification on plant genomes irrespective of plant species. An algorithm for potential motif extraction and a feature importance analysis procedure are two by products of this research. Web tool for this research can be found at: https://cutt.ly/dgp3QTR.
Collapse
Affiliation(s)
| | - Ruhul Amin
- United International University, Dhaka, Bangladesh
| | | | | |
Collapse
|
36
|
Yao Y, Zhang S, Liang Y. iORI-ENST: identifying origin of replication sites based on elastic net and stacking learning. SAR AND QSAR IN ENVIRONMENTAL RESEARCH 2021; 32:317-331. [PMID: 33730950 DOI: 10.1080/1062936x.2021.1895884] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 12/21/2020] [Accepted: 02/23/2021] [Indexed: 06/12/2023]
Abstract
DNA replication is not only the basis of biological inheritance but also the most fundamental process in all living organisms. It plays a crucial role in the cell-division cycle and gene expression regulation. Hence, the accurate identification of the origin of replication sites (ORIs) has a great meaning for further understanding the regulatory mechanism of gene expression and treating genic diseases. In this paper, a novel, feasible and powerful model, namely, iORI-ENST is designed for identifying ORIs. Firstly, we extract the different features by incorporating mono-nucleotide binary encoding and dinucleotide-based spatial autocorrelation. Subsequently, elastic net is utilized as the feature selection method to select the optimal feature set. And then stacking learning is employed to predict ORIs and non-ORIs, which contains random forest, adaboost, gradient boosting decision tree, extra trees and support vector machine. Finally, the ORI sites are identified on the benchmark datasets S1 and S2 with their accuracies of 91.41% and 95.07%, respectively. Meanwhile, an independent dataset S3 is employed to verify the validation and transferability of our model and its accuracy reaches 91.10%. Comparing with state-of-the-art methods, our model achieves more remarkable performance. The results show our model is a feasible, effective and powerful tool for identifying ORIs. The source code and datasets are available at https://github.com/YingyingYao/iORI-ENST.
Collapse
Affiliation(s)
- Y Yao
- School of Mathematics and Statistics, Xidian University, Xi'an, P. R. China
| | - S Zhang
- School of Mathematics and Statistics, Xidian University, Xi'an, P. R. China
| | - Y Liang
- School of Science, Xi'an Polytechnic University, Xi'an, P. R. China
| |
Collapse
|
37
|
Khan YD, Alzahrani E, Alghamdi W, Ullah MZ. Sequence-based Identification of Allergen Proteins Developed by Integration of PseAAC and Statistical Moments via 5-Step Rule. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200424085947] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]
Abstract
Background:
Allergens are antigens that can stimulate an atopic type I human
hypersensitivity reaction by an immunoglobulin E (IgE) reaction. Some proteins are naturally
allergenic than others. The challenge for toxicologists is to identify properties that allow proteins
to cause allergic sensitization and allergic diseases. The identification of allergen proteins is a very
critical and pivotal task. The experimental identification of protein functions is a hectic, laborious
and costly task; therefore, computer scientists have proposed various methods in the field of
computational biology and bioinformatics using various data science approaches. Objectives:
Herein, we report a novel predictor for the identification of allergen proteins.
Methods:
For feature extraction, statistical moments and various position-based features have been
incorporated into Chou’s pseudo amino acid composition (PseAAC), and are used for training of a
neural network.
Results:
The predictor is validated through 10-fold cross-validation and Jackknife testing, which
gave 99.43% and 99.87% accurate results.
Conclusions:
Thus, the proposed predictor can help in predicting the Allergen proteins in an
efficient and accurate way and can provide baseline data for the discovery of new drugs and
biomarkers.
Collapse
Affiliation(s)
- Yaser Daanial Khan
- Department of Computer Science, School of Systems and Technology, University of Management and Technology, C II Johar Town, Lahore 54770, Pakistan
| | - Ebraheem Alzahrani
- Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Saudi Arabia
| | - Wajdi Alghamdi
- Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, P.O. Box 80221, Jeddah, Saudi Arabia
| | - Malik Zaka Ullah
- Department of Mathematics, Faculty of Science, King Abdulaziz University, P.O. Box 80203, Jeddah 21589, Saudi Arabia
| |
Collapse
|
38
|
Du X, Hu J, Li S. Using Chou's 5-Step Rule to Predict DNA-Protein Binding with Multi-scale Complementary Feature. J Proteome Res 2021; 20:1639-1656. [PMID: 33522829 DOI: 10.1021/acs.jproteome.0c00864] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022]
Abstract
It is well known that DNA-protein binding (DPB) prediction is not only beneficial to understand the regulation mechanism of gene expression but also a challenging task in the field of computational biology. Traditional methods for DPB prediction that depend on manually extracted features may lead to classification errors. Recently, deep learning such as convolutional neural network (CNN) has been successfully applied to classification tasks and improved DPB prediction performance significantly. Yet, these methods are based on the original DNA sequence modeling, ignoring the hidden complex dependency and complementarity between multiple sequence features. In consideration of this problem, we propose a method to fuse different sequence features and analyze them systematically through multi-scale CNN. First, sliding windows of specified lengths are set on distinct DNA sequences to generate multiple sequence features with unequal lengths. Second, multiple feature sequences are fused and encoded for feature representation. Third, multi-scale CNN with different binding motif lengths is used to automatically learn and mine the influence of internal attributes and hidden complex relations between the fusion sequence features and make full use of the complementary advantages of extracted CNN features to predict DPB. When our model is applied to 690 ChIP-seq datasets, it achieves an average AUC of 0.9112, which is significantly better than the latest methods. The results show that our method is effective for DPB prediction and is freely available at http://121.5.71.120/mscDPB/.
Collapse
Affiliation(s)
- Xiuquan Du
- Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, Anhui, China.,School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, China
| | - Jiajia Hu
- School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, China
| | - Shuo Li
- Department of Medical Imaging, Western University, London, ON N6A 3K7, Canada
| |
Collapse
|
39
|
Hasan MM, Shoombuatong W, Kurata H, Manavalan B. Critical evaluation of web-based DNA N6-methyladenine site prediction tools. Brief Funct Genomics 2021; 20:258-272. [PMID: 33491072 DOI: 10.1093/bfgp/elaa028] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2020] [Revised: 12/11/2020] [Accepted: 12/15/2020] [Indexed: 12/13/2022] Open
Abstract
Methylation of DNA N6-methyladenosine (6mA) is a type of epigenetic modification that plays pivotal roles in various biological processes. The accurate genome-wide identification of 6mA is a challenging task that leads to understanding the biological functions. For the last 5 years, a number of bioinformatics approaches and tools for 6mA site prediction have been established, and some of them are easily accessible as web application. Nevertheless, the accurate genome-wide identification of 6mA is still one of the challenging works that lead to understanding the biological functions. Especially in practical applications, these tools have implemented diverse encoding schemes, machine learning algorithms and feature selection methods, whereas few systematic performance comparisons of 6mA site predictors have been reported. In this review, 11 publicly available 6mA predictors evaluated with seven different species-specific datasets (Arabidopsis thaliana, Tolypocladium, Diospyros lotus, Saccharomyces cerevisiae, Drosophila melanogaster, Caenorhabditis elegans and Escherichia coli). Of those, few species are close homologs, and the remaining datasets are distant sequences. Our independent, validation tests demonstrated that Meta-i6mA and MM-6mAPred models for A. thaliana, Tolypocladium, S. cerevisiae and D. melanogaster achieved excellent overall performance when compared with their counterparts. However, none of the existing methods were suitable for E. coli, C. elegans and D. lotus. A feasibility of the existing predictors is also discussed for the seven species. Our evaluation provides useful guidelines for the development of 6mA site predictors and helps biologists selecting suitable prediction tools.
Collapse
Affiliation(s)
| | - Watshara Shoombuatong
- Data Mining and Biomedical Informatics, Faculty of Medical Technology, Mahidol University
| | - Hiroyuki Kurata
- Department of Bioscience and Bioinformatics in the Kyushu Institute of Technology, Japan
| | | |
Collapse
|
40
|
Wang H, Ding Y, Tang J, Zou Q, Guo F. Identify RNA-associated subcellular localizations based on multi-label learning using Chou's 5-steps rule. BMC Genomics 2021; 22:56. [PMID: 33451286 PMCID: PMC7811227 DOI: 10.1186/s12864-020-07347-7] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2020] [Accepted: 12/22/2020] [Indexed: 12/04/2022] Open
Abstract
BACKGROUND Biological functions of biomolecules rely on the cellular compartments where they are located in cells. Importantly, RNAs are assigned in specific locations of a cell, enabling the cell to implement diverse biochemical processes in the way of concurrency. However, lots of existing RNA subcellular localization classifiers only solve the problem of single-label classification. It is of great practical significance to expand RNA subcellular localization into multi-label classification problem. RESULTS In this study, we extract multi-label classification datasets about RNA-associated subcellular localizations on various types of RNAs, and then construct subcellular localization datasets on four RNA categories. In order to study Homo sapiens, we further establish human RNA subcellular localization datasets. Furthermore, we utilize different nucleotide property composition models to extract effective features to adequately represent the important information of nucleotide sequences. In the most critical part, we achieve a major challenge that is to fuse the multivariate information through multiple kernel learning based on Hilbert-Schmidt independence criterion. The optimal combined kernel can be put into an integration support vector machine model for identifying multi-label RNA subcellular localizations. Our method obtained excellent results of 0.703, 0.757, 0.787, and 0.800, respectively on four RNA data sets on average precision. CONCLUSION To be specific, our novel method performs outstanding rather than other prediction tools on novel benchmark datasets. Moreover, we establish user-friendly web server with the implementation of our method.
Collapse
Affiliation(s)
- Hao Wang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Yijie Ding
- School of Electronic and Information Engineering, Suzhou University of Science and Technology, Suzhou, China
| | - Jijun Tang
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China
- School of Computational Science and Engineering, University of South Carolina, Columbia, 29208, SC, US
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
| | - Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin, China.
| |
Collapse
|
41
|
Zhuang J, Liu D, Lin M, Qiu W, Liu J, Chen S. PseUdeep: RNA Pseudouridine Site Identification with Deep Learning Algorithm. Front Genet 2021; 12:773882. [PMID: 34868261 PMCID: PMC8637112 DOI: 10.3389/fgene.2021.773882] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/10/2021] [Accepted: 10/04/2021] [Indexed: 11/16/2022] Open
Abstract
Background: Pseudouridine (Ψ) is a common ribonucleotide modification that plays a significant role in many biological processes. The identification of Ψ modification sites is of great significance for disease mechanism and biological processes research in which machine learning algorithms are desirable as the lab exploratory techniques are expensive and time-consuming. Results: In this work, we propose a deep learning framework, called PseUdeep, to identify Ψ sites of three species: H. sapiens, S. cerevisiae, and M. musculus. In this method, three encoding methods are used to extract the features of RNA sequences, that is, one-hot encoding, K-tuple nucleotide frequency pattern, and position-specific nucleotide composition. The three feature matrices are convoluted twice and fed into the capsule neural network and bidirectional gated recurrent unit network with a self-attention mechanism for classification. Conclusion: Compared with other state-of-the-art methods, our model gets the highest accuracy of the prediction on the independent testing data set S-200; the accuracy improves 12.38%, and on the independent testing data set H-200, the accuracy improves 0.68%. Moreover, the dimensions of the features we derive from the RNA sequences are only 109,109, and 119 in H. sapiens, M. musculus, and S. cerevisiae, which is much smaller than those used in the traditional algorithms. On evaluation via tenfold cross-validation and two independent testing data sets, PseUdeep outperforms the best traditional machine learning model available. PseUdeep source code and data sets are available at https://github.com/dan111262/PseUdeep.
Collapse
Affiliation(s)
- Jujuan Zhuang
- College of Science, Dalian Maritime University, Dalian, China
| | - Danyang Liu
- College of Science, Dalian Maritime University, Dalian, China
| | - Meng Lin
- College of Science, Dalian Maritime University, Dalian, China
| | - Wenjing Qiu
- Electrical and Information Engineering, Anhui University of Technology, Anhui, China
- Geneis (Beijing) Co., Ltd., Beijing, China
| | | | - Size Chen
- Department of Oncology, The First Affiliated Hospital of Guangdong Pharmaceutical University, Guangzhou, China
- Guangdong Provincial Engineering Research Center for Esophageal Cancer Precise Therapy, The First Affiliated Hospital of Guangdong Pharmaceutical University, Guangzhou, China
- Central Laboratory, The First Affiliated Hospital of Guangdong Pharmaceutical University, Guangzhou, China
- *Correspondence: Size Chen,
| |
Collapse
|
42
|
Yang XF, Zhou YK, Zhang L, Gao Y, Du PF. Predicting LncRNA Subcellular Localization Using Unbalanced Pseudo-k Nucleotide Compositions. Curr Bioinform 2020. [DOI: 10.2174/1574893614666190902151038] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/17/2022]
Abstract
Background:
Long non-coding RNAs (lncRNAs) are transcripts with a length more
than 200 nucleotides, functioning in the regulation of gene expression. More evidence has shown
that the biological functions of lncRNAs are intimately related to their subcellular localizations.
Therefore, it is very important to confirm the lncRNA subcellular localization.
Methods:
In this paper, we proposed a novel method to predict the subcellular localization of
lncRNAs. To more comprehensively utilize lncRNA sequence information, we exploited both kmer
nucleotide composition and sequence order correlated factors of lncRNA to formulate
lncRNA sequences. Meanwhile, a feature selection technique which was based on the Analysis Of
Variance (ANOVA) was applied to obtain the optimal feature subset. Finally, we used the support
vector machine (SVM) to perform the prediction.
Results:
The AUC value of the proposed method can reach 0.9695, which indicated the proposed
predictor is an efficient and reliable tool for determining lncRNA subcellular localization. Furthermore,
the predictor can reach the maximum overall accuracy of 90.37% in leave-one-out cross
validation, which clearly outperforms the existing state-of- the-art method.
Conclusion:
It is demonstrated that the proposed predictor is feasible and powerful for the prediction
of lncRNA subcellular. To facilitate subsequent genetic sequence research, we shared the
source code at https://github.com/NicoleYXF/lncRNA.
Collapse
Affiliation(s)
- Xiao-Fei Yang
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Yuan-Ke Zhou
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Lin Zhang
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Yang Gao
- School of Medicine, Nankai University, Tianjin 300071, China
| | - Pu-Feng Du
- College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| |
Collapse
|
43
|
Liu GH, Zhang BW, Qian G, Wang B, Mao B, Bichindaritz I. Bioimage-Based Prediction of Protein Subcellular Location in Human Tissue with Ensemble Features and Deep Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:1966-1980. [PMID: 31107658 DOI: 10.1109/tcbb.2019.2917429] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/09/2023]
Abstract
Prediction of protein subcellular location has currently become a hot topic because it has been proven to be useful for understanding both the disease mechanisms and novel drug design. With the rapid development of automated microscopic imaging technology in recent years, classification methods of bioimage-based protein subcellular location have attracted considerable attention for images can describe the protein distribution intuitively and in detail. In the current study, a prediction method of protein subcellular location was proposed based on multi-view image features that are extracted from three different views, including the four texture features of the original image, the global and local features of the protein extracted from the protein channel images after color segmentation, and the global features of DNA extracted from the DNA channel image. Finally, the extracted features were combined together to improve the performance of subcellular localization prediction. From the performance comparison of different combination features under the same classifier, the best ensemble features could be obtained. In this work, a classifier based on Stacked Auto-encoders and the random forest was also put forward. To improve the prediction results, the deep network was combined with the traditional statistical classification methods. Stringent cross-validation and independent validation tests on the benchmark dataset demonstrated the efficacy of the proposed method.
Collapse
|
44
|
LncLocation: Efficient Subcellular Location Prediction of Long Non-Coding RNA-Based Multi-Source Heterogeneous Feature Fusion. Int J Mol Sci 2020; 21:ijms21197271. [PMID: 33019721 PMCID: PMC7582431 DOI: 10.3390/ijms21197271] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2020] [Revised: 09/27/2020] [Accepted: 09/28/2020] [Indexed: 12/13/2022] Open
Abstract
Recent studies uncover that subcellular location of long non-coding RNAs (lncRNAs) can provide significant information on its function. Due to the lack of experimental data, the number of lncRNAs is very limited, experimentally verified subcellular localization, and the numbers of lncRNAs located in different organelle are wildly imbalanced. The prediction of subcellular location of lncRNAs is actually a multi-classification small sample imbalance problem. The imbalance of data results in the poor recognition effect of machine learning models on small data subsets, which is a puzzling and challenging problem in the existing research. In this study, we integrate multi-source features to construct a sequence-based computational tool, lncLocation, to predict the subcellular location of lncRNAs. Autoencoder is used to enhance part of the features, and the binomial distribution-based filtering method and recursive feature elimination (RFE) are used to filter some of the features. It improves the representation ability of data and reduces the problem of unbalanced multi-classification data. By comprehensive experiments on different feature combinations and machine learning models, we select the optimal features and classifier model scheme to construct a subcellular location prediction tool, lncLocation. LncLocation can obtain an 87.78% accuracy using 5-fold cross validation on the benchmark data, which is higher than the state-of-the-art tools, and the classification performance, especially for small class sets, is improved significantly.
Collapse
|
45
|
Tang Q, Nie F, Kang J, Chen W. ncPro-ML: An integrated computational tool for identifying non-coding RNA promoters in multiple species. Comput Struct Biotechnol J 2020; 18:2445-2452. [PMID: 33005306 PMCID: PMC7509369 DOI: 10.1016/j.csbj.2020.09.001] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2020] [Revised: 08/30/2020] [Accepted: 09/01/2020] [Indexed: 02/07/2023] Open
Abstract
A computational method for identifying non-coding promoters was proposed for the first time. A high-quality dataset was built to train and test the models for identifying non-coding promoters. A user-friendly web server was developed to recognize non-coding promoters.
The promoter is located near the transcription start sites and regulates transcription initiation of the gene. Accurate identification of promoters is essential for understanding the mechanism of gene regulation. Since experimental methods are costly and ineffective, developing efficient and accurate computational tools to identify promoters are necessary. Although a series of methods have been proposed for identifying promoters, none of them is able to identify the promoters of non-coding RNA (ncRNA). In the present work, a new method called ncPro-ML was proposed to identify the promoter of ncRNA in Homo sapiens and Mus musculus, in which different kinds of sequence encoding schemes were used to convert DNA sequences into feature vectors. To test the length effect, for each species, datasets including sequences with different lengths were built. The results demonstrated that ncPro-ML achieved the best performance based on the dataset with the sequence length of 221 nucleotides for human and mouse. The performances of ncPro-ML were also satisfying from both independent dataset test and cross-species test. The results indicate that the proposed predictor can server as a powerful tool for the discovery of ncRNA promoters. In addition, a web-server for ncPro-ML was developed, which can be freely accessed at http://www.bio-bigdata.cn/ncPro-ML/.
Collapse
Affiliation(s)
- Qiang Tang
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
| | - Fulei Nie
- Center for Genomics and Computational Biology, Scholl of Life Sciences, North China University of Science and Technology, Tangshan 063210, China
- School of Public Health, North China University of Science and Technology, Tangshan 063210, China
| | - Juanjuan Kang
- Affiliated Foshan Maternity & Child Healthcare Hospital, Southern Medical University (Foshan Maternity & Child Healthcare Hospital), Foshan 528000, China
| | - Wei Chen
- Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China
- Center for Genomics and Computational Biology, Scholl of Life Sciences, North China University of Science and Technology, Tangshan 063210, China
- School of Public Health, North China University of Science and Technology, Tangshan 063210, China
- Corresponding author: Innovative Institute of Chinese Medicine and Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu 611137, China.
| |
Collapse
|
46
|
Roy T, Bhattacharjee P. A LabVIEW-based real-time modeling approach for detection of abnormalities in cancer cells. GENE REPORTS 2020. [DOI: 10.1016/j.genrep.2020.100788] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/28/2022]
|
47
|
Use Chou’s 5-steps rule to identify DNase I hypersensitive sites via dinucleotide property matrix and extreme gradient boosting. Mol Genet Genomics 2020; 295:1431-1442. [DOI: 10.1007/s00438-020-01711-8] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/14/2020] [Accepted: 07/11/2020] [Indexed: 01/08/2023]
|
48
|
Saikia S, Bordoloi M, Sarmah R. Established and In-trial GPCR Families in Clinical Trials: A Review for Target Selection. Curr Drug Targets 2020; 20:522-539. [PMID: 30394207 DOI: 10.2174/1389450120666181105152439] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2018] [Revised: 08/28/2018] [Accepted: 10/22/2018] [Indexed: 12/14/2022]
Abstract
The largest family of drug targets in clinical trials constitute of GPCRs (G-protein coupled receptors) which accounts for about 34% of FDA (Food and Drug Administration) approved drugs acting on 108 unique GPCRs. Factors such as readily identifiable conserved motif in structures, 127 orphan GPCRs despite various de-orphaning techniques, directed functional antibodies for validation as drug targets, etc. has widened their therapeutic windows. The availability of 44 crystal structures of unique receptors, unexplored non-olfactory GPCRs (encoded by 50% of the human genome) and 205 ligand receptor complexes now present a strong foundation for structure-based drug discovery and design. The growing impact of polypharmacology for complex diseases like schizophrenia, cancer etc. warrants the need for novel targets and considering the undiscriminating and selectivity of GPCRs, they can fulfill this purpose. Again, natural genetic variations within the human genome sometimes delude the therapeutic expectations of some drugs, resulting in medication response differences and ADRs (adverse drug reactions). Around ~30 billion US dollars are dumped annually for poor accounting of ADRs in the US alone. To curb such undesirable reactions, the knowledge of established and currently in clinical trials GPCRs families can offer huge understanding towards the drug designing prospects including "off-target" effects reducing economical resource and time. The druggability of GPCR protein families and critical roles played by them in complex diseases are explained. Class A, class B1, class C and class F are generally established family and GPCRs in phase I (19%), phase II(29%), phase III(52%) studies are also reviewed. From the phase I studies, frizzled receptors accounted for the highest in trial targets, neuropeptides in phase II and melanocortin in phase III studies. Also, the bioapplications for nanoparticles along with future prospects for both nanomedicine and GPCR drug industry are discussed. Further, the use of computational techniques and methods employed for different target validations are also reviewed along with their future potential for the GPCR based drug discovery.
Collapse
Affiliation(s)
- Surovi Saikia
- Natural Products Chemistry Group, CSIR North East Institute of Science & Technology, Jorhat-785006, Assam, India
| | - Manobjyoti Bordoloi
- Natural Products Chemistry Group, CSIR North East Institute of Science & Technology, Jorhat-785006, Assam, India
| | - Rajeev Sarmah
- Allied Health Sciences, Assam Down Town University, Panikhaiti, Guwahati 781026, Assam, India
| |
Collapse
|
49
|
Hu Y, Lu Y, Wang S, Zhang M, Qu X, Niu B. Application of Machine Learning Approaches for the Design and Study of Anticancer Drugs. Curr Drug Targets 2020; 20:488-500. [PMID: 30091413 DOI: 10.2174/1389450119666180809122244] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/17/2018] [Revised: 06/19/2018] [Accepted: 06/25/2018] [Indexed: 12/14/2022]
Abstract
BACKGROUND Globally the number of cancer patients and deaths are continuing to increase yearly, and cancer has, therefore, become one of the world's highest causes of morbidity and mortality. In recent years, the study of anticancer drugs has become one of the most popular medical topics. OBJECTIVE In this review, in order to study the application of machine learning in predicting anticancer drugs activity, some machine learning approaches such as Linear Discriminant Analysis (LDA), Principal components analysis (PCA), Support Vector Machine (SVM), Random forest (RF), k-Nearest Neighbor (kNN), and Naïve Bayes (NB) were selected, and the examples of their applications in anticancer drugs design are listed. RESULTS Machine learning contributes a lot to anticancer drugs design and helps researchers by saving time and is cost effective. However, it can only be an assisting tool for drug design. CONCLUSION This paper introduces the application of machine learning approaches in anticancer drug design. Many examples of success in identification and prediction in the area of anticancer drugs activity prediction are discussed, and the anticancer drugs research is still in active progress. Moreover, the merits of some web servers related to anticancer drugs are mentioned.
Collapse
Affiliation(s)
- Yan Hu
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Yi Lu
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Shuo Wang
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Mengying Zhang
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| | - Xiaosheng Qu
- National Engineering Laboratory of Southwest Endangered Medicinal Resources Development, Guangxi Botanical Garden of Medicinal Plants, 530023,Nanning, China
| | - Bing Niu
- School of Life Sciences, Shanghai University, Shanghai 200444, China
| |
Collapse
|
50
|
Mohabatkar H, Ebrahimi S, Moradi M. Using Chou’s Five-steps Rule to Classify and Predict Glutathione S-transferases with Different Machine Learning Algorithms and Pseudo Amino Acid Composition. Int J Pept Res Ther 2020. [DOI: 10.1007/s10989-020-10087-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/24/2023]
|