1
|
Zhang Q, Xu Y, Wang S, Wu Y, Ye Y, Yuan CA, Gribova V, Filaretov VF, Huang DS. Using Fully Convolutional Network to Locate Transcription Factor Binding Sites Based on DNA Sequence and Conservation Information. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:2690-2699. [PMID: 36374878 DOI: 10.1109/tcbb.2022.3219831] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/16/2023]
Abstract
Transcription factors (TFs) play a part in gene expression. TFs can form complex gene expression regulation system by combining with DNA. Thereby, identifying the binding regions has become an indispensable step for understanding the regulatory mechanism of gene expression. Due to the great achievements of applying deep learning (DL) to computer vision and language processing in recent years, many scholars are inspired to use these methods to predict TF binding sites (TFBSs), achieving extraordinary results. However, these methods mainly focus on whether DNA sequences include TFBSs. In this paper, we propose a fully convolutional network (FCN) coupled with refinement residual block (RRB) and global average pooling layer (GAPL), namely FCNARRB. Our model could classify binding sequences at nucleotide level by outputting dense label for input data. Experimental results on human ChIP-seq datasets show that the RRB and GAPL structures are very useful for improving model performance. Adding GAPL improves the performance by 9.32% and 7.61% in terms of IoU (Intersection of Union) and PRAUC (Area Under Curve of Precision and Recall), and adding RRB improves the performance by 7.40% and 4.64%, respectively. In addition, we find that conservation information can help locate TFBSs.
Collapse
|
2
|
Wang X, Zhang M, Long C, Yao L, Zhu M. Self-Attention Based Neural Network for Predicting RNA-Protein Binding Sites. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1469-1479. [PMID: 36067103 DOI: 10.1109/tcbb.2022.3204661] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/04/2023]
Abstract
Proteins binding to Ribonucleic Acid (RNA) inside cells are called RNA-binding proteins (RBP), which play a crucial role in gene regulation. The identification of RNA-protein binding sites helps to understand the function of RBP better. Although many computational methods have been developed to predict RNA-protein binding sites, their prediction accuracy on small sample datasets needs improvement. To overcome this limitation, we propose a novel model called SA-Net, which utilizes k-mer embedding to encode RNA sequences and a self-attention-based neural network to extract sequence features. K-mer embedding assists the model to discover significant subsequence fragments associated with binding sites. The self-attention mechanism captures contextual information from the entire input sequence globally, performing well in small sample sequence learning. Experimental results demonstrate that SA-Net attains state-of-the-art results on the RBP-24 dataset. We find that 4-mer embedding aids the model to achieve optimal performance. We also show that the self-attention network outperforms the commonly used CNN and CNN-BLSTM models in sequence feature extraction.
Collapse
|
3
|
Han K, Wang J, Wang Y, Zhang L, Yu M, Xie F, Zheng D, Xu Y, Ding Y, Wan J. A review of methods for predicting DNA N6-methyladenine sites. Brief Bioinform 2023; 24:6887111. [PMID: 36502371 DOI: 10.1093/bib/bbac514] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2022] [Revised: 10/07/2022] [Accepted: 10/27/2022] [Indexed: 12/14/2022] Open
Abstract
Deoxyribonucleic acid(DNA) N6-methyladenine plays a vital role in various biological processes, and the accurate identification of its site can provide a more comprehensive understanding of its biological effects. There are several methods for 6mA site prediction. With the continuous development of technology, traditional techniques with the high costs and low efficiencies are gradually being replaced by computer methods. Computer methods that are widely used can be divided into two categories: traditional machine learning and deep learning methods. We first list some existing experimental methods for predicting the 6mA site, then analyze the general process from sequence input to results in computer methods and review existing model architectures. Finally, the results were summarized and compared to facilitate subsequent researchers in choosing the most suitable method for their work.
Collapse
Affiliation(s)
- Ke Han
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China.,College of Pharmacy, Harbin University of Commerce, Harbin, 150076, China
| | - Jianchun Wang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yu Wang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Lei Zhang
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Mengyao Yu
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Fang Xie
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Dequan Zheng
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yaoqun Xu
- School of Computer and Information Engineering, Heilongjiang Provincial Key Laboratory of Electronic Commerce and Information Processing, Harbin University of Commerce, Harbin, 150028, China
| | - Yijie Ding
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, 324000, China
| | - Jie Wan
- Laboratory for Space Environment and Physical Sciences, Harbin Institute of Technology, Harbin, 150001, China
| |
Collapse
|
4
|
Laverty KU, Jolma A, Pour SE, Zheng H, Ray D, Morris Q, Hughes TR. PRIESSTESS: interpretable, high-performing models of the sequence and structure preferences of RNA-binding proteins. Nucleic Acids Res 2022; 50:e111. [PMID: 36018788 DOI: 10.1093/nar/gkac694] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2021] [Revised: 07/22/2022] [Accepted: 08/03/2022] [Indexed: 12/23/2022] Open
Abstract
Modelling both primary sequence and secondary structure preferences for RNA binding proteins (RBPs) remains an ongoing challenge. Current models use varied RNA structure representations and can be difficult to interpret and evaluate. To address these issues, we present a universal RNA motif-finding/scanning strategy, termed PRIESSTESS (Predictive RBP-RNA InterpretablE Sequence-Structure moTif regrESSion), that can be applied to diverse RNA binding datasets. PRIESSTESS identifies dozens of enriched RNA sequence and/or structure motifs that are subsequently reduced to a set of core motifs by logistic regression with LASSO regularization. Importantly, these core motifs are easily visualized and interpreted, and provide a measure of RBP secondary structure specificity. We used PRIESSTESS to interrogate new HTR-SELEX data for 23 RBPs with diverse RNA binding modes and captured known primary sequence and secondary structure preferences for each. Moreover, when applying PRIESSTESS to 144 RBPs across 202 RNA binding datasets, 75% showed an RNA secondary structure preference but only 10% had a preference besides unpaired bases, suggesting that most RBPs simply recognize the accessibility of primary sequences.
Collapse
Affiliation(s)
- Kaitlin U Laverty
- Department of Molecular Genetics, University of Toronto, Toronto, Canada
| | - Arttu Jolma
- Department of Molecular Genetics, University of Toronto, Toronto, Canada.,Donnelly Centre, University of Toronto, Toronto, Canada
| | - Sara E Pour
- Department of Molecular Genetics, University of Toronto, Toronto, Canada
| | - Hong Zheng
- Donnelly Centre, University of Toronto, Toronto, Canada
| | - Debashish Ray
- Donnelly Centre, University of Toronto, Toronto, Canada
| | - Quaid Morris
- Department of Molecular Genetics, University of Toronto, Toronto, Canada.,Computational and Systems Biology, Memorial Sloan Kettering Cancer Center, New York, USA
| | - Timothy R Hughes
- Department of Molecular Genetics, University of Toronto, Toronto, Canada.,Donnelly Centre, University of Toronto, Toronto, Canada
| |
Collapse
|
5
|
Ali SD, Alam W, Tayara H, Chong KT. Identification of Functional piRNAs Using a Convolutional Neural Network. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:1661-1669. [PMID: 33119510 DOI: 10.1109/tcbb.2020.3034313] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
Piwi-interacting RNAs (piRNAs) are a distinct sub-class of small non-coding RNAs that are mainly responsible for germline stem cell maintenance, gene stability, and maintaining genome integrity by repression of transposable elements. piRNAs are also expressed aberrantly and associated with various kinds of cancers. To identify piRNAs and their role in guiding target mRNA deadenylation, the currently available computational methods require urgent improvements in performance. To facilitate this, we propose a robust predictor based on a lightweight and simplified deep learning architecture using a convolutional neural network (CNN) to extract significant features from raw RNA sequences without the need for more customized features. The proposed model's performance is comprehensively evaluated using k-fold cross-validation on a benchmark dataset. The proposed model significantly outperforms existing computational methods in the prediction of piRNAs and their role in target mRNA deadenylation. In addition, a user-friendly and publicly-accessible web server is available at http://nsclbio.jbnu.ac.kr/tools/2S-piRCNN/.
Collapse
|
6
|
Hussain H, Tamizharasan PS, Rahul CS. Design possibilities and challenges of DNN models: a review on the perspective of end devices. Artif Intell Rev 2022. [DOI: 10.1007/s10462-022-10138-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023]
|
7
|
i4mC-Deep: An Intelligent Predictor of N4-Methylcytosine Sites Using a Deep Learning Approach with Chemical Properties. Genes (Basel) 2021; 12:genes12081117. [PMID: 34440291 PMCID: PMC8393747 DOI: 10.3390/genes12081117] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2021] [Revised: 07/15/2021] [Accepted: 07/16/2021] [Indexed: 01/26/2023] Open
Abstract
DNA is subject to epigenetic modification by the molecule N4-methylcytosine (4mC). N4-methylcytosine plays a crucial role in DNA repair and replication, protects host DNA from degradation, and regulates DNA expression. However, though current experimental techniques can identify 4mC sites, such techniques are expensive and laborious. Therefore, computational tools that can predict 4mC sites would be very useful for understanding the biological mechanism of this vital type of DNA modification. Conventional machine-learning-based methods rely on hand-crafted features, but the new method saves time and computational cost by making use of learned features instead. In this study, we propose i4mC-Deep, an intelligent predictor based on a convolutional neural network (CNN) that predicts 4mC modification sites in DNA samples. The CNN is capable of automatically extracting important features from input samples during training. Nucleotide chemical properties and nucleotide density, which together represent a DNA sequence, act as CNN input data. The outcome of the proposed method outperforms several state-of-the-art predictors. When i4mC-Deep was used to analyze G. subterruneus DNA, the accuracy of the results was improved by 3.9% and MCC increased by 10.5% compared to a conventional predictor.
Collapse
|
8
|
XG-ac4C: identification of N4-acetylcytidine (ac4C) in mRNA using eXtreme gradient boosting with electron-ion interaction pseudopotentials. Sci Rep 2020; 10:20942. [PMID: 33262392 PMCID: PMC7708984 DOI: 10.1038/s41598-020-77824-2] [Citation(s) in RCA: 33] [Impact Index Per Article: 6.6] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2020] [Accepted: 10/22/2020] [Indexed: 02/06/2023] Open
Abstract
N4-acetylcytidine (ac4C) is a post-transcriptional modification in mRNA which plays a major role in the stability and regulation of mRNA translation. The working mechanism of ac4C modification in mRNA is still unclear and traditional laboratory experiments are time-consuming and expensive. Therefore, we propose an XG-ac4C machine learning model based on the eXtreme Gradient Boost classifier for the identification of ac4C sites. The XG-ac4C model uses a combination of electron-ion interaction pseudopotentials and electron-ion interaction pseudopotentials of trinucleotide of the nucleotides in ac4C sites. Moreover, Shapley additive explanations and local interpretable model-agnostic explanations are applied to understand the importance of features and their contribution to the final prediction outcome. The obtained results demonstrate that XG-ac4C outperforms existing state-of-the-art methods. In more detail, the proposed model improves the area under the precision-recall curve by 9.4% and 9.6% in cross-validation and independent tests, respectively. Finally, a user-friendly web server based on the proposed model for ac4C site identification is made freely available at http://nsclbio.jbnu.ac.kr/tools/xgac4c/ .
Collapse
|
9
|
Chantsalnyam T, Lim DY, Tayara H, Chong KT. ncRDeep: Non-coding RNA classification with convolutional neural network. Comput Biol Chem 2020; 88:107364. [DOI: 10.1016/j.compbiolchem.2020.107364] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2020] [Revised: 08/04/2020] [Accepted: 08/18/2020] [Indexed: 12/21/2022]
|