101
Henderson J, Ly V, Olichwier S, Chainani P, Liu Y, Soibam B. Accurate prediction of boundaries of high resolution topologically associated domains (TADs) in fruit flies using deep learning. Nucleic Acids Res 2020; 47:e78. [PMID: 31049567 PMCID: PMC6648328 DOI: 10.1093/nar/gkz315] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 01/09/2019] [Revised: 04/02/2019] [Accepted: 04/18/2019] [Indexed: 01/01/2023] Open
Abstract
Genomes are organized into self-interacting chromatin regions called topologically associated domains (TADs). A significant number of TAD boundaries are shared across multiple cell types and conserved across species. Disruption of TAD boundaries may affect the expression of nearby genes and could lead to several diseases. Even though detection of TAD boundaries is important and useful, there are experimental challenges in obtaining high resolution TAD locations. Here, we present computational prediction of TAD boundaries from high resolution Hi-C data in fruit flies. By extensive exploration and testing of several deep learning model architectures with hyperparameter optimization, we show that a unique deep learning model consisting of three convolution layers followed by a long short-term-memory layer achieves an accuracy of 96%. This outperforms feature-based models’ accuracy of 91% and an existing method's accuracy of 73–78% based on motif TRAP scores. Our method also detects previously reported motifs such as Beaf-32 that are enriched in TAD boundaries in fruit flies and also several unreported motifs.
Affiliation(s)
- John Henderson
- Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX 77002, USA
- Vi Ly
- Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX 77002, USA
- Shawn Olichwier
- Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX 77002, USA
- Pranik Chainani
- Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX 77002, USA
- Yu Liu
- Biology and Biochemistry, University of Houston, Houston, TX 77204, USA
- Benjamin Soibam
- Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX 77002, USA
102
Zhang K, Pan X, Yang Y, Shen HB. CRIP: predicting circRNA-RBP-binding sites using a codon-based encoding and hybrid deep neural networks. RNA (NEW YORK, N.Y.) 2019; 25:1604-1615. [PMID: 31537716 PMCID: PMC6859861 DOI: 10.1261/rna.070565.119] [Citation(s) in RCA: 68] [Impact Index Per Article: 11.3] [Received: 02/18/2019] [Accepted: 08/21/2019] [Indexed: 05/21/2023]
Abstract
Circular RNAs (circRNAs), with their crucial roles in gene regulation and disease development, have become rising stars in the RNA world. To understand the regulatory function of circRNAs, many studies focus on the interactions between circRNAs and RNA-binding proteins (RBPs). Recently, the abundant CLIP-seq experimental data has enabled the large-scale identification and analysis of circRNA-RBP interactions, whereas, as far as we know, no computational tool based on machine learning has been proposed yet. We develop CRIP (CircRNAs Interact with Proteins) for the prediction of RBP-binding sites on circRNAs using RNA sequences alone. CRIP consists of a stacked codon-based encoding scheme and a hybrid deep learning architecture, in which a convolutional neural network (CNN) learns high-level abstract features and a recurrent neural network (RNN) learns long dependency in the sequences. We construct 37 data sets including sequence fragments of binding sites on circRNAs, and each set corresponds to an RBP. The experimental results show that the new encoding scheme is superior to the existing feature representation methods for RNA sequences, and the hybrid network outperforms conventional classifiers by a large margin, where both the CNN and RNN components contribute to the performance improvement.
Affiliation(s)
- Kaiming Zhang
- Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
- Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
- Department of Medical Informatics, Erasmus Medical Center, Rotterdam 3015 CE, Netherlands
- Yang Yang
- Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
- Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
103
Ju Y, Yuan L, Yang Y, Zhao H. CircSLNN: Identifying RBP-Binding Sites on circRNAs via Sequence Labeling Neural Networks. Front Genet 2019; 10:1184. [PMID: 31824574 PMCID: PMC6886371 DOI: 10.3389/fgene.2019.01184] [Citation(s) in RCA: 36] [Impact Index Per Article: 6.0] [Received: 08/24/2019] [Accepted: 10/25/2019] [Indexed: 11/28/2022] Open
Abstract
The interactions between RNAs and RNA binding proteins (RBPs) are crucial for understanding post-transcriptional regulation mechanisms. Many computational tools have been developed to automatically predict the binding relationship between RNAs and RBPs. However, most of these methods can only predict the presence or absence of binding sites for a sequence fragment, without providing specific information on the position or length of the binding sites. Moreover, the existing tools focus on the interaction between RBPs and linear RNAs, while the binding sites on circular RNAs (circRNAs) have rarely been studied. In this study, we model the prediction of binding sites on RNAs as a sequence labeling problem, and propose a new model called circSLNN to identify the specific location of RBP-binding sites on circRNAs. CircSLNN is driven by pretrained RNA embedding vectors and a composite labeling model. On our constructed circRNA datasets, our model has an average F1 score of 0.790. When assessed on full-length RNA sequences, the proposed model outperforms previous classification-based models by a large margin.
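To make the sequence-labeling formulation concrete, the following is a minimal sketch (not the authors' implementation) of how per-nucleotide labels can be turned into binding-site positions and lengths, and how a per-nucleotide F1 score can be computed. The function names and the 0/1 labeling scheme are assumptions made for illustration; the abstract does not specify the label set or how its F1 score is aggregated.

```python
def labels_to_sites(labels):
    """Convert per-nucleotide labels (1 = bound, 0 = unbound)
    into a list of (start, length) binding-site spans."""
    sites = []
    start = None
    for i, lab in enumerate(labels):
        if lab == 1 and start is None:
            start = i                      # a new site begins here
        elif lab != 1 and start is not None:
            sites.append((start, i - start))
            start = None
    if start is not None:                  # site runs to the end of the sequence
        sites.append((start, len(labels) - start))
    return sites


def f1_score(true_labels, pred_labels):
    """Per-nucleotide F1 for the positive (bound) class."""
    pairs = list(zip(true_labels, pred_labels))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

A fragment-level classifier would only report whether any label in the window is 1; the span list is the extra position and length information that a labeling model can provide.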
Affiliation(s)
- Yuqi Ju
- Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Liangliang Yuan
- Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China
- Yang Yang
- Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China; Brain Science and Technology Research Center, Shanghai Jiao Tong University, Shanghai, China
- Hai Zhao
- Center for Brain-Like Computing and Machine Intelligence, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai Jiao Tong University, Shanghai, China; Brain Science and Technology Research Center, Shanghai Jiao Tong University, Shanghai, China
104
Luo X, Chi W, Deng M. Deepprune: Learning Efficient and Interpretable Convolutional Networks Through Weight Pruning for Predicting DNA-Protein Binding. Front Genet 2019; 10:1145. [PMID: 31824562 PMCID: PMC6879555 DOI: 10.3389/fgene.2019.01145] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 08/13/2019] [Accepted: 10/21/2019] [Indexed: 12/02/2022] Open
Abstract
Convolutional neural network (CNN) based methods have outperformed conventional machine learning methods in predicting the binding preference of DNA-protein binding. Although studies in the past have shown that more convolutional kernels help to achieve better performance, visualization of the model can be obscured by the use of many kernels, resulting in overfitting and reduced interpretation because the number of motifs in true models is limited. Therefore, we aim to arrive at high performance, but with limited kernel numbers, in CNN-based models for motif inference. We herein present Deepprune, a novel deep learning framework, which prunes the weights in the dense layer and fine-tunes iteratively. These two steps enable the training of CNN-based models with limited kernel numbers, allowing easy interpretation of the learned model. We demonstrate that Deepprune significantly improves motif inference performance for the simulated datasets. Furthermore, we show that Deepprune outperforms the baseline with limited kernel numbers when inferring DNA-binding sites from ChIP-seq data.
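As a rough illustration of the prune step described above, the sketch below zeroes out a fraction of dense-layer weights and returns a mask that a fine-tuning loop could reapply after each update. The magnitude-based criterion, the function name `prune_smallest`, and the plain nested-list weight format are assumptions for illustration; the abstract does not state which pruning criterion Deepprune uses.

```python
def prune_smallest(weights, fraction):
    """Zero out roughly `fraction` of the weights with the smallest
    absolute value. Returns (pruned_weights, mask), where mask holds
    1 for kept weights and 0 for pruned ones. Ties at the threshold
    are all pruned, so slightly more than `fraction` may be removed."""
    magnitudes = sorted(abs(w) for row in weights for w in row)
    n_prune = int(len(magnitudes) * fraction)
    if n_prune == 0:
        return [row[:] for row in weights], [[1] * len(row) for row in weights]
    threshold = magnitudes[n_prune - 1]
    mask = [[0 if abs(w) <= threshold else 1 for w in row] for row in weights]
    pruned = [[w if m else 0.0 for w, m in zip(row, mrow)]
              for row, mrow in zip(weights, mask)]
    return pruned, mask
```

In an iterative scheme, one would alternate between calling such a function with a growing `fraction` and fine-tuning the surviving weights while keeping masked entries at zero.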
Affiliation(s)
- Xiao Luo
- School of Mathematical Sciences, Peking University, Beijing, China
- Weilai Chi
- Center for Quantitative Biology, Peking University, Beijing, China
- Minghua Deng
- School of Mathematical Sciences, Peking University, Beijing, China
- Center for Quantitative Biology, Peking University, Beijing, China
105
Wang Z, Lei X, Wu FX. Identifying Cancer-Specific circRNA-RBP Binding Sites Based on Deep Learning. Molecules 2019; 24:E4035. [PMID: 31703384 PMCID: PMC6891306 DOI: 10.3390/molecules24224035] [Citation(s) in RCA: 50] [Impact Index Per Article: 8.3] [Received: 09/29/2019] [Revised: 10/25/2019] [Accepted: 11/06/2019] [Indexed: 12/17/2022] Open
Abstract
Circular RNAs (circRNAs) are extensively expressed in cells and tissues, and play crucial roles in human diseases and biological processes. Recent studies have reported that circRNAs can function as RNA binding protein (RBP) sponges, while RBPs can also be involved in back-splicing. The interaction with RBPs is also considered an important factor for investigating the function of circRNAs. Hence, it is necessary to understand the interaction mechanisms of circRNAs and RBPs, especially in human cancers. Here, we present a novel method based on deep learning to identify cancer-specific circRNA-RBP binding sites (CSCRSites), using only the nucleotide sequences as input. In CSCRSites, an architecture with multiple convolution layers is used to detect the features of the raw circRNA sequence fragments, and the binding sites are then identified through a fully connected layer with a softmax output. The experimental results show that CSCRSites outperforms conventional machine learning classifiers and some representative deep learning methods on the benchmark data. In addition, the features learnt by CSCRSites are converted to sequence motifs, some of which match known human RNA motifs involved in human diseases, especially cancer. Therefore, as a deep learning-based tool, CSCRSites could significantly contribute to the function analysis of cancer-associated circRNAs.
Affiliation(s)
- Zhengfeng Wang
- School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
- College of Information Science and Engineering, Guilin University of Technology, Guilin 541004, China
- Xiujuan Lei
- School of Computer Science, Shaanxi Normal University, Xi’an 710119, China
- Fang-Xiang Wu
- Department of Mechanical Engineering and Division of Biomedical Engineering, University of Saskatchewan, Saskatoon, SK S7N 5A9, Canada
106
Sagar A, Xue B. Recent Advances in Machine Learning Based Prediction of RNA-protein Interactions. Protein Pept Lett 2019; 26:601-619. [PMID: 31215361 DOI: 10.2174/0929866526666190619103853] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 10/16/2018] [Revised: 04/04/2019] [Accepted: 06/01/2019] [Indexed: 12/18/2022]
Abstract
The interactions between RNAs and proteins play critical roles in many biological processes. Therefore, characterizing these interactions is critical for mechanistic, biomedical, and clinical studies. Many experimental methods can be used to determine different aspects of RNA-protein interactions. However, because RNA-protein interactions are tissue-specific and condition-specific, and because these interactions are weak and frequently compete with each other, experimental techniques alone cannot uncover the complete spectrum of RNA-protein interactions. To mitigate these issues, continuous efforts have been devoted to developing high quality computational techniques to study the interactions between RNAs and proteins. Considerable progress has been achieved with the application of novel techniques and strategies, such as machine learning. In particular, with the development and application of CLIP techniques, more and more experimental data on RNA-protein interactions under specific biological conditions are available. Together, these CLIP data provide a rich source for developing advanced machine learning predictors. This review summarizes recent progress on computational predictors for RNA-protein interaction in the following aspects: datasets, prediction strategies, and input features. Possible future developments are discussed at the end of the review.
Affiliation(s)
- Amit Sagar
- Department of Cell Biology, Microbiology and Molecular Biology, School of Natural Sciences and Mathematics, College of Arts and Sciences, University of South Florida, Tampa, Florida 33620, United States
- Bin Xue
- Department of Cell Biology, Microbiology and Molecular Biology, School of Natural Sciences and Mathematics, College of Arts and Sciences, University of South Florida, Tampa, Florida 33620, United States
107
Yang F, Liu Y, Wang Y, Yin Z, Yang Z. MIC_Locator: a novel image-based protein subcellular location multi-label prediction model based on multi-scale monogenic signal representation and intensity encoding strategy. BMC Bioinformatics 2019; 20:522. [PMID: 31655541 PMCID: PMC6815465 DOI: 10.1186/s12859-019-3136-3] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 06/14/2019] [Accepted: 10/09/2019] [Indexed: 12/20/2022] Open
Abstract
Background: Protein subcellular localization plays a crucial role in understanding cell function. Proteins need to be in the right place at the right time, and combine with the corresponding molecules to fulfill their functions. Furthermore, prediction of protein subcellular location not only plays a guiding role in drug design and development, owing to potential molecular targets, but is also essential in genome annotation. Current image-based protein subcellular localization has three common drawbacks: obsolete datasets without updated label information, stereotypical feature descriptors on the spatial domain or grey level, and prediction algorithms that are limited to handling single-label databases. Results: In this paper, a novel human protein subcellular localization prediction model, MIC_Locator, is proposed. Firstly, the latest datasets are collected and collated as our benchmark dataset, instead of obsolete data, for training the prediction model. Secondly, Fourier transformation, Riesz transformation, a Log-Gabor filter and an intensity coding strategy are employed to obtain frequency features based on the three components of the monogenic signal at different frequency scales. Thirdly, a chained prediction model is proposed to handle multi-label instead of single-label datasets. The experimental results show that MIC_Locator achieves 60.56% subset accuracy and outperforms the majority of existing prediction models, and that the frequency features and intensity coding strategy help to improve classification accuracy. Conclusions: Our results demonstrate that frequency features are more beneficial for improving model performance than features extracted from the spatial domain, and that MIC_Locator can speed up validation of protein annotation, knowledge of protein function and proteomics research.
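Subset accuracy is a strict multi-label metric: a sample counts as correct only if its predicted label set matches the true label set exactly. The sketch below shows the standard definition; the function name and the example location labels in the usage note are illustrative and are not taken from the paper's dataset.

```python
def subset_accuracy(true_sets, pred_sets):
    """Fraction of samples whose predicted label set equals the
    true label set exactly (also called exact-match ratio)."""
    if len(true_sets) != len(pred_sets):
        raise ValueError("true_sets and pred_sets must have the same length")
    if not true_sets:
        raise ValueError("at least one sample is required")
    exact = sum(1 for t, p in zip(true_sets, pred_sets) if set(t) == set(p))
    return exact / len(true_sets)
```

For example, with four proteins whose true locations are `{"nucleus"}`, `{"cytosol", "nucleus"}`, `{"mitochondria"}` and `{"golgi"}`, a predictor that gets the first and third exactly right but misses or adds a label on the other two scores 0.5, even though most individual labels are correct. This strictness is one reason subset accuracy values tend to look low compared with single-label accuracy.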
Affiliation(s)
- Fan Yang
- School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China; Department of Biological Chemistry and Molecular Pharmacology, Harvard Medical School, Boston, MA, 02115, USA.
- Yang Liu
- School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
- Yanbin Wang
- School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
- Zhijian Yin
- School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
- Zhen Yang
- School of Communications and Electronics, Jiangxi Science & Technology Normal University, Nanchang, 330003, China
108
Pan X, Shen HB. Inferring Disease-Associated MicroRNAs Using Semi-supervised Multi-Label Graph Convolutional Networks. iScience 2019; 20:265-277. [PMID: 31605942 PMCID: PMC6817654 DOI: 10.1016/j.isci.2019.09.013] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.2] [Received: 06/11/2019] [Revised: 09/05/2019] [Accepted: 09/11/2019] [Indexed: 01/22/2023] Open
Abstract
MicroRNAs (miRNAs) play crucial roles in biological processes involved in diseases. The associations between diseases and protein-coding genes (PCGs) have been well investigated, and miRNAs interact with PCGs to trigger them to be functional. We present a computational method, DimiG, to infer miRNA-associated diseases using a semi-supervised Graph Convolutional Network model (GCN). DimiG uses a multi-label framework to integrate PCG-PCG interactions, PCG-miRNA interactions, PCG-disease associations, and tissue expression profiles. DimiG is trained on disease-PCG associations and an interaction network using a GCN, which is further used to score associations between diseases and miRNAs. We evaluate DimiG on a benchmark set from verified disease-miRNA associations. Our results demonstrate that DimiG outperforms the best unsupervised method and is comparable to two supervised methods. Three case studies of prostate cancer, lung cancer, and inflammatory bowel disease further demonstrate the efficacy of DimiG, where top miRNAs predicted by DimiG are supported by literature.
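For readers unfamiliar with graph convolution, the following is a minimal pure-Python sketch of one commonly used GCN propagation step, ReLU(D^-1/2 (A + I) D^-1/2 H W), in which each node averages features over itself and its neighbours before a learned linear map. It assumes this standard formulation; the abstract does not specify DimiG's exact layer definition, and the function names here are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def gcn_layer(adj, features, weights):
    """One graph-convolution step on a small dense graph.
    adj: n x n adjacency (0/1), features: n x f, weights: f x out."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric degree normalization.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    out = matmul(matmul(norm, features), weights)
    return [[max(0.0, v) for v in row] for row in out]   # ReLU
```

In a setting like the one described, nodes would be PCGs and miRNAs, edges would be their interactions, and only PCG nodes would carry disease labels during training.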
Affiliation(s)
- Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China; Department of Medical informatics, Erasmus Medical Center, 3015 CE Rotterdam, the Netherlands.
- Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, 200240 Shanghai, China.
109
Zhang SW, Wang Y, Zhang XX, Wang JQ. Prediction of the RBP binding sites on lncRNAs using the high-order nucleotide encoding convolutional neural network. Anal Biochem 2019; 583:113364. [PMID: 31323206 DOI: 10.1016/j.ab.2019.113364] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 04/13/2019] [Revised: 07/10/2019] [Accepted: 07/15/2019] [Indexed: 01/09/2023]
Abstract
Long non-coding RNAs (lncRNAs) play an important role in cells through their interaction with RNA-binding proteins (RBPs). Finding the RBP binding sites on lncRNA chains can help to understand post-transcriptional regulatory mechanisms and to explore the pathogenesis of cancers and possible roles in other diseases. Although many genome-wide experimental techniques can identify RNA-protein interactions and detect the binding sites on RNA chains, they are still time-consuming, labor-intensive and costly. Thus, many computational methods have been developed to predict RBP binding sites by integrating RNA sequence, structure and domain-specific features. However, current approaches that focus on predicting RBP binding sites on RNA chains do not consider the dependencies among nucleotides. In this work, we propose a high-order nucleotide encoding convolutional neural network-based method (namely HOCNNLB) to predict RBP binding sites on lncRNA chains. HOCNNLB first employs a high-order one-hot encoding strategy to encode the lncRNA sequences by considering the dependence among nucleotides; the encoded lncRNA sequences are then fed into a convolutional neural network (CNN) to predict the RBP binding sites. We evaluate HOCNNLB on 31 experimental datasets of 12 lncRNA binding proteins. HOCNNLB achieves an average AUC of 0.953, which is 0.247 and 0.175 higher than that of iDeepS and DeepBind, respectively. The average accuracy is 90.2%, which is 26.8% and 19.5% higher than that of iDeepS and DeepBind, respectively. These results demonstrate that HOCNNLB can reliably predict the RBP binding sites on lncRNA chains and outperforms the state-of-the-art methods. The source code of HOCNNLB and the datasets used in this work are available at https://github.com/NWPU-903PR/HOCNNLB for academic users.
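One plausible reading of a high-order one-hot encoding is to one-hot encode overlapping k-mers instead of single nucleotides, so that each position carries information about its neighbours. The sketch below follows that reading; it is an assumption for illustration, and the function name, the default `order=2`, and the handling of unknown bases are not taken from the HOCNNLB source code.

```python
from itertools import product


def high_order_one_hot(seq, order=2, alphabet="ACGU"):
    """Encode a sequence as one-hot vectors over overlapping k-mers
    (k = order). Returns len(seq) - order + 1 rows, each of length
    len(alphabet) ** order. A window containing a symbol outside
    the alphabet (e.g. N) becomes an all-zero row."""
    kmers = ["".join(p) for p in product(alphabet, repeat=order)]
    index = {kmer: i for i, kmer in enumerate(kmers)}
    rows = []
    for i in range(len(seq) - order + 1):
        vec = [0] * len(kmers)
        kmer = seq[i:i + order]
        if kmer in index:
            vec[index[kmer]] = 1
        rows.append(vec)
    return rows
```

With `order=1` this reduces to ordinary one-hot encoding with 4 channels; with `order=2` each position has 16 channels, and the resulting matrix can be fed to a 1D convolution in the same way.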
Affiliation(s)
- Shao-Wu Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China.
- Ya Wang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China
- Xi-Xi Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China
- Jia-Qi Wang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, China
110
Chen M, Ju CJT, Zhou G, Chen X, Zhang T, Chang KW, Zaniolo C, Wang W. Multifaceted protein-protein interaction prediction based on Siamese residual RCNN. Bioinformatics 2019; 35:i305-i314. [PMID: 31510705 PMCID: PMC6681469 DOI: 10.1093/bioinformatics/btz328] [Citation(s) in RCA: 160] [Impact Index Per Article: 26.7] [Indexed: 12/12/2022] Open
Abstract
Motivation: Sequence-based protein-protein interaction (PPI) prediction represents a fundamental computational biology problem. To address this problem, extensive research efforts have been made to extract predefined features from the sequences. Based on these features, statistical algorithms are learned to classify the PPIs. However, such explicit features are usually costly to extract, and typically have limited coverage of the PPI information. Results: We present an end-to-end framework, PIPR (Protein-Protein Interaction Prediction Based on Siamese Residual RCNN), for PPI predictions using only the protein sequences. PIPR incorporates a deep residual recurrent convolutional neural network in the Siamese architecture, which leverages both robust local features and contextualized information, both of which are significant for capturing the mutual influence of protein sequences. PIPR relieves the data pre-processing efforts that are required by other systems, and generalizes well to different application scenarios. Experimental evaluations show that PIPR outperforms various state-of-the-art systems on the binary PPI prediction problem. Moreover, it shows promising performance on the more challenging problems of interaction type prediction and binding affinity estimation, where existing approaches fall short. Availability and implementation: The implementation is available at https://github.com/muhaochen/seq_ppi.git. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Muhao Chen
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Chelsea J -T Ju
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Guangyu Zhou
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Xuelu Chen
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Tianran Zhang
- Department of Bioengineering, University of California, Los Angeles, Los Angeles, CA, USA
- Kai-Wei Chang
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Carlo Zaniolo
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
- Wei Wang
- Department of Computer Science, University of California, Los Angeles, Los Angeles, CA, USA
111
Nguyen TTD, Le NQK, Kusuma RMI, Ou YY. Prediction of ATP-binding sites in membrane proteins using a two-dimensional convolutional neural network. J Mol Graph Model 2019; 92:86-93. [PMID: 31344547 DOI: 10.1016/j.jmgm.2019.07.003] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 04/30/2019] [Revised: 06/24/2019] [Accepted: 07/13/2019] [Indexed: 12/28/2022]
Abstract
Membrane proteins, the most important drug targets, account for around 30% of total proteins encoded by the genome of living organisms. An important role of these proteins is to bind adenosine triphosphate (ATP), facilitating crucial biological processes such as metabolism and cell signaling. There are several reports elucidating ATP-binding sites within proteins. However, such studies on membrane proteins are limited. Our prediction tool, DeepATP, combines evolutionary information in the form of Position Specific Scoring Matrix and two-dimensional Convolutional Neural Network to predict ATP-binding sites in membrane proteins with an MCC of 0.89 and an AUC of 99%. Compared to recently published ATP-binding site predictors and classifiers that use traditional machine learning algorithms, our approach performs significantly better. We suggest this method as a reliable tool for biologists for ATP-binding site prediction in membrane proteins.
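The Matthews correlation coefficient (MCC) reported above is useful for binding-site prediction because binding residues are usually far rarer than non-binding ones, and MCC accounts for all four cells of the confusion matrix. A minimal sketch of the standard formula follows; the convention of returning 0.0 when the denominator is zero is a common choice and an assumption here, not something stated in the abstract.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts.
    Ranges from -1 (total disagreement) through 0 (no better than
    chance) to +1 (perfect prediction)."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0          # undefined case, conventionally reported as 0
    return (tp * tn - fp * fn) / denom
```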
Affiliation(s)
- Nguyen-Quoc-Khanh Le
- School of Humanities, Nanyang Technological University, 48 Nanyang Ave, 6397983, Singapore
- Yu-Yen Ou
- Department of Computer Science and Engineering, Yuan Ze University, Chung-Li, 32003, Taiwan.
112
Wekesa JS, Luan Y, Chen M, Meng J. A Hybrid Prediction Method for Plant lncRNA-Protein Interaction. Cells 2019; 8:E521. [PMID: 31151273 PMCID: PMC6627874 DOI: 10.3390/cells8060521] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Received: 04/26/2019] [Revised: 05/22/2019] [Accepted: 05/29/2019] [Indexed: 01/23/2023] Open
Abstract
The identification and analysis of long non-protein-coding RNAs (lncRNAs) are pervasive in transcriptome studies due to their roles in biological processes. In particular, lncRNA-protein interaction has plausible relevance to gene expression regulation and to cellular processes such as pathogen resistance in plants. While lncRNA-protein interaction has been studied in animals, there has yet to be extensive research in plants. In this paper, we propose a novel plant lncRNA-protein interaction prediction method, namely PLRPIM, which combines deep learning and shallow machine learning methods. The selection of an optimal feature subset and subsequent efficient compression are significant challenges for deep learning models. The proposed method adopts k-mer features and extracts high-level abstract sequence-based features using a stacked sparse autoencoder. Based on the extracted features, a fusion of random forest (RF) and light gradient boosting machine (LGBM) is used to build the prediction model. Performance is evaluated on Arabidopsis thaliana and Zea mays datasets. Experimental results demonstrate PLRPIM's superiority over other prediction tools on the two datasets. Based on 5-fold cross-validation, we obtain 89.98% and 93.44% accuracy, and 0.954 and 0.982 AUC, for Arabidopsis thaliana and Zea mays, respectively. PLRPIM predicts potential lncRNA-protein interaction pairs effectively, which can facilitate lncRNA-related research including function prediction.
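The 5-fold cross-validation mentioned above partitions the samples into five folds and uses each fold once as the held-out test set. The sketch below generates such splits over sample indices; it is a generic illustration with a hypothetical function name, it does not shuffle or stratify, and it is not the authors' evaluation script.

```python
def k_fold_indices(n_samples, k=5):
    """Return a list of (train_indices, test_indices) pairs for k-fold
    cross-validation. Folds are contiguous; the first n_samples % k
    folds receive one extra sample. No shuffling or stratification
    is applied, so shuffle the data beforehand if its order matters."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    folds = []
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train_idx, test_idx))
        start += size
    return folds
```

The reported accuracy and AUC would then typically be the mean of the per-fold scores, with the model retrained from scratch on each training split.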
Affiliation(s)
- Jael Sanyanda Wekesa
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, Liaoning, China.
- Department of Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi 62000-00200, Kenya.
- Yushi Luan
- School of Bioengineering, Dalian University of Technology, Dalian 116023, Liaoning, China.
- Ming Chen
- College of Life Sciences, Zhejiang University, Hangzhou 310058, Zhejiang, China.
- Jun Meng
- School of Computer Science and Technology, Dalian University of Technology, Dalian 116023, Liaoning, China.
113
Pan X, Yang Y, Xia C, Mirza AH, Shen H. Recent methodology progress of deep learning for RNA–protein interaction prediction. WILEY INTERDISCIPLINARY REVIEWS-RNA 2019; 10:e1544. [DOI: 10.1002/wrna.1544] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Received: 02/28/2019] [Revised: 04/07/2019] [Accepted: 04/11/2019] [Indexed: 12/17/2022]
Affiliation(s)
- Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- IDLab, Department for Electronics and Information Systems, Ghent University, Ghent, Belgium
- BASF Agriculture Solution, Ghent, Belgium
| | - Yang Yang
- Department of Computer Science, Shanghai Jiao Tong University, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, China
| | - Chun‐Qiu Xia
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
| | - Aashiq H. Mirza
- Department of Pharmacology, Weill Cornell Medicine, New York, New York
| | - Hong‐Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
- Department of Computer Science, Shanghai Jiao Tong University, and Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, China
| |
|
114
|
Chung T, Kim D. Prediction of binding property of RNA-binding proteins using multi-sized filters and multi-modal deep convolutional neural network. PLoS One 2019; 14:e0216257. [PMID: 31026297 PMCID: PMC6485761 DOI: 10.1371/journal.pone.0216257] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 03/06/2019] [Accepted: 04/16/2019] [Indexed: 11/18/2022] Open
Abstract
RNA-binding proteins (RBPs) are important in the regulation of gene expression through post-transcriptional control of RNAs, and in the development and function of the immune system. Thanks to sequencing technology, numerous RNA sequences are being discovered without knowledge of their binding partner RBPs. Demand for accurate prediction methods for RBP binding sites is therefore increasing. Many attempts have been made to predict RBP binding sites using various machine-learning techniques combined with various RNA features. In this work, we present a new deep convolutional neural network model trained on CLIP-seq datasets using multi-sized filters and multi-modal features to predict the binding property of RBPs. With this model, we integrated sequence and structure information to extract sequence motifs, structure motifs, and combined motifs at the same time. RBP binding site prediction on the RBP-24 dataset was compared with two multi-modal methods, GraphProt and Deepnet-rbp, using the area under the curve (AUC) of the receiver operating characteristic (ROC). Our method (average AUC = 0.920) outperformed GraphProt (average AUC = 0.888) on 20 RBPs and Deepnet-rbp (average AUC = 0.902) on 15 RBPs. The improvement was achieved by using multi-sized convolution filters, for which the average relative error reduction was 17%. Introducing a new RNA structure representation, the structure probability matrix, reduced the average relative error by 3% compared to the one-hot encoded secondary structure representation. Interestingly, the structure probability matrix was more effective on ALKBH5, where the relative error reduction was 30%. We developed a new sequence motif enrichment method, which we call the response enrichment method. We successfully enriched sequence motifs for 12 RBPs, which closely resembled evidence from the literature, RBPgroup and CISBP-RNA. Finally, by analyzing these results together, we found an intricate interplay between sequence motifs and structure motifs, which agrees with other research.
Affiliation(s)
- Taesu Chung
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
| | - Dongsup Kim
- Department of Bio and Brain Engineering, Korea Advanced Institute of Science and Technology, Daejeon, Republic of Korea
| |
|
115
|
Deep learning in bioinformatics: Introduction, application, and perspective in the big data era. Methods 2019; 166:4-21. [PMID: 31022451 DOI: 10.1016/j.ymeth.2019.04.008] [Citation(s) in RCA: 137] [Impact Index Per Article: 22.8] [Received: 11/14/2018] [Revised: 03/23/2019] [Accepted: 04/15/2019] [Indexed: 12/13/2022] Open
Abstract
Deep learning, which is especially formidable in handling big data, has achieved great success in various fields, including bioinformatics. With the advances of the big data era in biology, it is foreseeable that deep learning will become increasingly important in the field and will be incorporated into the vast majority of analysis pipelines. In this review, we provide both an exoteric introduction to deep learning and concrete examples and implementations of its representative applications in bioinformatics. We start from the recent achievements of deep learning in the bioinformatics field, pointing out the problems that are suitable for deep learning. After that, we introduce deep learning in an easy-to-understand fashion, from shallow neural networks to legendary convolutional neural networks, legendary recurrent neural networks, graph neural networks, generative adversarial networks, variational autoencoders, and the most recent state-of-the-art architectures. We then provide eight examples, covering five bioinformatics research directions and all four kinds of data type, with the implementations written in TensorFlow and Keras. Finally, we discuss common issues, such as overfitting and interpretability, that users will encounter when adopting deep learning methods and provide corresponding suggestions. The implementations are freely available at https://github.com/lykaust15/Deep_learning_examples.
|
116
|
Haberal İ, Oğul H. Prediction of Protein Metal Binding Sites Using Deep Neural Networks. Mol Inform 2019; 38:e1800169. [PMID: 30977960 DOI: 10.1002/minf.201800169] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 12/19/2018] [Accepted: 03/29/2019] [Indexed: 11/06/2022]
Abstract
Metals have crucial roles in many physiological, pathological and diagnostic processes. Metal-binding proteins, or metalloproteins, are important for metabolic functions. The three-dimensional structure that a protein reaches by folding indicates which vital function it fulfills. The prediction of metal binding in proteins can be considered a step in function assignment for new proteins; it helps to obtain functional proteins in genomic studies and is critical to protein function annotation and drug discovery. Computational predictions made with machine learning methods from data derived from amino acid sequences are widely used in protein metal-binding prediction and in various bioinformatics fields. In this work, we present three different deep learning architectures for predicting the metal binding of histidine (HIS) and cysteine (CYS) amino acids. These architectures are a 2D Convolutional Neural Network, a Long Short-Term Memory network and a Recurrent Neural Network. They are compared on three different sets of attributes derived from a public dataset of protein sequences. These three feature sets were extracted from the protein sequence using the PAM scoring matrix, a protein composition server, and binary representation methods. The results show that better performance in predicting protein metal-binding sites is obtained with the Convolutional Neural Network architecture.
Affiliation(s)
- İsmail Haberal
- Department of Computer Engineering, Başkent University, Fatih Sultan Mahallesi Eskişehir Yolu 18. km, 06790, Etimesgut, Ankara, Turkey
| | - Hasan Oğul
- Department of Computer Engineering, Başkent University, Fatih Sultan Mahallesi Eskişehir Yolu 18. km, 06790, Etimesgut, Ankara, Turkey
- Faculty of Computer Sciences, Østfold University College, Halden, Norway
| |
|
117
|
Cui Y, Dong Q, Hong D, Wang X. Predicting protein-ligand binding residues with deep convolutional neural networks. BMC Bioinformatics 2019; 20:93. [PMID: 30808287 PMCID: PMC6390579 DOI: 10.1186/s12859-019-2672-1] [Citation(s) in RCA: 47] [Impact Index Per Article: 7.8] [Received: 11/12/2018] [Accepted: 02/07/2019] [Indexed: 02/01/2023] Open
Abstract
Background Ligand-binding proteins play key roles in many biological processes. Identification of protein-ligand binding residues is important in understanding the biological functions of proteins. Existing computational methods can be roughly categorized as sequence-based or 3D-structure-based methods. All these methods are based on traditional machine learning. In a series of binding residue prediction tasks, 3D-structure-based methods are widely superior to sequence-based methods. However, due to the great number of proteins with known amino acid sequences, sequence-based methods have considerable room for improvement with the development of deep learning. Therefore, prediction of protein-ligand binding residues with deep learning requires study. Results In this study, we propose a new sequence-based approach called DeepCSeqSite for ab initio protein-ligand binding residue prediction. DeepCSeqSite includes a standard edition and an enhanced edition. The classifier of DeepCSeqSite is based on a deep convolutional neural network. Several convolutional layers are stacked on top of each other to extract hierarchical features. The size of the effective context scope is expanded as the number of convolutional layers increases. The long-distance dependencies between residues can be captured by the large effective context scope, and stacking several layers enables the maximum length of dependencies to be precisely controlled. The extracted features are ultimately combined through one-by-one convolution kernels and softmax to predict whether the residues are binding residues. The state-of-the-art ligand-binding method COACH and some of its submethods are selected as baselines. The methods are tested on a set of 151 nonredundant proteins and three extended test sets. Experiments show that the improvement of the Matthews correlation coefficient (MCC) is no less than 0.05. 
In addition, a training data augmentation method that slightly improves the performance is discussed in this study. Conclusions Without using any templates that include 3D-structure data, DeepCSeqSite significantly outperforms existing sequence-based and 3D-structure-based methods, including COACH. Augmentation of the training sets slightly improves the performance. The model, code and datasets are available at https://github.com/yfCuiFaith/DeepCSeqSite.
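The statement that the effective context scope grows with the number of stacked convolutional layers can be made concrete with a small calculation. The sketch below assumes stride-1 1D convolutions with no pooling; the kernel sizes are hypothetical, since the abstract does not give DeepCSeqSite's actual layer configuration.

```python
def receptive_field(kernel_sizes, dilations=None):
    """Number of input positions seen by one output position.

    Assumes stride-1 1D convolutions stacked without pooling, so each
    layer adds (kernel_size - 1) * dilation positions to the context.
    """
    if dilations is None:
        dilations = [1] * len(kernel_sizes)
    if len(dilations) != len(kernel_sizes):
        raise ValueError("kernel_sizes and dilations must have equal length")
    field = 1
    for kernel, dilation in zip(kernel_sizes, dilations):
        field += (kernel - 1) * dilation
    return field


# Hypothetical stack: each extra layer of width 5 widens the context by 4.
for depth in range(1, 5):
    print(depth, receptive_field([5] * depth))
```

Under these assumptions the context grows linearly with depth, which is why the maximum dependency length can be set by choosing the number of layers.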
Affiliation(s)
- Yifeng Cui
- Faculty of Education, East China Normal University, 3663 N. Zhongshan Rd., Shanghai, 200062, China
- School of Data Science & Engineering, East China Normal University, 3663 N. Zhongshan Rd., Shanghai, 200062, China
| | - Qiwen Dong
- Faculty of Education, East China Normal University, 3663 N. Zhongshan Rd., Shanghai, 200062, China
- School of Data Science & Engineering, East China Normal University, 3663 N. Zhongshan Rd., Shanghai, 200062, China
| | - Daocheng Hong
- School of Data Science & Engineering, East China Normal University, 3663 N. Zhongshan Rd., Shanghai, 200062, China
| | - Xikun Wang
- The High School Affiliated of Liaoning Normal University, Dalian, China
| |
|
118
|
Zhang SY, Zhang SW, Fan XN, Meng J, Chen Y, Gao SJ, Huang Y. Global analysis of N6-methyladenosine functions and its disease association using deep learning and network-based methods. PLoS Comput Biol 2019; 15:e1006663. [PMID: 30601803 PMCID: PMC6331136 DOI: 10.1371/journal.pcbi.1006663] [Citation(s) in RCA: 35] [Impact Index Per Article: 5.8] [Received: 06/01/2018] [Revised: 01/14/2019] [Accepted: 11/21/2018] [Indexed: 02/03/2023] Open
Abstract
N6-methyladenosine (m6A) is the most abundant methylation, existing in >25% of human mRNAs. Exciting recent discoveries indicate the close involvement of m6A in regulating many different aspects of mRNA metabolism and diseases like cancer. However, our current knowledge about how m6A levels are controlled, and whether and how regulation of the m6A levels of a specific gene can play a role in cancer and other diseases, is mostly elusive. We propose in this paper a computational scheme for predicting m6A-regulated genes and m6A-associated diseases. It includes Deep-m6A, the first model for detecting condition-specific m6A sites from MeRIP-Seq data at single-base resolution using deep learning, and Hot-m6A, a new network-based pipeline that prioritizes functionally significant m6A genes and their associated diseases using Protein-Protein Interaction (PPI) and gene-disease heterogeneous networks. We applied Deep-m6A and this pipeline to 75 MeRIP-seq human samples, which produced a compact set of 709 functionally significant m6A-regulated genes and nine functionally enriched subnetworks. The functional enrichment analysis of these genes and networks reveals that m6A targets key genes of many critical biological processes, including transcription, cell organization and transport, and cell proliferation, and cancer-related pathways such as the Wnt pathway. The m6A-associated disease analysis prioritized five significantly associated diseases, including leukemia and renal cell carcinoma. These results demonstrate the power of our proposed computational scheme and provide new leads for understanding m6A regulatory functions and its roles in diseases.
Affiliation(s)
- Song-Yao Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi’an, China
- Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, Texas, United States of America
| | - Shao-Wu Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi’an, China
| | - Xiao-Nan Fan
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi’an, China
| | - Jia Meng
- Department of Biological Sciences, HRINU, SUERI, Xi’an Jiaotong-Liverpool University, Suzhou, Jiangsu, China
| | - Yidong Chen
- Department of Epidemiology and Biostatistics, University of Texas Health San Antonio, San Antonio, Texas, United States of America
| | - Shou-Jiang Gao
- UPMC Hillman Cancer Center and Department of Microbiology and Molecular Genetics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States of America
| | - Yufei Huang
- Department of Electrical and Computer Engineering, University of Texas at San Antonio, San Antonio, Texas, United States of America
- Department of Epidemiology and Biostatistics, University of Texas Health San Antonio, San Antonio, Texas, United States of America
| |
|
119
|
Wang F, Chainani P, White T, Yang J, Liu Y, Soibam B. Deep learning identifies genome-wide DNA binding sites of long noncoding RNAs. RNA Biol 2018; 15:1468-1476. [PMID: 30486737 PMCID: PMC6333433 DOI: 10.1080/15476286.2018.1551704] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Indexed: 02/05/2023] Open
Abstract
Long noncoding RNAs (lncRNAs) can exert their function by interacting with DNA via triplex structure formation. Even though this has been validated in a handful of experiments, a genome-wide analysis of lncRNA-DNA binding is needed. In this paper, we develop and interpret deep learning models that predict the genome-wide binding sites deciphered by ChIRP-Seq experiments for 12 different lncRNAs. Among the several deep learning architectures tested, a simple architecture consisting of two convolutional neural network layers performed the best, suggesting that local sequence patterns are determinants of the interaction. Further interpretation of the kernels in the model revealed that these local sequence patterns form triplex structures with the corresponding lncRNAs. We uncovered several novel triplex-forming domains (TFDs) of these 12 lncRNAs as well as previously experimentally verified TFDs of the lncRNAs HOTAIR and MEG3. Using electrophoretic mobility shift assays, we experimentally verified two such novel TFDs, of the lncRNAs HOTAIR and TUG1, that were predicted by our method but previously unreported. In conclusion, we show that a simple deep learning architecture can accurately predict genome-wide binding sites of lncRNAs, and interpretation of the models suggests RNA:DNA:DNA triplex formation as a viable mechanism underlying lncRNA-DNA interactions at the genome-wide level.
Affiliation(s)
- Fan Wang
- Department of Oncology, The First Affiliated Hospital of Xian Jiaotong University, Xi'an, P.R. China
- Department of Biology and Biochemistry, University of Houston, Houston, TX, USA
| | - Pranik Chainani
- Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX, USA
| | - Tommy White
- Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX, USA
| | - Jin Yang
- Department of Oncology, The First Affiliated Hospital of Xian Jiaotong University, Xi'an, P.R. China
| | - Yu Liu
- Department of Biology and Biochemistry, University of Houston, Houston, TX, USA
| | - Benjamin Soibam
- Computer Science and Engineering Technology, University of Houston-Downtown, Houston, TX, USA
| |
|
120
|
Deep-RBPPred: Predicting RNA binding proteins in the proteome scale based on deep learning. Sci Rep 2018; 8:15264. [PMID: 30323214 PMCID: PMC6189057 DOI: 10.1038/s41598-018-33654-x] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Received: 05/17/2018] [Accepted: 09/28/2018] [Indexed: 12/22/2022] Open
Abstract
RNA-binding proteins (RBPs) play an important role in cellular processes. Identifying RBPs by both computation and experiment is essential. Recently, our group proposed an RBP predictor, RBPPred. However, RBPPred is slow because it needs to generate a PSSM matrix as its feature. Herein, based on the protein features of RBPPred and a Convolutional Neural Network (CNN), we develop a deep learning model called Deep-RBPPred. With balanced and imbalanced training sets, we obtain the Deep-RBPPred-balance and Deep-RBPPred-imbalance models. Deep-RBPPred has three advantages compared to previous methods. (1) Deep-RBPPred needs only a few physicochemical properties based on protein sequences. (2) Deep-RBPPred runs much faster. (3) Deep-RBPPred has good generalization ability. At the same time, Deep-RBPPred is still as good as the state-of-the-art method. When tested on the A. thaliana, S. cerevisiae and H. sapiens proteomes, the MCC values are 0.82 (0.82), 0.65 (0.69) and 0.85 (0.80), respectively, for the balance model (imbalance model) when the score cutoff is set to 0.5. On the same testing dataset, different machine learning algorithms (CNN and SVM) are also compared. The results show that the CNN-based model can identify more RBPs than the SVM-based model. In comparing the balance and imbalance models, both the CNN-based and SVM-based models tend to favor the majority class in the imbalance set. Deep-RBPPred predicts 280 (balance model) and 265 (imbalance model) of 299 new RBPs. The sensitivity of the balance model is about 7% higher than that of the state-of-the-art method. We also apply Deep-RBPPred to 30 eukaryotic and 109 bacterial proteomes downloaded from UniProt to estimate all possible RBPs. The estimates show that the rates of RBPs in eukaryotic proteomes are much higher than in bacterial proteomes.
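The MCC values reported at a score cutoff of 0.5 follow from the standard Matthews correlation coefficient formula applied to thresholded predictions. The sketch below is a generic illustration, not the authors' evaluation code: the function name `mcc_at_cutoff`, the `>=` comparison at the cutoff, and the convention of returning 0.0 when the denominator is zero are assumptions.

```python
import math


def mcc_at_cutoff(scores, labels, cutoff=0.5):
    """Matthews correlation coefficient after thresholding scores.

    labels are 1 (positive) or 0 (negative); a score >= cutoff counts
    as a positive prediction (assumed convention).
    """
    tp = tn = fp = fn = 0
    for score, label in zip(scores, labels):
        predicted = 1 if score >= cutoff else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 0 and label == 0:
            tn += 1
        elif predicted == 1 and label == 0:
            fp += 1
        else:
            fn += 1
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0  # common convention when a row or column is empty
    return (tp * tn - fp * fn) / denominator


# Toy scores and labels, invented for illustration only.
print(mcc_at_cutoff([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

Because MCC uses all four cells of the confusion matrix, it is less easily inflated by a majority class than plain accuracy, which matters when comparing models trained on balanced and imbalanced sets.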
|
121
|
Pan X, Rijnbeek P, Yan J, Shen HB. Prediction of RNA-protein sequence and structure binding preferences using deep convolutional and recurrent neural networks. BMC Genomics 2018; 19:511. [PMID: 29970003 PMCID: PMC6029131 DOI: 10.1186/s12864-018-4889-1] [Citation(s) in RCA: 151] [Impact Index Per Article: 21.6] [Received: 09/27/2017] [Accepted: 06/19/2018] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND RNA regulation is significantly dependent on its binding protein partners, known as RNA-binding proteins (RBPs). Unfortunately, the binding preferences of most RBPs are still not well characterized. Interdependencies between sequence and secondary structure specificities are challenging for both predicting RBP binding sites and accurately detecting sequence and structure motifs. RESULTS In this study, we propose a deep learning-based method, iDeepS, to simultaneously identify the binding sequence and structure motifs from RNA sequences using convolutional neural networks (CNNs) and a bidirectional long short-term memory network (BLSTM). We first perform one-hot encoding of both the sequence and the predicted secondary structure to enable subsequent convolution operations. To reveal the hidden binding knowledge in the observed sequences, the CNNs are applied to learn abstract features. Considering the close relationship between sequence and predicted structures, we use the BLSTM to capture possible long-range dependencies between the binding sequence and structure motifs identified by the CNNs. Finally, the learned weighted representations are fed into a classification layer to predict the RBP binding sites. We evaluated iDeepS on verified RBP binding sites derived from large-scale representative CLIP-seq datasets. The results demonstrate that iDeepS can reliably predict RBP binding sites on RNAs and outperforms the state-of-the-art methods. An important advantage over other methods is that iDeepS can automatically extract both binding sequence and structure motifs, which will improve our understanding of the mechanisms of binding specificities of RBPs. CONCLUSION Our study shows that the iDeepS method identifies the sequence and structure motifs to accurately predict RBP binding sites. iDeepS is available at https://github.com/xypan1232/iDeepS.
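The one-hot encoding step described in this abstract can be sketched as follows. This is an illustration under stated assumptions rather than the iDeepS code: the helper name `one_hot`, the dot-bracket structure alphabet `(.)`, and the choice to map unknown symbols to all-zero rows are hypothetical, and the structure annotation alphabet actually used by iDeepS may differ.

```python
def one_hot(seq, alphabet):
    """Encode a string as a list of one-hot rows, one row per position.

    Symbols outside the alphabet become all-zero rows (an assumed
    convention; other tools use a uniform row instead).
    """
    index = {symbol: i for i, symbol in enumerate(alphabet)}
    matrix = []
    for symbol in seq:
        row = [0] * len(alphabet)
        if symbol in index:
            row[index[symbol]] = 1
        matrix.append(row)
    return matrix


# Sequence and predicted structure are encoded separately, position by position.
sequence_matrix = one_hot("GGACUUCC", "ACGU")
structure_matrix = one_hot("((....))", "(.)")  # hypothetical dot-bracket input
```

Each matrix has one row per nucleotide, so a 1D convolution can slide along the position axis and treat the alphabet dimension as input channels.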
Affiliation(s)
- Xiaoyong Pan
- Department of Medical Informatics, Erasmus Medical Center, Rotterdam, The Netherlands
| | - Peter Rijnbeek
- Department of Medical Informatics, Erasmus Medical Center, Rotterdam, The Netherlands
| | - Junchi Yan
- Institute of Software Engineering, East China Normal University, Shanghai, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
| |
|