51
|
Niu M, Zou Q, Lin C. CRBPDL: Identification of circRNA-RBP interaction sites using an ensemble neural network approach. PLoS Comput Biol 2022; 18:e1009798. [PMID: 35051187 PMCID: PMC8806072 DOI: 10.1371/journal.pcbi.1009798] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2021] [Revised: 02/01/2022] [Accepted: 01/02/2022] [Indexed: 02/06/2023] Open
Abstract
Circular RNAs (circRNAs) are non-coding RNAs with a special circular structure produced formed by the reverse splicing mechanism. Increasing evidence shows that circular RNAs can directly bind to RNA-binding proteins (RBP) and play an important role in a variety of biological activities. The interactions between circRNAs and RBPs are key to comprehending the mechanism of posttranscriptional regulation. Accurately identifying binding sites is very useful for analyzing interactions. In past research, some predictors on the basis of machine learning (ML) have been presented, but prediction accuracy still needs to be ameliorated. Therefore, we present a novel calculation model, CRBPDL, which uses an Adaboost integrated deep hierarchical network to identify the binding sites of circular RNA-RBP. CRBPDL combines five different feature encoding schemes to encode the original RNA sequence, uses deep multiscale residual networks (MSRN) and bidirectional gating recurrent units (BiGRUs) to effectively learn high-level feature representations, it is sufficient to extract local and global context information at the same time. Additionally, a self-attention mechanism is employed to train the robustness of the CRBPDL. Ultimately, the Adaboost algorithm is applied to integrate deep learning (DL) model to improve prediction performance and reliability of the model. To verify the usefulness of CRBPDL, we compared the efficiency with state-of-the-art methods on 37 circular RNA data sets and 31 linear RNA data sets. Moreover, results display that CRBPDL is capable of performing universal, reliable, and robust. The code and data sets are obtainable at https://github.com/nmt315320/CRBPDL.git. More and more evidences show that circular RNA can directly bind to proteins and participate in countless different biological processes. The calculation method can quickly and accurately predict the binding site of circular RNA and RBP. In order to identify the interaction of circRNA with 37 different types of circRNA binding proteins, we developed an integrated deep learning network based on hierarchical network, called CRBPDL. It can effectively learn high-level feature representations. The performance of the model was verified through comparative experiments of different feature extraction algorithms, different deep learning models and classifier models. Moreover, the CRBPDL model was applied to 31 linear RNAs, and the effectiveness of our method was proved by comparison with the results of current excellent algorithms. It is expected that the CRBPDL model can effectively predict the binding site of circular RNA-RBP and provide reliable candidates for further biological experiments.
Collapse
Affiliation(s)
- Mengting Niu
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
| | - Chen Lin
- School of Informatics, Xiamen University, Xiamen, China
- * E-mail:
| |
Collapse
|
52
|
3D Modeling of Non-coding RNA Interactions. ADVANCES IN EXPERIMENTAL MEDICINE AND BIOLOGY 2022; 1385:281-317. [DOI: 10.1007/978-3-031-08356-3_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/10/2022]
|
53
|
Marques-Pereira C, Pires M, Moreira IS. Discovery of Virus-Host interactions using bioinformatic tools. Methods Cell Biol 2022; 169:169-198. [PMID: 35623701 DOI: 10.1016/bs.mcb.2022.02.001] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/18/2022]
|
54
|
|
55
|
Sivangi KB, Dasari CM, Amilpur S, Bhukya R. NoAS-DS: Neural optimal architecture search for detection of diverse DNA signals. Neural Netw 2021; 147:63-71. [PMID: 34979461 DOI: 10.1016/j.neunet.2021.12.009] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2021] [Revised: 11/22/2021] [Accepted: 12/17/2021] [Indexed: 01/05/2023]
Abstract
Neural network architectures are high-performing variable models that can solve many learning tasks. Designing architectures manually require substantial time and also prior knowledge and expertise to develop a high-accuracy model. Most of the architecture search methods are developed over the task of image classification resulting in the building of complex architectures intended for large data inputs such as images. Motivated by the applications of DNA computing in Neural Architecture Search (NAS), we propose NoAS-DS which is specifically built for the architecture search of sequence-based classification tasks. Furthermore, NoAS-DS is applied to the task of predicting binding sites. Unlike other methods that implement only Convolution layers, NoAS-DS, specifically combines Convolution and LSTM layers that helps in the process of automatic architecture building. This hybrid approach helped in achieving high accuracy results on TFBS and RBP datasets which outperformed other models in TF-DNA binding prediction tasks. The best architectures generated by the proposed model can be applied to other DNA datasets of similar nature using transfer learning technique that demonstrates its generalization capability. This greatly reduces the effort required to build new architectures for other prediction tasks.
Collapse
Affiliation(s)
| | | | - Santhosh Amilpur
- National Institute of Technology (NIT), Warangal, Telangana, 506004, India
| | - Raju Bhukya
- National Institute of Technology (NIT), Warangal, Telangana, 506004, India
| |
Collapse
|
56
|
RBPSpot: Learning on appropriate contextual information for RBP binding sites discovery. iScience 2021; 24:103381. [PMID: 34841226 PMCID: PMC8605353 DOI: 10.1016/j.isci.2021.103381] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2021] [Revised: 09/01/2021] [Accepted: 10/27/2021] [Indexed: 11/29/2022] Open
Abstract
Identifying the factors determining the RBP-RNA interactions remains a big challenge. It involves sparse binding motifs and a suitable sequence context for binding. The present work describes an approach to detect RBP binding sites in RNAs using an ultra-fast inexact k-mers search for statistically significant seeds. The seeds work as an anchor to evaluate the context and binding potential using flanking region information while leveraging from Deep Feed-forward Neural Network. The developed models also received support from MD-simulation studies. The implemented software, RBPSpot, scored consistently high for all the performance metrics including average accuracy of ∼90% across a large number of validated datasets. It outperformed the compared tools, including some with much complex deep-learning models, during a comprehensive benchmarking process. RBPSpot can identify RBP binding sites in the human system and can also be used to develop new models, making it a valuable resource in the area of regulatory system studies. Efficient motif anchoring helps to get good quality contextual information on binding Realistic and high granularity datasets ensure better performance of the classifiers DNN models on the contextual features outperform more complex deep learning tools RBPSpot algorithm may be used to develop RBP binding models for other species also
Collapse
|
57
|
Li X, Zhang S, Wong KC. Multiobjective Genome-Wide RNA-Binding Event Identification From CLIP-Seq Data. IEEE TRANSACTIONS ON CYBERNETICS 2021; 51:5811-5824. [PMID: 31940583 DOI: 10.1109/tcyb.2019.2960515] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
RNA-binding proteins (RBPs) are the master regulators of mRNA processing, which are vital players for the post-transcriptional control of gene expression. In recent years, crosslinking immunoprecipitation sequencing (CLIP-seq) technologies have enabled us to sequence massive amounts of genome-wide RNA-binding event data. Its increasing availability provides opportunities to identify protein-RNA interactions on a genome-wide scale. Genome-wide RNA-binding event detection methods have been developed to the understanding of the proteins' functions within cellular processes. Unfortunately, those methods often suffer from realistic restrictions, such as high costs, intensive computation, high dimensionality, numerical instability, and data sparsity. We present a computational method [multiobjective forest algorithm (MFA)] to identify protein-RNA interactions from CLIP-seq data by synergizing multiobjective biogeography-based optimization (BBO) with random forest (RF). Since most of the tree-structured classifiers in RF are unnecessarily bulky with extra time costs and memory consumption, multiobjective BBO is designed to prune the unsuitable tree-structured classifiers dynamically. Moreover, to direct the evolution dynamics of the MFA, two objective functions are formulated to balance model generality and complexity for robust performance. To validate our MFA method, we compare its performance across 31 large-scale CLIP-seq datasets. The experimental results demonstrate that MFA can obtain superior performance over the current state-of-the-art methods. Mechanistic insights are also revealed and discussed to explore the multifaceted aspects of MFA through data source importance analysis, matrix rank estimations, seeding component perturbations, and multiobjective optimization methodology comparisons.
Collapse
|
58
|
Li G, Iyer B, Prasath VBS, Ni Y, Salomonis N. DeepImmuno: deep learning-empowered prediction and generation of immunogenic peptides for T-cell immunity. Brief Bioinform 2021; 22:bbab160. [PMID: 34009266 PMCID: PMC8135853 DOI: 10.1093/bib/bbab160] [Citation(s) in RCA: 56] [Impact Index Per Article: 14.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/24/2020] [Revised: 03/26/2021] [Accepted: 04/05/2021] [Indexed: 02/07/2023] Open
Abstract
Cytolytic T-cells play an essential role in the adaptive immune system by seeking out, binding and killing cells that present foreign antigens on their surface. An improved understanding of T-cell immunity will greatly aid in the development of new cancer immunotherapies and vaccines for life-threatening pathogens. Central to the design of such targeted therapies are computational methods to predict non-native peptides to elicit a T-cell response, however, we currently lack accurate immunogenicity inference methods. Another challenge is the ability to accurately simulate immunogenic peptides for specific human leukocyte antigen alleles, for both synthetic biological applications, and to augment real training datasets. Here, we propose a beta-binomial distribution approach to derive peptide immunogenic potential from sequence alone. We conducted systematic benchmarking of five traditional machine learning (ElasticNet, K-nearest neighbors, support vector machine, Random Forest and AdaBoost) and three deep learning models (convolutional neural network (CNN), Residual Net and graph neural network) using three independent prior validated immunogenic peptide collections (dengue virus, cancer neoantigen and SARS-CoV-2). We chose the CNN as the best prediction model, based on its adaptivity for small and large datasets and performance relative to existing methods. In addition to outperforming two highly used immunogenicity prediction algorithms, DeepImmuno-CNN correctly predicts which residues are most important for T-cell antigen recognition and predicts novel impacts of SARS-CoV-2 variants. Our independent generative adversarial network (GAN) approach, DeepImmuno-GAN, was further able to accurately simulate immunogenic peptides with physicochemical properties and immunogenicity predictions similar to that of real antigens. We provide DeepImmuno-CNN as source code and an easy-to-use web interface.
Collapse
Affiliation(s)
- Guangyuan Li
- University of Cincinnati, 3333 Burnet Ave, MLC7024, Cincinnati, OH 45267, USA
| | | | - V B Surya Prasath
- Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, USA
| | - Yizhao Ni
- Cincinnati Children’s Hospital Medical Center, USA
| | | |
Collapse
|
59
|
Tayara H, Chong KT. Improved Predicting of The Sequence Specificities of RNA Binding Proteins by Deep Learning. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:2526-2534. [PMID: 32191896 DOI: 10.1109/tcbb.2020.2981335] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
RNA-binding proteins (RBPs) have a significant role in various regulatory tasks. However, the mechanism by which RBPs identify the subsequence target RNAs is still not clear. In recent years, several machine and deep learning-based computational models have been proposed for understanding the binding preferences of RBPs. These methods required integrating multiple features with raw RNA sequences such as secondary structure and their performances can be further improved. In this paper, we propose an efficient and simple convolution neural network, RBPCNN, that relies on the combination of the raw RNA sequence and evolutionary information. We show that conservation scores (evolutionary information) for the RNA sequences can significantly improve the overall performance of the proposed predictor. In addition, the automatic extraction of the binding sequence motifs can enhance our understanding of the binding specificities of RBPs. The experimental results show that RBPCNN outperforms significantly the current state-of-the-art methods. More specifically, the average area under the receiver operator curve was improved by 2.67 percent and the mean average precision was improved by 8.03 percent. The datasets and results can be downloaded from https://home.jbnu.ac.kr/NSCL/RBPCNN.htm.
Collapse
|
60
|
Deng Z, Zhang J, Li J, Zhang X. Application of Deep Learning in Plant-Microbiota Association Analysis. Front Genet 2021; 12:697090. [PMID: 34691142 PMCID: PMC8531731 DOI: 10.3389/fgene.2021.697090] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/19/2021] [Accepted: 08/31/2021] [Indexed: 01/04/2023] Open
Abstract
Unraveling the association between microbiome and plant phenotype can illustrate the effect of microbiome on host and then guide the agriculture management. Adequate identification of species and appropriate choice of models are two challenges in microbiome data analysis. Computational models of microbiome data could help in association analysis between the microbiome and plant host. The deep learning methods have been widely used to learn the microbiome data due to their powerful strength of handling the complex, sparse, noisy, and high-dimensional data. Here, we review the analytic strategies in the microbiome data analysis and describe the applications of deep learning models for plant–microbiome correlation studies. We also introduce the application cases of different models in plant–microbiome correlation analysis and discuss how to adapt the models on the critical steps in data processing. From the aspect of data processing manner, model structure, and operating principle, most deep learning models are suitable for the plant microbiome data analysis. The ability of feature representation and pattern recognition is the advantage of deep learning methods in modeling and interpretation for association analysis. Based on published computational experiments, the convolutional neural network and graph neural networks could be recommended for plant microbiome analysis.
Collapse
Affiliation(s)
- Zhiyu Deng
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China.,Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China.,University of Chinese Academy of Sciences, Beijing, China
| | - Jinming Zhang
- Department of Infectious Diseases, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China
| | - Junya Li
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China.,Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China.,University of Chinese Academy of Sciences, Beijing, China
| | - Xiujun Zhang
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China.,Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China
| |
Collapse
|
61
|
Busa VF, Favorov AV, Fertig EJ, Leung AK. Spatial correlation statistics enable transcriptome-wide characterization of RNA structure binding. CELL REPORTS METHODS 2021; 1:100088. [PMID: 35474897 PMCID: PMC9017189 DOI: 10.1016/j.crmeth.2021.100088] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 12/15/2020] [Revised: 06/23/2021] [Accepted: 08/30/2021] [Indexed: 11/20/2022]
Abstract
Molecular interactions at identical transcriptomic locations or at proximal but non-overlapping sites can mediate RNA modification and regulation, necessitating tools to uncover these spatial relationships. We present nearBynding, a flexible algorithm and software pipeline that models spatial correlation between transcriptome-wide tracks from diverse data types. nearBynding can process and correlate interval as well as continuous data and incorporate experimentally derived or in silico predicted transcriptomic tracks. nearBynding offers visualization functions for its statistics to identify colocalizations and adjacent features. We demonstrate the application of nearBynding to correlate RNA-binding protein (RBP) binding preferences with other RBPs, RNA structure, or RNA modification. By cross-correlating RBP binding and RNA structure data, we demonstrate that nearBynding recapitulates known RBP binding to structural motifs and provides biological insights into RBP binding preference of G-quadruplexes. nearBynding is available as an R/Bioconductor package and can run on a personal computer, making correlation of transcriptomic features broadly accessible.
Collapse
Affiliation(s)
- Veronica F. Busa
- McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Biochemistry and Molecular Biology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205, USA
| | - Alexander V. Favorov
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Laboratory of Systems Biology and Computational Genetics, Vavilov Institute of General Genetics, Russian Academy of Sciences, Moscow, Russia
| | - Elana J. Fertig
- McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Biomedical Engineering, Johns Hopkins University Whiting School of Engineering, Baltimore, MD 21205, USA
- Department of Applied Mathematics and Statistics, Johns Hopkins University Whiting School of Engineering, Baltimore, MD 21205, USA
| | - Anthony K.L. Leung
- McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Biochemistry and Molecular Biology, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205, USA
- Department of Molecular Biology and Genetics, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Oncology, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
| |
Collapse
|
62
|
LGFC-CNN: Prediction of lncRNA-Protein Interactions by Using Multiple Types of Features through Deep Learning. Genes (Basel) 2021; 12:genes12111689. [PMID: 34828296 PMCID: PMC8621699 DOI: 10.3390/genes12111689] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2021] [Revised: 10/11/2021] [Accepted: 10/22/2021] [Indexed: 12/12/2022] Open
Abstract
Long noncoding RNA (lncRNA) plays a crucial role in many critical biological processes and participates in complex human diseases through interaction with proteins. Considering that identifying lncRNA–protein interactions through experimental methods is expensive and time-consuming, we propose a novel method based on deep learning that combines raw sequence composition features, hand-designed features and structure features, called LGFC-CNN, to predict lncRNA–protein interactions. The two sequence preprocessing methods and CNN modules (GloCNN and LocCNN) are utilized to extract the raw sequence global and local features. Meanwhile, we select hand-designed features by comparing the predictive effect of different lncRNA and protein features combinations. Furthermore, we obtain the structure features and unifying the dimensions through Fourier transform. In the end, the four types of features are integrated to comprehensively predict the lncRNA–protein interactions. Compared with other state-of-the-art methods on three lncRNA–protein interaction datasets, LGFC-CNN achieves the best performance with an accuracy of 94.14%, on RPI21850; an accuracy of 92.94%, on RPI7317; and an accuracy of 98.19% on RPI1847. The results show that our LGFC-CNN can effectively predict the lncRNA–protein interactions by combining raw sequence composition features, hand-designed features and structure features.
Collapse
|
63
|
Xu W, Gao Y, Wang Y, Guan J. Protein-protein interaction prediction based on ordinal regression and recurrent convolutional neural networks. BMC Bioinformatics 2021; 22:485. [PMID: 34625020 PMCID: PMC8501564 DOI: 10.1186/s12859-021-04369-0] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/08/2021] [Accepted: 09/02/2021] [Indexed: 11/10/2022] Open
Abstract
Background Protein protein interactions (PPIs) are essential to most of the biological processes. The prediction of PPIs is beneficial to the understanding of protein functions and thus is helpful to pathological analysis, disease diagnosis and drug design etc. As the amount of protein data is growing fast in the post genomic era, high-throughput experimental methods are expensive and time-consuming for the prediction of PPIs. Thus, computational methods have attracted researcher’s attention in recent years. A large number of computational methods have been proposed based on different protein sequence encoders. Results Notably, the confidence score of a protein sequence pair could be regarded as a kind of measurement to PPIs. The higher the confidence score for one protein pair is, the more likely the protein pair interacts. Thus in this paper, a deep learning framework, called ordinal regression and recurrent convolutional neural network (OR-RCNN) method, is introduced to predict PPIs from the perspective of confidence score. It mainly contains two parts: the encoder part of protein sequence pair and the prediction part of PPIs by confidence score. In the first part, two recurrent convolutional neural networks (RCNNs) with shared parameters are applied to construct two protein sequence embedding vectors, which can automatically extract robust local features and sequential information from the protein pairs. Based on it, the two embedding vectors are encoded into one novel embedding vector by element-wise multiplication. By taking the ordinal information behind confidence score into consideration, ordinal regression is used to construct multiple sub-classifiers in the second part. The results of multiple sub-classifiers are aggregated to obtain the final confidence score. Following that, the existence of PPIs is determined by the confidence score. We set a threshold \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\theta$$\end{document}θ, and say the interaction exists between the protein pair if its confidence score is bigger than \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\theta$$\end{document}θ. Conclusions We applied our method to predict PPIs on data sets S. cerevisiae and Homo sapiens. Through experimental verification, our method outperforms state-of-the-art PPI prediction models.
Collapse
Affiliation(s)
- Weixia Xu
- School of Information Management, Shanghai Lixin University of Accounting and Finance, No. 995 Shangchuan Road, Shanghai, 201209, China
| | - Yangyun Gao
- Shanghai Key Laboratory of Intelligent Information Processing, and School of Computer Science, Fudan University, No. 220 Handan Road, Shanghai, 200433, China
| | - Yang Wang
- Department of Computer Science and Technology, Tongji University, No. 4800 Caoan Road, Shanghai, 201804, China
| | - Jihong Guan
- Department of Computer Science and Technology, Tongji University, No. 4800 Caoan Road, Shanghai, 201804, China.
| |
Collapse
|
64
|
Li H, Deng Z, Yang H, Pan X, Wei Z, Shen HB, Choi KS, Wang L, Wang S, Wu J. circRNA-binding protein site prediction based on multi-view deep learning, subspace learning and multi-view classifier. Brief Bioinform 2021; 23:6375057. [PMID: 34571539 DOI: 10.1093/bib/bbab394] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/28/2021] [Revised: 08/08/2021] [Accepted: 08/30/2021] [Indexed: 12/22/2022] Open
Abstract
Circular RNAs (circRNAs) generally bind to RNA-binding proteins (RBPs) to play an important role in the regulation of autoimmune diseases. Thus, it is crucial to study the binding sites of RBPs on circRNAs. Although many methods, including traditional machine learning and deep learning, have been developed to predict the interactions between RNAs and RBPs, and most of them are focused on linear RNAs. At present, few studies have been done on the binding relationships between circRNAs and RBPs. Thus, in-depth research is urgently needed. In the existing circRNA-RBP binding site prediction methods, circRNA sequences are the main research subjects, but the relevant characteristics of circRNAs have not been fully exploited, such as the structure and composition information of circRNA sequences. Some methods have extracted different views to construct recognition models, but how to efficiently use the multi-view data to construct recognition models is still not well studied. Considering the above problems, this paper proposes a multi-view classification method called DMSK based on multi-view deep learning, subspace learning and multi-view classifier for the identification of circRNA-RBP interaction sites. In the DMSK method, first, we converted circRNA sequences into pseudo-amino acid sequences and pseudo-dipeptide components for extracting high-dimensional sequence features and component features of circRNAs, respectively. Then, the structure prediction method RNAfold was used to predict the secondary structure of the RNA sequences, and the sequence embedding model was used to extract the context-dependent features. Next, we fed the above four views' raw features to a hybrid network, which is composed of a convolutional neural network and a long short-term memory network, to obtain the deep features of circRNAs. Furthermore, we used view-weighted generalized canonical correlation analysis to extract four views' common features by subspace learning. Finally, the learned subspace common features and multi-view deep features were fed to train the downstream multi-view TSK fuzzy system to construct a fuzzy rule and fuzzy inference-based multi-view classifier. The trained classifier was used to predict the specific positions of the RBP binding sites on the circRNAs. The experiments show that the prediction performance of the proposed method DMSK has been improved compared with the existing methods. The code and dataset of this study are available at https://github.com/Rebecca3150/DMSK.
Collapse
Affiliation(s)
- Hui Li
- Jiangnan University, Wuxi, Jiangsu 214012, China
| | - Zhaohong Deng
- School of Artificial Intelligence and Computer Science of Jiangnan University, Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (LCNBI) and ZJLab, Wuxi, Jiangsu 214012, China
| | - Haitao Yang
- Jiangnan University, Wuxi, Jiangsu 214012, China
| | - Xiaoyong Pan
- Department of Automation of Shanghai Jiao Tong University, Wuxi, Jiangsu 214012, China
| | - Zhisheng Wei
- School of Biotechnology and Key Laboratory of Industrial Biotechnology Ministry in Jiangnan University, Wuxi, Jiangsu 214012, China
| | - Hong-Bin Shen
- Shanghai Jiao Tong University, Wuxi, Jiangsu 214012, China
| | - Kup-Sze Choi
- Hong Kong Polytechnic University, Wuxi, Jiangsu 214012, China
| | - Lei Wang
- School of Biotechnology and Key Laboratory of Industrial Biotechnology Ministry in Jiangnan University, Wuxi, Jiangsu 214012, China
| | - Shitong Wang
- School of Artificial Intelligence and Computer Science of Jiangnan University, Wuxi, Jiangsu 214012, China
| | - Jing Wu
- School of Biotechnology and Key Laboratory of Industrial Biotechnology Ministry in Jiangnan University, Wuxi, Jiangsu 214012, China
| |
Collapse
|
65
|
Abstract
Interpreting the effects of genetic variants is key to understanding individual susceptibility to disease and designing personalized therapeutic approaches. Modern experimental technologies are enabling the generation of massive compendia of human genome sequence data and associated molecular and phenotypic traits, together with genome-scale expression, epigenomics and other functional genomic data. Integrative computational models can leverage these data to understand variant impact, elucidate the effect of dysregulated genes on biological pathways in specific disease and tissue contexts, and interpret disease risk beyond what is feasible with experiments alone. In this Review, we discuss recent developments in machine learning algorithms for genome interpretation and for integrative molecular-level modelling of cells, tissues and organs relevant to disease. More specifically, we highlight existing methods and key challenges and opportunities in identifying specific disease-causing genetic variants and linking them to molecular pathways and, ultimately, to disease phenotypes.
Collapse
|
66
|
Vaz JM, Balaji S. Convolutional neural networks (CNNs): concepts and applications in pharmacogenomics. Mol Divers 2021; 25:1569-1584. [PMID: 34031788 PMCID: PMC8342355 DOI: 10.1007/s11030-021-10225-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2021] [Accepted: 04/21/2021] [Indexed: 12/17/2022]
Abstract
Convolutional neural networks (CNNs) have been used to extract information from various datasets of different dimensions. This approach has led to accurate interpretations in several subfields of biological research, like pharmacogenomics, addressing issues previously faced by other computational methods. With the rising attention for personalized and precision medicine, scientists and clinicians have now turned to artificial intelligence systems to provide them with solutions for therapeutics development. CNNs have already provided valuable insights into biological data transformation. Due to the rise of interest in precision and personalized medicine, in this review, we have provided a brief overview of the possibilities of implementing CNNs as an effective tool for analyzing one-dimensional biological data, such as nucleotide and protein sequences, as well as small molecular data, e.g., simplified molecular-input line-entry specification, InChI, binary fingerprints, etc., to categorize the models based on their objective and also highlight various challenges. The review is organized into specific research domains that participate in pharmacogenomics for a more comprehensive understanding. Furthermore, the future intentions of deep learning are outlined.
Collapse
Affiliation(s)
- Joel Markus Vaz
- Department of Biotechnology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, 576104, India
| | - S Balaji
- Department of Biotechnology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, 576104, India.
| |
Collapse
|
67
|
Wu H, Pan X, Yang Y, Shen HB. Recognizing binding sites of poorly characterized RNA-binding proteins on circular RNAs using attention Siamese network. Brief Bioinform 2021; 22:6326526. [PMID: 34297803 DOI: 10.1093/bib/bbab279] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2021] [Revised: 06/04/2021] [Accepted: 07/01/2021] [Indexed: 12/24/2022] Open
Abstract
Circular RNAs (circRNAs) interact with RNA-binding proteins (RBPs) to play crucial roles in gene regulation and disease development. Computational approaches have attracted much attention to quickly predict highly potential RBP binding sites on circRNAs using the sequence or structure statistical binding knowledge. Deep learning is one of the popular learning models in this area but usually requires a lot of labeled training data. It would perform unsatisfactorily for the less characterized RBPs with a limited number of known target circRNAs. How to improve the prediction performance for such small-size labeled characterized RBPs is a challenging task for deep learning-based models. In this study, we propose an RBP-specific method iDeepC for predicting RBP binding sites on circRNAs from sequences. It adopts a Siamese neural network consisting of a lightweight attention module and a metric module. We have found that Siamese neural network effectively enhances the network capability of capturing mutual information between circRNAs with pairwise metric learning. To further deal with the small-sample size problem, we have performed the pretraining using available labeled data from other RBPs and also demonstrate the efficacy of this transfer-learning pipeline. We comprehensively evaluated iDeepC on the benchmark datasets of RBP-binding circRNAs, and the results suggest iDeepC achieving promising results on the poorly characterized RBPs. The source code is available at https://github.com/hehew321/iDeepC.
Collapse
Affiliation(s)
- Hehe Wu
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Yang Yang
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai 200240, China
| |
Collapse
|
68
|
Sohrabi-Jahromi S, Söding J. Thermodynamic modeling reveals widespread multivalent binding by RNA-binding proteins. Bioinformatics 2021; 37:i308-i316. [PMID: 34252974 PMCID: PMC8275352 DOI: 10.1093/bioinformatics/btab300] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Understanding how proteins recognize their RNA targets is essential to elucidate regulatory processes in the cell. Many RNA-binding proteins (RBPs) form complexes or have multiple domains that allow them to bind to RNA in a multivalent, cooperative manner. They can thereby achieve higher specificity and affinity than proteins with a single RNA-binding domain. However, current approaches to de novo discovery of RNA binding motifs do not take multivalent binding into account. RESULTS We present Bipartite Motif Finder (BMF), which is based on a thermodynamic model of RBPs with two cooperatively binding RNA-binding domains. We show that bivalent binding is a common strategy among RBPs, yielding higher affinity and sequence specificity. We furthermore illustrate that the spatial geometry between the binding sites can be learned from bound RNA sequences. These discovered bipartite motifs are consistent with previously known motifs and binding behaviors. Our results demonstrate the importance of multivalent binding for RNA-binding proteins and highlight the value of bipartite motif models in representing the multivalency of protein-RNA interactions. AVAILABILITY AND IMPLEMENTATION BMF source code is available at https://github.com/soedinglab/bipartite_motif_finder under a GPL license. The BMF web server is accessible at https://bmf.soedinglab.org. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Salma Sohrabi-Jahromi
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen 37077, Germany
| | - Johannes Söding
- Quantitative and Computational Biology, Max Planck Institute for Biophysical Chemistry, Göttingen 37077, Germany.,Campus-Institut Data Science (CIDAS), Göttingen 37077, Germany
| |
Collapse
|
69
|
Zhu Q, Fan Y, Pan X. Fusing Multiple Biological Networks to Effectively Predict miRNA-disease Associations. Curr Bioinform 2021. [DOI: 10.2174/1574893615999200715165335] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022]
Abstract
Background:
MicroRNAs (miRNAs) are a class of endogenous non-coding RNAs with
about 22 nucleotides, and they play a significant role in a variety of complex biological processes.
Many researches have shown that miRNAs are closely related to human diseases. Although the
biological experiments are reliable in identifying miRNA-disease associations, they are timeconsuming
and costly.
Objective:
Thus, computational methods are urgently needed to effectively predict miRNA-disease
associations.
Methods:
In this paper, we proposed a novel method, BIRWMDA, based on a bi-random walk
model to predict miRNA-disease associations. Specifically, in BIRWMDA, the similarity network
fusion algorithm is used to combine the multiple similarity matrices to obtain a miRNA-miRNA
similarity matrix and a disease-disease similarity matrix, then the miRNA-disease associations were
predicted by the bi-random walk model.
Results:
To evaluate the performance of BIRWMDA, we ran the leave-one-out cross-validation and
5-fold cross-validation, and their corresponding AUCs were 0.9303 and 0.9223 ± 0.00067,
respectively. To further demonstrate the effectiveness of the BIRWMDA, from the perspective of
exploring disease-related miRNAs, we conducted three case studies of breast neoplasms, prostate
neoplasms and gastric neoplasms, where 48, 50 and 50 out of the top 50 predicted miRNAs were
confirmed by literature, respectively. From the perspective of exploring miRNA-related diseases, we
conducted two case studies of hsa-mir-21 and hsa-mir-155, where 7 and 5 out of the top 10 predicted
diseases were confirmed by literatures, respectively.
Conclusion:
The fusion of multiple biological networks could effectively predict miRNA-diseases
associations. We expected BIRWMDA to serve as a biological tool for mining potential miRNAdisease
associations.
Collapse
Affiliation(s)
- Qingqi Zhu
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, China
| | - Yongxian Fan
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, China
| | - Xiaoyong Pan
- Institute of Image Processing and Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, China
| |
Collapse
|
70
|
Studying RNA-Protein Complexes Using X-Ray Crystallography. Methods Mol Biol 2021; 2263:423-446. [PMID: 33877611 DOI: 10.1007/978-1-0716-1197-5_20] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/14/2023]
Abstract
A wide range of biological processes rely on complexes between ribonucleic acids (RNAs) and proteins. Determining the three-dimensional structures of RNA-protein complexes is crucial to elucidate the relationship between structure and biological function. X-ray crystallography represents the most widely used technique to characterize RNA-protein complexes at atomic resolution; however, determining their three-dimensional structures remains challenging. RNase contamination can ruin crystallization experiments by degrading RNA in complex with protein, leading to sample heterogeneity, and the conformational flexibility inherent in both protein and RNA can limit crystallizability. Furthermore, the three-dimensional structure can be difficult to accurately model at the typical diffraction limit of 2.5 Å resolution or lower for RNA-protein complex crystals. At this resolution, phosphates, which are electron dense, and bases, which are large, rigid, and planar, tend to be well resolved and easy to position in the electron density map, whereas other features, e.g., sugar atoms, can be difficult to accurately position. This chapter focuses on methods that can be used to overcome the unique problems faced when crystallizing RNA-protein complexes and determining their three-dimensional structures using X-ray crystallography.
Collapse
|
71
|
Tang L, Li P, Jang M, Zhu W. Circular RNAs and Cardiovascular Regeneration. Front Cardiovasc Med 2021; 8:672600. [PMID: 33928139 PMCID: PMC8076501 DOI: 10.3389/fcvm.2021.672600] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2021] [Accepted: 03/22/2021] [Indexed: 01/22/2023] Open
Abstract
circular RNAs (circRNAs) are a type of non-coding RNAs that are widely present in eukaryotic cells. They have the characteristics of stable structure, high abundance, and cell or tissue specific expression. circRNAs are single-stranded RNAs that are covalently back spliced to form closed circular loops. They may participate in gene expression and regulation through a variety of action modes. circRNAs can encode proteins or function by acting as miRNA sponges for protein translation. Since 2016, a growing number of research studies have shown that circRNAs play important role in the pathogenesis of cardiovascular disease. With the construction of circRNA database, the differential expression of circRNAs in the heart tissue samples from different species and the gradual elucidation of its mode of action in disease may become an ideal diagnosis biomarker and an effective therapeutic target. What can be expected surely has a broader application prospect. In this review, we summarize recent publications on circRNA biogenesis, expression profiles, functions, and the most recent studies of circRNAs in the field of cardiovascular diseases with special emphasis on cardiac regeneration.
Collapse
Affiliation(s)
- Ling Tang
- Department of Cardiovascular Diseases, Physiology and Biomedical Engineering, Center of Regenerative Medicine, Mayo Clinic, Scottsdale, AZ, United States
| | - Pengsheng Li
- Department of Cardiovascular Diseases, Physiology and Biomedical Engineering, Center of Regenerative Medicine, Mayo Clinic, Scottsdale, AZ, United States
| | - Michelle Jang
- Department of Cardiovascular Diseases, Physiology and Biomedical Engineering, Center of Regenerative Medicine, Mayo Clinic, Scottsdale, AZ, United States
| | - Wuqiang Zhu
- Department of Cardiovascular Diseases, Physiology and Biomedical Engineering, Center of Regenerative Medicine, Mayo Clinic, Scottsdale, AZ, United States
| |
Collapse
|
72
|
Wang J, Zhao Y, Gong W, Liu Y, Wang M, Huang X, Tan J. EDLMFC: an ensemble deep learning framework with multi-scale features combination for ncRNA-protein interaction prediction. BMC Bioinformatics 2021; 22:133. [PMID: 33740884 PMCID: PMC7980572 DOI: 10.1186/s12859-021-04069-9] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/23/2021] [Accepted: 03/05/2021] [Indexed: 11/29/2022] Open
Abstract
Background Non-coding RNA (ncRNA) and protein interactions play essential roles in various physiological and pathological processes. The experimental methods used for predicting ncRNA–protein interactions are time-consuming and labor-intensive. Therefore, there is an increasing demand for computational methods to accurately and efficiently predict ncRNA–protein interactions. Results In this work, we presented an ensemble deep learning-based method, EDLMFC, to predict ncRNA–protein interactions using the combination of multi-scale features, including primary sequence features, secondary structure sequence features, and tertiary structure features. Conjoint k-mer was used to extract protein/ncRNA sequence features, integrating tertiary structure features, then fed into an ensemble deep learning model, which combined convolutional neural network (CNN) to learn dominating biological information with bi-directional long short-term memory network (BLSTM) to capture long-range dependencies among the features identified by the CNN. Compared with other state-of-the-art methods under five-fold cross-validation, EDLMFC shows the best performance with accuracy of 93.8%, 89.7%, and 86.1% on RPI1807, NPInter v2.0, and RPI488 datasets, respectively. The results of the independent test demonstrated that EDLMFC can effectively predict potential ncRNA–protein interactions from different organisms. Furtherly, EDLMFC is also shown to predict hub ncRNAs and proteins presented in ncRNA–protein networks of Mus musculus successfully. Conclusions In general, our proposed method EDLMFC improved the accuracy of ncRNA–protein interaction predictions and anticipated providing some helpful guidance on ncRNA functions research. The source code of EDLMFC and the datasets used in this work are available at https://github.com/JingjingWang-87/EDLMFC. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04069-9.
Collapse
Affiliation(s)
- Jingjing Wang
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
| | - Yanpeng Zhao
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
| | - Weikang Gong
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
| | - Yang Liu
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
| | - Mei Wang
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
| | - Xiaoqian Huang
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China
| | - Jianjun Tan
- Department of Biomedical Engineering, Faculty of Environment and Life, Beijing International Science and Technology Cooperation Base for Intelligent Physiological Measurement and Clinical Transformation, Beijing University of Technology, Beijing, 100124, China.
| |
Collapse
|
73
|
Emami N, Ferdousi R. AptaNet as a deep learning approach for aptamer-protein interaction prediction. Sci Rep 2021; 11:6074. [PMID: 33727685 PMCID: PMC7971039 DOI: 10.1038/s41598-021-85629-0] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/26/2020] [Accepted: 03/03/2021] [Indexed: 02/08/2023] Open
Abstract
Aptamers are short oligonucleotides (DNA/RNA) or peptide molecules that can selectively bind to their specific targets with high specificity and affinity. As a powerful new class of amino acid ligands, aptamers have high potentials in biosensing, therapeutic, and diagnostic fields. Here, we present AptaNet-a new deep neural network-to predict the aptamer-protein interaction pairs by integrating features derived from both aptamers and the target proteins. Aptamers were encoded by using two different strategies, including k-mer and reverse complement k-mer frequency. Amino acid composition (AAC) and pseudo amino acid composition (PseAAC) were applied to represent target information using 24 physicochemical and conformational properties of the proteins. To handle the imbalance problem in the data, we applied a neighborhood cleaning algorithm. The predictor was constructed based on a deep neural network, and optimal features were selected using the random forest algorithm. As a result, 99.79% accuracy was achieved for the training dataset, and 91.38% accuracy was obtained for the testing dataset. AptaNet achieved high performance on our constructed aptamer-protein benchmark dataset. The results indicate that AptaNet can help identify novel aptamer-protein interacting pairs and build more-efficient insights into the relationship between aptamers and proteins. Our benchmark dataset and the source codes for AptaNet are available in: https://github.com/nedaemami/AptaNet .
Collapse
Affiliation(s)
- Neda Emami
- Department of Health Information Technology, School of Management and Medical Informatics, Tabriz University of Medical Sciences, Tabriz, Iran
| | - Reza Ferdousi
- Department of Health Information Technology, School of Management and Medical Informatics, Tabriz University of Medical Sciences, Tabriz, Iran.
- Research Center for Pharmaceutical Nanotechnology, Biomedicine Institute, Tabriz University of Medical Sciences, Tabriz, Iran.
| |
Collapse
|
74
|
Yan Z, Hamilton WL, Blanchette M. Graph neural representational learning of RNA secondary structures for predicting RNA-protein interactions. Bioinformatics 2021; 36:i276-i284. [PMID: 32657407 PMCID: PMC7355240 DOI: 10.1093/bioinformatics/btaa456] [Citation(s) in RCA: 24] [Impact Index Per Article: 6.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/14/2022] Open
Abstract
Motivation RNA-protein interactions are key effectors of post-transcriptional regulation. Significant experimental and bioinformatics efforts have been expended on characterizing protein binding mechanisms on the molecular level, and on highlighting the sequence and structural traits of RNA that impact the binding specificity for different proteins. Yet our ability to predict these interactions in silico remains relatively poor. Results In this study, we introduce RPI-Net, a graph neural network approach for RNA-protein interaction prediction. RPI-Net learns and exploits a graph representation of RNA molecules, yielding significant performance gains over existing state-of-the-art approaches. We also introduce an approach to rectify an important type of sequence bias caused by the RNase T1 enzyme used in many CLIP-Seq experiments, and we show that correcting this bias is essential in order to learn meaningful predictors and properly evaluate their accuracy. Finally, we provide new approaches to interpret the trained models and extract simple, biologically interpretable representations of the learned sequence and structural motifs. Availability and implementation Source code can be accessed at https://www.github.com/HarveyYan/RNAonGraph. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Zichao Yan
- School of Computer Science, McGill University, Montreal, QC H3A 2B2, Canada.,MILA, Quebec AI Institute, Montreal, QC H2S 3H1, Canada
| | - William L Hamilton
- School of Computer Science, McGill University, Montreal, QC H3A 2B2, Canada.,MILA, Quebec AI Institute, Montreal, QC H2S 3H1, Canada
| | - Mathieu Blanchette
- School of Computer Science, McGill University, Montreal, QC H3A 2B2, Canada
| |
Collapse
|
75
|
Yang S, Liu X, Ng RT. ProbeRating: a recommender system to infer binding profiles for nucleic acid-binding proteins. Bioinformatics 2021; 36:4797-4804. [PMID: 32573679 PMCID: PMC7750938 DOI: 10.1093/bioinformatics/btaa580] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2019] [Revised: 05/18/2020] [Accepted: 06/18/2020] [Indexed: 12/15/2022] Open
Abstract
MOTIVATION The interaction between proteins and nucleic acids plays a crucial role in gene regulation and cell function. Determining the binding preferences of nucleic acid-binding proteins (NBPs), namely RNA-binding proteins (RBPs) and transcription factors (TFs), is the key to decipher the protein-nucleic acids interaction code. Today, available NBP binding data from in vivo or in vitro experiments are still limited, which leaves a large portion of NBPs uncovered. Unfortunately, existing computational methods that model the NBP binding preferences are mostly protein specific: they need the experimental data for a specific protein in interest, and thus only focus on experimentally characterized NBPs. The binding preferences of experimentally unexplored NBPs remain largely unknown. RESULTS Here, we introduce ProbeRating, a nucleic acid recommender system that utilizes techniques from deep learning and word embeddings of natural language processing. ProbeRating is developed to predict binding profiles for unexplored or poorly studied NBPs by exploiting their homologs NBPs which currently have available binding data. Requiring only sequence information as input, ProbeRating adapts FastText from Facebook AI Research to extract biological features. It then builds a neural network-based recommender system. We evaluate the performance of ProbeRating on two different tasks: one for RBP and one for TF. As a result, ProbeRating outperforms previous methods on both tasks. The results show that ProbeRating can be a useful tool to study the binding mechanism for the many NBPs that lack direct experimental evidence. and implementation. AVAILABILITY AND IMPLEMENTATION The source code is freely available at <https://github.com/syang11/ProbeRating>. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Shu Yang
- Department of Computer Science, University of British Columbia, Vancouver, BC V6T1Z4, Canada
| | - Xiaoxi Liu
- RIKEN Center for Integrative Medical Sciences (IMS), Yokohama 230-0045, Japan
| | - Raymond T Ng
- Department of Computer Science, University of British Columbia, Vancouver, BC V6T1Z4, Canada
| |
Collapse
|
76
|
Kuang S, Wang L. Identification and analysis of consensus RNA motifs binding to the genome regulator CTCF. NAR Genom Bioinform 2021; 2:lqaa031. [PMID: 33575587 PMCID: PMC7671415 DOI: 10.1093/nargab/lqaa031] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/03/2019] [Revised: 03/21/2020] [Accepted: 04/28/2020] [Indexed: 12/14/2022] Open
Abstract
CCCTC-binding factor (CTCF) is a key regulator of 3D genome organization and gene expression. Recent studies suggest that RNA transcripts, mostly long non-coding RNAs (lncRNAs), can serve as locus-specific factors to bind and recruit CTCF to the chromatin. However, it remains unclear whether specific sequence patterns are shared by the CTCF-binding RNA sites, and no RNA motif has been reported so far for CTCF binding. In this study, we have developed DeepLncCTCF, a new deep learning model based on a convolutional neural network and a bidirectional long short-term memory network, to discover the RNA recognition patterns of CTCF and identify candidate lncRNAs binding to CTCF. When evaluated on two different datasets, human U2OS dataset and mouse ESC dataset, DeepLncCTCF was shown to be able to accurately predict CTCF-binding RNA sites from nucleotide sequence. By examining the sequence features learned by DeepLncCTCF, we discovered a novel RNA motif with the consensus sequence, AGAUNGGA, for potential CTCF binding in humans. Furthermore, the applicability of DeepLncCTCF was demonstrated by identifying nearly 5000 candidate lncRNAs that might bind to CTCF in the nucleus. Our results provide useful information for understanding the molecular mechanisms of CTCF function in 3D genome organization.
Collapse
Affiliation(s)
- Shuzhen Kuang
- Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA.,Department of Biological Sciences, Clemson University, Clemson, SC 29634, USA
| | - Liangjiang Wang
- Department of Genetics and Biochemistry, Clemson University, Clemson, SC 29634, USA
| |
Collapse
|
77
|
Yuan L, Yang Y. DeCban: Prediction of circRNA-RBP Interaction Sites by Using Double Embeddings and Cross-Branch Attention Networks. Front Genet 2021; 11:632861. [PMID: 33552144 PMCID: PMC7862712 DOI: 10.3389/fgene.2020.632861] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/24/2020] [Accepted: 12/23/2020] [Indexed: 12/17/2022] Open
Abstract
Circular RNAs (circRNAs), as a rising star in the RNA world, play important roles in various biological processes. Understanding the interactions between circRNAs and RNA binding proteins (RBPs) can help reveal the functions of circRNAs. For the past decade, the emergence of high-throughput experimental data, like CLIP-Seq, has made the computational identification of RNA-protein interactions (RPIs) possible based on machine learning methods. However, as the underlying mechanisms of RPIs have not been fully understood yet and the information sources of circRNAs are limited, the computational tools for predicting circRNA-RBP interactions have been very few. In this study, we propose a deep learning method to identify circRNA-RBP interactions, called DeCban, which is featured by hybrid double embeddings for representing RNA sequences and a cross-branch attention neural network for classification. To capture more information from RNA sequences, the double embeddings include pre-trained embedding vectors for both RNA segments and their converted amino acids. Meanwhile, the cross-branch attention network aims to address the learning of very long sequences by integrating features of different scales and focusing on important information. The experimental results on 37 benchmark datasets show that both double embeddings and the cross-branch attention model contribute to the improvement of performance. DeCban outperforms the mainstream deep learning-based methods on not only prediction accuracy but also computational efficiency. The data sets and source code of this study are freely available at: https://github.com/AaronYll/DECban.
Collapse
Affiliation(s)
- Liangliang Yuan
- Department of Computer Science and Engineering, Center for Brain-Like Computing and Machine Intelligence, Shanghai Jiao Tong University, Shanghai, China
| | - Yang Yang
- Department of Computer Science and Engineering, Center for Brain-Like Computing and Machine Intelligence, Shanghai Jiao Tong University, Shanghai, China.,Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, China
| |
Collapse
|
78
|
Prediction of Protein-ATP Binding Residues Based on Ensemble of Deep Convolutional Neural Networks and LightGBM Algorithm. Int J Mol Sci 2021; 22:ijms22020939. [PMID: 33477866 PMCID: PMC7832895 DOI: 10.3390/ijms22020939] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/25/2020] [Revised: 01/13/2021] [Accepted: 01/16/2021] [Indexed: 12/13/2022] Open
Abstract
Accurately identifying protein-ATP binding residues is important for protein function annotation and drug design. Previous studies have used classic machine-learning algorithms like support vector machine (SVM) and random forest to predict protein-ATP binding residues; however, as new machine-learning techniques are being developed, the prediction performance could be further improved. In this paper, an ensemble predictor that combines deep convolutional neural network and LightGBM with ensemble learning algorithm is proposed. Three subclassifiers have been developed, including a multi-incepResNet-based predictor, a multi-Xception-based predictor, and a LightGBM predictor. The final prediction result is the combination of outputs from three subclassifiers with optimized weight distribution. We examined the performance of our proposed predictor using two datasets: a classic ATP-binding benchmark dataset and a newly proposed ATP-binding dataset. Our predictor achieved area under the curve (AUC) values of 0.925 and 0.902 and Matthews Correlation Coefficient (MCC) values of 0.639 and 0.642, respectively, which are both better than other state-of-art prediction methods.
Collapse
|
79
|
iPTT(2 L)-CNN: A Two-Layer Predictor for Identifying Promoters and Their Types in Plant Genomes by Convolutional Neural Network. COMPUTATIONAL AND MATHEMATICAL METHODS IN MEDICINE 2021; 2021:6636350. [PMID: 33488763 PMCID: PMC7803414 DOI: 10.1155/2021/6636350] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/24/2020] [Revised: 12/13/2020] [Accepted: 12/16/2020] [Indexed: 11/18/2022]
Abstract
A promoter is a short DNA sequence near to the start codon, responsible for initiating transcription of a specific gene in genome. The accurate recognition of promoters has great significance for a better understanding of the transcriptional regulation. Because of their importance in the process of biological transcriptional regulation, there is an urgent need to develop in silico tools to identify promoters and their types timely and accurately. A number of prediction methods had been developed in this regard; however, almost all of them were merely used for identifying promoters and their strength or sigma types. Owing to that TATA box region in TATA promoter that influences posttranscriptional processes, in the current study, we developed a two-layer predictor called iPTT(2L)-CNN by using the convolutional neural network (CNN) for identifying TATA and TATA-less promoters. The first layer can be used to identify a given DNA sequence as a promoter or nonpromoter. The second layer is used to identify whether the recognized promoter is TATA promoter or not. The 5-fold crossvalidation and independent testing results demonstrate that the constructed predictor is promising for identifying promoter and classifying TATA and TATA-less promoter. Furthermore, to make it easier for most experimental scientists get the results they need, a user-friendly web server has been established at http://www.jci-bioinfo.cn/iPPT(2L)-CNN.
Collapse
|
80
|
Xia CQ, Pan X, Yang Y, Huang Y, Shen HB. Recent Progresses of Computational Analysis of RNA-Protein Interactions. SYSTEMS MEDICINE 2021. [DOI: 10.1016/b978-0-12-801238-3.11315-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/26/2022] Open
|
81
|
Li G, Iyer B, Prasath VBS, Ni Y, Salomonis N. DeepImmuno: Deep learning-empowered prediction and generation of immunogenic peptides for T cell immunity. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2020:2020.12.24.424262. [PMID: 33398286 PMCID: PMC7781330 DOI: 10.1101/2020.12.24.424262] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
Abstract
T-cells play an essential role in the adaptive immune system by seeking out, binding and destroying foreign antigens presented on the cell surface of diseased cells. An improved understanding of T-cell immunity will greatly aid in the development of new cancer immunotherapies and vaccines for life threatening pathogens. Central to the design of such targeted therapies are computational methods to predict non-native epitopes to elicit a T cell response, however, we currently lack accurate immunogenicity inference methods. Another challenge is the ability to accurately simulate immunogenic peptides for specific human leukocyte antigen (HLA) alleles, for both synthetic biological applications and to augment real training datasets. Here, we proposed a beta-binomial distribution approach to derive epitope immunogenic potential from sequence alone. We conducted systematic benchmarking of five traditional machine learning (ElasticNet, KNN, SVM, Random Forest, AdaBoost) and three deep learning models (CNN, ResNet, GNN) using three independent prior validated immunogenic peptide collections (dengue virus, cancer neoantigen and SARS-Cov-2). We chose the CNN model as the best prediction model based on its adaptivity for small and large datasets, and performance relative to existing methods. In addition to outperforming two highly used immunogenicity prediction algorithms, DeepHLApan and IEDB, DeepImmuno-CNN further correctly predicts which residues are most important for T cell antigen recognition. Our independent generative adversarial network (GAN) approach, DeepImmuno-GAN, was further able to accurately simulate immunogenic peptides with physiochemical properties and immunogenicity predictions similar to that of real antigens. We provide DeepImmuno-CNN as source code and an easy-to-use web interface. DATA AVAILABILITY DeepImmuno Python3 code is available at https://github.com/frankligy/DeepImmuno . The DeepImmuno web portal is available from https://deepimmuno.herokuapp.com . The data in this article is available in GitHub and supplementary materials.
Collapse
Affiliation(s)
- Guangyuan Li
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA
- Department of Biomedical Informatics, College of Medicine, University of Cincinnati, OH, 45267 USA
| | - Balaji Iyer
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA
- Department of Electrical Engineering and Computer Science, University of Cincinnati, OH 45221 USA
| | - V. B. Surya Prasath
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA
- Department of Pediatrics, University of Cincinnati School of Medicine, Cincinnati, Ohio, USA
- Department of Biomedical Informatics, College of Medicine, University of Cincinnati, OH, 45267 USA
- Department of Electrical Engineering and Computer Science, University of Cincinnati, OH 45221 USA
| | - Yizhao Ni
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA
- Department of Pediatrics, University of Cincinnati School of Medicine, Cincinnati, Ohio, USA
- Department of Biomedical Informatics, College of Medicine, University of Cincinnati, OH, 45267 USA
| | - Nathan Salomonis
- Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH, USA
- Department of Pediatrics, University of Cincinnati School of Medicine, Cincinnati, Ohio, USA
- Department of Biomedical Informatics, College of Medicine, University of Cincinnati, OH, 45267 USA
- Department of Electrical Engineering and Computer Science, University of Cincinnati, OH 45221 USA
| |
Collapse
|
82
|
Pan X, Fang Y, Li X, Yang Y, Shen HB. RBPsuite: RNA-protein binding sites prediction suite based on deep learning. BMC Genomics 2020; 21:884. [PMID: 33297946 PMCID: PMC7724624 DOI: 10.1186/s12864-020-07291-6] [Citation(s) in RCA: 58] [Impact Index Per Article: 11.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2020] [Accepted: 11/28/2020] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND RNA-binding proteins (RBPs) play crucial roles in various biological processes. Deep learning-based methods have been demonstrated powerful on predicting RBP sites on RNAs. However, the training of deep learning models is very time-intensive and computationally intensive. RESULTS Here we present a deep learning-based RBPsuite, an easy-to-use webserver for predicting RBP binding sites on linear and circular RNAs. For linear RNAs, RBPsuite predicts the RBP binding scores with them using our updated iDeepS. For circular RNAs (circRNAs), RBPsuite predicts the RBP binding scores with them using our developed CRIP. RBPsuite first breaks the input RNA sequence into segments of 101 nucleotides and scores the interaction between the segments and the RBPs. RBPsuite further detects the verified motifs on the binding segments gives the binding scores distribution along the full-length sequence. CONCLUSIONS RBPsuite is an easy-to-use online webserver for predicting RBP binding sites and freely available at http://www.csbio.sjtu.edu.cn/bioinf/RBPsuite/ .
Collapse
Affiliation(s)
- Xiaoyong Pan
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, 200240, China.
| | - Yi Fang
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, 200240, China
| | - Xianfeng Li
- Key laboratory of Carcinogenesis and Translational Research, Peking University Cancer Hospital, Beijing, 100142, China
| | - Yang Yang
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, Center for Brain-Like Computing and Machine Intelligence, Shanghai, 200240, China
| | - Hong-Bin Shen
- Institute of Image Processing and Pattern Recognition, Shanghai Jiao Tong University, and Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai, 200240, China.
| |
Collapse
|
83
|
Alam T, Al-Absi HRH, Schmeier S. Deep Learning in LncRNAome: Contribution, Challenges, and Perspectives. Noncoding RNA 2020; 6:E47. [PMID: 33266128 PMCID: PMC7711891 DOI: 10.3390/ncrna6040047] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/25/2020] [Revised: 10/27/2020] [Accepted: 11/06/2020] [Indexed: 12/11/2022] Open
Abstract
Long non-coding RNAs (lncRNA), the pervasively transcribed part of the mammalian genome, have played a significant role in changing our protein-centric view of genomes. The abundance of lncRNAs and their diverse roles across cell types have opened numerous avenues for the research community regarding lncRNAome. To discover and understand lncRNAome, many sophisticated computational techniques have been leveraged. Recently, deep learning (DL)-based modeling techniques have been successfully used in genomics due to their capacity to handle large amounts of data and produce relatively better results than traditional machine learning (ML) models. DL-based modeling techniques have now become a choice for many modeling tasks in the field of lncRNAome as well. In this review article, we summarized the contribution of DL-based methods in nine different lncRNAome research areas. We also outlined DL-based techniques leveraged in lncRNAome, highlighting the challenges computational scientists face while developing DL-based models for lncRNAome. To the best of our knowledge, this is the first review article that summarizes the role of DL-based techniques in multiple areas of lncRNAome.
Collapse
Affiliation(s)
- Tanvir Alam
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar;
| | - Hamada R. H. Al-Absi
- College of Science and Engineering, Hamad Bin Khalifa University, Doha 34110, Qatar;
| | - Sebastian Schmeier
- School of Natural and Computational Sciences, Massey University, Auckland 0632, New Zealand;
| |
Collapse
|
84
|
Wang X, Wu Z, Qin W, Sun T, Lu S, Li Y, Wang Y, Hu X, Xu D, Wu Y, Chen Q, Yao W, Liu M, Wei M, Wu H. Long non-coding RNA ZFAS1 promotes colorectal cancer tumorigenesis and development through DDX21-POLR1B regulatory axis. Aging (Albany NY) 2020; 12:22656-22687. [PMID: 33202381 PMCID: PMC7746388 DOI: 10.18632/aging.103875] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2020] [Accepted: 05/25/2020] [Indexed: 12/19/2022]
Abstract
Increasing evidence supports long non-coding RNA-ZFAS1 as master protein regulators involved in a variety of human cancers. However, the molecular mechanism is not fully understood in colorectal cancer (CRC) and remains to be elucidated. Here, we uncovered a previously unreported mechanism linking RNA helicase DDX21 regulated by lncRNA ZFAS1 in control of POLR1B expression in CRC initiation and progression. Specifically, ZFAS1 exerted its oncogenic functions and was significantly up-regulated accompanied by elevated DDX21, POLR1B expression in CRC cells and tissues, which further closely associated with poor clinical outcomes. Notably, ZFAS1 knockdown dramatically suppressed CRC cell proliferation, invasion, migration, and increased cell apoptosis, which were contrary to the effect caused by ZFAS1 up-regulation. We further revealed that the inhibitory effect caused by ZFAS1 knockdown could be reversed by DDX21 overexpression in vitro and in vivo. Mechanistically, our research found that ZFAS1 could directly recruit DDX21 protein by harboring the specific motif (AAGA or CAGA). Finally, POLR1B was identified as the downstream target of DDX21 regulated by ZFAS1, which was also up-regulated in CRC cells and tissues and closely related to poor prognosis. The unrecognized ZFAS1/DDX21/POLR1B signaling regulation axis may provide new biomarkers and targets for CRC treatment and prognostic evaluation.
Collapse
Affiliation(s)
- Xiufang Wang
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| | - Zhikun Wu
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| | - Wenyan Qin
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| | - Tong Sun
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| | - Senxu Lu
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| | - Yalun Li
- Department of Anorectal Surgery, First Hospital of China Medical University, Shenyang 110001, P. R. China
| | - Yuanhe Wang
- Department of Medical Oncology, Cancer Hospital of China Medical University, Department of Medical Oncology, Liaoning Cancer Hospital and Institute, Shenyang 110042, P. R. China
| | - Xiaoyun Hu
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| | - Dongping Xu
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| | - Yutong Wu
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| | - Qiuchen Chen
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| | - Weifan Yao
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| | - Mingyan Liu
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| | - Minjie Wei
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| | - Huizhe Wu
- Department of Pharmacology, School of Pharmacy, China Medical University, Shenyang 110122, P. R. China.,Liaoning Key Laboratory of Molecular Targeted Anti-Tumor Drug Development and Evaluation, Liaoning Cancer Immune Peptide Drug Engineering Technology Research Center, Key Laboratory of Precision Diagnosis and Treatment of Gastrointestinal Tumors, Ministry of Education, China Medical University, Shenyang 110122, P. R. China
| |
Collapse
|
85
|
He Y, Shen Z, Zhang Q, Wang S, Huang DS. A survey on deep learning in DNA/RNA motif mining. Brief Bioinform 2020; 22:5916939. [PMID: 33005921 PMCID: PMC8293829 DOI: 10.1093/bib/bbaa229] [Citation(s) in RCA: 44] [Impact Index Per Article: 8.8] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2020] [Revised: 08/19/2020] [Accepted: 08/24/2020] [Indexed: 01/18/2023] Open
Abstract
DNA/RNA motif mining is the foundation of gene function research. The DNA/RNA motif mining plays an extremely important role in identifying the DNA- or RNA-protein binding site, which helps to understand the mechanism of gene regulation and management. For the past few decades, researchers have been working on designing new efficient and accurate algorithms for mining motif. These algorithms can be roughly divided into two categories: the enumeration approach and the probabilistic method. In recent years, machine learning methods had made great progress, especially the algorithm represented by deep learning had achieved good performance. Existing deep learning methods in motif mining can be roughly divided into three types of models: convolutional neural network (CNN) based models, recurrent neural network (RNN) based models, and hybrid CNN–RNN based models. We introduce the application of deep learning in the field of motif mining in terms of data preprocessing, features of existing deep learning architectures and comparing the differences between the basic deep learning models. Through the analysis and comparison of existing deep learning methods, we found that the more complex models tend to perform better than simple ones when data are sufficient, and the current methods are relatively simple compared with other fields such as computer vision, language processing (NLP), computer games, etc. Therefore, it is necessary to conduct a summary in motif mining by deep learning, which can help researchers understand this field.
Collapse
Affiliation(s)
- Ying He
- computer science and technology at Tongji University, China
| | - Zhen Shen
- computer science and technology at Tongji University, China
| | - Qinhu Zhang
- computer science and technology at Tongji University, China
| | - Siguo Wang
- computer science and technology at Tongji University, China
| | - De-Shuang Huang
- Institute of Machines Learning and Systems Biology, Tongji University
| |
Collapse
|
86
|
Wang K, Hu G, Wu Z, Su H, Yang J, Kurgan L. Comprehensive Survey and Comparative Assessment of RNA-Binding Residue Predictions with Analysis by RNA Type. Int J Mol Sci 2020; 21:E6879. [PMID: 32961749 PMCID: PMC7554811 DOI: 10.3390/ijms21186879] [Citation(s) in RCA: 19] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/16/2020] [Revised: 09/15/2020] [Accepted: 09/17/2020] [Indexed: 02/07/2023] Open
Abstract
With close to 30 sequence-based predictors of RNA-binding residues (RBRs), this comparative survey aims to help with understanding and selection of the appropriate tools. We discuss past reviews on this topic, survey a comprehensive collection of predictors, and comparatively assess six representative methods. We provide a novel and well-designed benchmark dataset and we are the first to report and compare protein-level and datasets-level results, and to contextualize performance to specific types of RNAs. The methods considered here are well-cited and rely on machine learning algorithms on occasion combined with homology-based prediction. Empirical tests reveal that they provide relatively accurate predictions. Virtually all methods perform well for the proteins that interact with rRNAs, some generate accurate predictions for mRNAs, snRNA, SRP and IRES, while proteins that bind tRNAs are predicted poorly. Moreover, except for DRNApred, they confuse DNA and RNA-binding residues. None of the six methods consistently outperforms the others when tested on individual proteins. This variable and complementary protein-level performance suggests that users should not rely on applying just the single best dataset-level predictor. We recommend that future work should focus on the development of approaches that facilitate protein-level selection of accurate predictors and the consensus-based prediction of RBRs.
Collapse
Affiliation(s)
- Kui Wang
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China; (K.W.); (Z.W.); (H.S.); (J.Y.)
| | - Gang Hu
- School of Statistics and Data Science, LPMC and KLMDASR, Nankai University, Tianjin 300071, China;
| | - Zhonghua Wu
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China; (K.W.); (Z.W.); (H.S.); (J.Y.)
| | - Hong Su
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China; (K.W.); (Z.W.); (H.S.); (J.Y.)
| | - Jianyi Yang
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin 300071, China; (K.W.); (Z.W.); (H.S.); (J.Y.)
| | - Lukasz Kurgan
- Department of Computer Science, Virginia Commonwealth University, Richmond, VA 23284, USA
| |
Collapse
|
87
|
Zeng M, Zhang F, Wu FX, Li Y, Wang J, Li M. Protein-protein interaction site prediction through combining local and global features with deep neural networks. Bioinformatics 2020; 36:1114-1120. [PMID: 31593229 DOI: 10.1093/bioinformatics/btz699] [Citation(s) in RCA: 92] [Impact Index Per Article: 18.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2019] [Revised: 07/25/2019] [Accepted: 09/04/2019] [Indexed: 12/21/2022] Open
Abstract
MOTIVATION Protein-protein interactions (PPIs) play important roles in many biological processes. Conventional biological experiments for identifying PPI sites are costly and time-consuming. Thus, many computational approaches have been proposed to predict PPI sites. Existing computational methods usually use local contextual features to predict PPI sites. Actually, global features of protein sequences are critical for PPI site prediction. RESULTS A new end-to-end deep learning framework, named DeepPPISP, through combining local contextual and global sequence features, is proposed for PPI site prediction. For local contextual features, we use a sliding window to capture features of neighbors of a target amino acid as in previous studies. For global sequence features, a text convolutional neural network is applied to extract features from the whole protein sequence. Then the local contextual and global sequence features are combined to predict PPI sites. By integrating local contextual and global sequence features, DeepPPISP achieves the state-of-the-art performance, which is better than the other competing methods. In order to investigate if global sequence features are helpful in our deep learning model, we remove or change some components in DeepPPISP. Detailed analyses show that global sequence features play important roles in DeepPPISP. AVAILABILITY AND IMPLEMENTATION The DeepPPISP web server is available at http://bioinformatics.csu.edu.cn/PPISP/. The source code can be obtained from https://github.com/CSUBioGroup/DeepPPISP. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Min Zeng
- School of Computer Science and Engineering, Central South University, Changsha 410083, People's Republic of China
| | - Fuhao Zhang
- School of Computer Science and Engineering, Central South University, Changsha 410083, People's Republic of China
| | - Fang-Xiang Wu
- Division of Biomedical Engineering and Department of Mechanical Engineering, University of Saskatchewan, Saskatoon SKS7N5A9, Canada
| | - Yaohang Li
- Department of Computer Science, Old Dominion University, Norfolk, VA 23529, USA
| | - Jianxin Wang
- School of Computer Science and Engineering, Central South University, Changsha 410083, People's Republic of China
| | - Min Li
- School of Computer Science and Engineering, Central South University, Changsha 410083, People's Republic of China
| |
Collapse
|
88
|
Lopez-Del Rio A, Martin M, Perera-Lluna A, Saidi R. Effect of sequence padding on the performance of deep learning models in archaeal protein functional prediction. Sci Rep 2020; 10:14634. [PMID: 32884053 PMCID: PMC7471694 DOI: 10.1038/s41598-020-71450-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2020] [Accepted: 08/06/2020] [Indexed: 11/08/2022] Open
Abstract
The use of raw amino acid sequences as input for deep learning models for protein functional prediction has gained popularity in recent years. This scheme obliges to manage proteins with different lengths, while deep learning models require same-shape input. To accomplish this, zeros are usually added to each sequence up to a established common length in a process called zero-padding. However, the effect of different padding strategies on model performance and data structure is yet unknown. We propose and implement four novel types of padding the amino acid sequences. Then, we analysed the impact of different ways of padding the amino acid sequences in a hierarchical Enzyme Commission number prediction problem. Results show that padding has an effect on model performance even when there are convolutional layers implied. Contrastingly to most of deep learning works which focus mainly on architectures, this study highlights the relevance of the deemed-of-low-importance process of padding and raises awareness of the need to refine it for better performance. The code of this analysis is publicly available at https://github.com/b2slab/padding_benchmark .
Collapse
Affiliation(s)
- Angela Lopez-Del Rio
- B2SLab, Department d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028, Barcelona, Spain.
- Department of Biomedical Engineering, Institut de Recerca Pediàtrica Hospital Sant Joan de Dèu, 08950, Esplugues de Llobregat, Spain.
| | - Maria Martin
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
| | - Alexandre Perera-Lluna
- B2SLab, Department d'Enginyeria de Sistemes, Automàtica i Informàtica Industrial, Universitat Politècnica de Catalunya, 08028, Barcelona, Spain
- Department of Biomedical Engineering, Institut de Recerca Pediàtrica Hospital Sant Joan de Dèu, 08950, Esplugues de Llobregat, Spain
| | - Rabie Saidi
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Hinxton, CB10 1SD, UK
| |
Collapse
|
89
|
Yang H, Deng Z, Pan X, Shen HB, Choi KS, Wang L, Wang S, Wu J. RNA-binding protein recognition based on multi-view deep feature and multi-label learning. Brief Bioinform 2020; 22:5893431. [PMID: 32808039 DOI: 10.1093/bib/bbaa174] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2020] [Revised: 06/17/2020] [Accepted: 07/09/2020] [Indexed: 12/28/2022] Open
Abstract
RNA-binding protein (RBP) is a class of proteins that bind to and accompany RNAs in regulating biological processes. An RBP may have multiple target RNAs, and its aberrant expression can cause multiple diseases. Methods have been designed to predict whether a specific RBP can bind to an RNA and the position of the binding site using binary classification model. However, most of the existing methods do not take into account the binding similarity and correlation between different RBPs. While methods employing multiple labels and Long Short Term Memory Network (LSTM) are proposed to consider binding similarity between different RBPs, the accuracy remains low due to insufficient feature learning and multi-label learning on RNA sequences. In response to this challenge, the concept of RNA-RBP Binding Network (RRBN) is proposed in this paper to provide theoretical support for multi-label learning to identify RBPs that can bind to RNAs. It is experimentally shown that the RRBN information can significantly improve the prediction of unknown RNA-RBP interactions. To further improve the prediction accuracy, we present the novel computational method iDeepMV which integrates multi-view deep learning technology under the multi-label learning framework. iDeepMV first extracts data from the views of amino acid sequence and dipeptide component based on the RNA sequences as the original view. Deep neural network models are then designed for the respective views to perform deep feature learning. The extracted deep features are fed into multi-label classifiers which are trained with the RNA-RBP interaction information for the three views. Finally, a voting mechanism is designed to make comprehensive decision on the results of the multi-label classifiers. Our experimental results show that the prediction performance of iDeepMV, which combines multi-view deep feature learning models with RNA-RBP interaction information, is significantly better than that of the state-of-the-art methods. iDeepMV is freely available at http://www.csbio.sjtu.edu.cn/bioinf/iDeepMV for academic use. The code is freely available at http://github.com/uchihayht/iDeepMV.
Collapse
Affiliation(s)
| | - Zhaohong Deng
- School of Artificial Intelligence and Computer Science of Jiangnan University, Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (LCNBI) and ZJLab
| | - Xiaoyong Pan
- Department of Automation of Shanghai Jiao Tong University
| | | | | | - Lei Wang
- School of Biotechnology and Key Laboratory of Industrial Biotechnology Ministry in Jiangnan University
| | - Shitong Wang
- School of Artificial Intelligence and Computer Science of Jiangnan University
| | - Jing Wu
- School of Biotechnology and Key Laboratory of Industrial Biotechnology Ministry in Jiangnan University
| |
Collapse
|
90
|
DeepPred-SubMito: A Novel Submitochondrial Localization Predictor Based on Multi-Channel Convolutional Neural Network and Dataset Balancing Treatment. Int J Mol Sci 2020; 21:ijms21165710. [PMID: 32784927 PMCID: PMC7460811 DOI: 10.3390/ijms21165710] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2020] [Revised: 08/05/2020] [Accepted: 08/07/2020] [Indexed: 12/18/2022] Open
Abstract
Mitochondrial proteins are physiologically active in different compartments, and their abnormal location will trigger the pathogenesis of human mitochondrial pathologies. Correctly identifying submitochondrial locations can provide information for disease pathogenesis and drug design. A mitochondrion has four submitochondrial compartments, the matrix, the outer membrane, the inner membrane, and the intermembrane space, but various existing studies ignored the intermembrane space. The majority of researchers used traditional machine learning methods for predicting mitochondrial protein localization. Those predictors required expert-level knowledge of biology to be encoded as features rather than allowing the underlying predictor to extract features through a data-driven procedure. Besides, few researchers have considered the imbalance in datasets. In this paper, we propose a novel end-to-end predictor employing deep neural networks, DeepPred-SubMito, for protein submitochondrial location prediction. First, we utilize random over-sampling to decrease the influence caused by unbalanced datasets. Next, we train a multi-channel bilayer convolutional neural network for multiple subsequences to learn high-level features. Third, the prediction result is outputted through the fully connected layer. The performance of the predictor is measured by 10-fold cross-validation and 5-fold cross-validation on the SM424-18 dataset and the SubMitoPred dataset, respectively. Experimental results show that the predictor outperforms state-of-the-art predictors. In addition, the prediction of results in the M983 dataset also confirmed its effectiveness in predicting submitochondrial locations.
Collapse
|
91
|
Fan Y, Chen M, Zhu Q, Wang W. Inferring Disease-Associated Microbes Based on Multi-Data Integration and Network Consistency Projection. Front Bioeng Biotechnol 2020; 8:831. [PMID: 32850711 PMCID: PMC7418576 DOI: 10.3389/fbioe.2020.00831] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2020] [Accepted: 06/29/2020] [Indexed: 12/18/2022] Open
Abstract
Plenty of microbes in our human body play a vital role in the process of cell physiology. In recent years, there is accumulating evidence indicating that microbes are closely related to many complex human diseases. In-depth investigation of disease-associated microbes can contribute to understanding the pathogenesis of diseases and thus provide novel strategies for the treatment, diagnosis, and prevention of diseases. To date, many computational models have been proposed for predicting microbe-disease associations using available similarity networks. However, these similarity networks are not effectively fused. In this study, we proposed a novel computational model based on multi-data integration and network consistency projection for Human Microbe-Disease Associations Prediction (HMDA-Pred), which fuses multiple similarity networks by a linear network fusion method. HMDA-Pred yielded AUC values of 0.9589 and 0.9361 ± 0.0037 in the experiments of leave-one-out cross validation (LOOCV) and 5-fold cross validation (5-fold CV), respectively. Furthermore, in case studies, 10, 8, and 10 out of the top 10 predicted microbes of asthma, colon cancer, and inflammatory bowel disease were confirmed by the literatures, respectively.
Collapse
Affiliation(s)
- Yongxian Fan
- School of Computer Science and Information Security, Guilin University of Electronic Technology, Guilin, China
| | | | | | | |
Collapse
|
92
|
PRIME-3D2D is a 3D2D model to predict binding sites of protein-RNA interaction. Commun Biol 2020; 3:384. [PMID: 32678300 PMCID: PMC7366699 DOI: 10.1038/s42003-020-1114-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2019] [Accepted: 06/29/2020] [Indexed: 11/08/2022] Open
Abstract
Protein-RNA interaction participates in many biological processes. So, studying protein–RNA interaction can help us to understand the function of protein and RNA. Although the protein–RNA 3D3D model, like PRIME, was useful in building 3D structural complexes, it can’t be used genome-wide, due to lacking RNA 3D structures. To take full advantage of RNA secondary structures revealed from high-throughput sequencing, we present PRIME-3D2D to predict binding sites of protein–RNA interaction. PRIME-3D2D is almost as good as PRIME at modeling protein–RNA complexes. PRIME-3D2D can be used to predict binding sites on PDB data (MCC = 0.75/0.70 for binding sites in protein/RNA) and transcription-wide (MCC = 0.285 for binding sites in RNA). Testing on PDB and yeast transcription-wide data show that PRIME-3D2D performs better than other binding sites predictor. So, PRIME-3D2D can be used to predict the binding sites both on PDB and genome-wide, and it’s freely available. Xie et al. report a new computational method PRIME-3D2D to predict binding sites of protein–RNA interaction by considering protein 3D structure and RNA 2D structure. It is freely available, performs better than other binding sites predictor and is as good as PRIME to model protein–RNA complex.
Collapse
|
93
|
Liu L, Hu X, Feng Z, Wang S, Sun K, Xu S. Recognizing Ion Ligand-Binding Residues by Random Forest Algorithm Based on Optimized Dihedral Angle. Front Bioeng Biotechnol 2020; 8:493. [PMID: 32596216 PMCID: PMC7303464 DOI: 10.3389/fbioe.2020.00493] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2019] [Accepted: 04/28/2020] [Indexed: 11/26/2022] Open
Abstract
The prediction of ion ligand–binding residues in protein sequences is a challenging work that contributes to understand the specific functions of proteins in life processes. In this article, we selected binding residues of 14 ion ligands as research objects, including four acid radical ion ligands and 10 metal ion ligands. Based on the amino acid sequence information, we selected the composition and position conservation information of amino acids, the predicted structural information, and physicochemical properties of amino acids as basic feature parameters. We then performed a statistical analysis and reclassification for dihedral angle and proposed new methods on the extraction of feature parameters. The methods mainly included applying information entropy on the extraction of polarization charge and hydrophilic–hydrophobic information of amino acids and using position weight matrices on the extraction of position conservation information. In the prediction model, we used the random forest algorithm and obtained better prediction results than previous works. With the independent test, the Matthew's correlation coefficient and accuracy of 10 metal ion ligand–binding residues were larger than 0.07 and 52%, respectively; the corresponding evaluation values of four acid radical ion ligand–binding residues were larger than 0.15 and 86%, respectively. Further, we classified and combined the phi and psi angles and optimized prediction model for each ion ligand–binding residue.
Collapse
Affiliation(s)
- Liu Liu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Xiuzhen Hu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Zhenxing Feng
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Shan Wang
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Kai Sun
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| | - Shuang Xu
- College of Sciences, Inner Mongolia University of Technology, Hohhot, China
| |
Collapse
|
94
|
Song J, Tian S, Yu L, Xing Y, Yang Q, Duan X, Dai Q. AC-Caps: Attention Based Capsule Network for Predicting RBP Binding Sites of LncRNA. Interdiscip Sci 2020; 12:414-423. [PMID: 32572768 DOI: 10.1007/s12539-020-00379-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/11/2020] [Revised: 05/18/2020] [Accepted: 05/30/2020] [Indexed: 01/03/2023]
Abstract
Long non-coding RNA(lncRNA) is one of the non-coding RNAs longer than 200 nucleotides and it has no protein encoding function. LncRNA plays a key role in many biological processes. Studying the RNA-binding protein (RBP) binding sites on the lncRNA chain helps to reveal epigenetic and post-transcriptional mechanisms, to explore the physiological and pathological processes of cancer, and to discover new therapeutic breakthroughs. To improve the recognition rate of RBP binding sites and reduce the experimental time and cost, many calculation methods based on domain knowledge to predict RBP binding sites have emerged. However, these prediction methods are independent of nucleotides and do not take into account nucleotide statistics. In this paper, we use a high-order statistical-based encoding scheme, then the encoded lncRNA sequences are fed into a hybrid deep learning architecture named AC-Caps. It consists of a joint processing layer(composed of attention mechanism and convolutional neural network) and a capsule network. The AC-Caps model was evaluated using 31 independent experimental data sets from 12 lncRNA-binding proteins. In experiments, our method achieves excellent performance, with an average area under the curve (AUC) of 0.967 and an average accuracy (ACC) of 92.5%, which are 0.014, 2.3%, 0.261, 28.9%, 0.189, and 21.8% higher than HOCCNNLB, iDeepS, and DeepBind, respectively. The results show that the AC-Caps method can reliably process the large-scale RBP binding site data on the lncRNA chain, and the prediction performance is better than existing deep-learning models. The source code of AC-Caps and the datasets used in this paper are available at https://github.com/JinmiaoS/AC-Caps .
Collapse
Affiliation(s)
- Jinmiao Song
- School of Information Science and Engineering, Xinjiang University, Urumqi, 830008, China
- Dalian Key Lab of Digital Technology for National Culture, Dalian Minzu University, Dalian, 116600, China
| | - Shengwei Tian
- School of Software, Xinjiang University, Urumqi, 830046, China.
| | - Long Yu
- Network Center, Xinjiang University, Urumqi, 830046, China
| | - Yan Xing
- Imaging Center, Xinjiang Medical University Affiliated First Hospital, Urumqi, 830011, China.
| | - Qimeng Yang
- School of Information Science and Engineering, Xinjiang University, Urumqi, 830008, China
| | - Xiaodong Duan
- Dalian Key Lab of Digital Technology for National Culture, Dalian Minzu University, Dalian, 116600, China
| | - Qiguo Dai
- Dalian Key Lab of Digital Technology for National Culture, Dalian Minzu University, Dalian, 116600, China
| |
Collapse
|
95
|
Zhang SW, Zhang XX, Fan XN, Li WN. LPI-CNNCP: Prediction of lncRNA-protein interactions by using convolutional neural network with the copy-padding trick. Anal Biochem 2020; 601:113767. [PMID: 32454029 DOI: 10.1016/j.ab.2020.113767] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/23/2020] [Revised: 04/27/2020] [Accepted: 05/01/2020] [Indexed: 11/17/2022]
Abstract
Long noncoding RNAs (lncRNAs) play critical roles in many pathological and biological processes, such as post-transcription, cell differentiation and gene regulation. Increasingly more studies have shown that lncRNAs function through mainly interactions with specific RNA binding proteins (RBPs). However, experimental identification of potential lncRNA-protein interactions is costly and time-consuming. In this work, we propose a novel convolutional neural network-based method with the copy-padding trick (named LPI-CNNCP) to predict lncRNA-protein interactions. The copy-padding trick of the LPI-CNNCP convert the protein/RNA sequences with variable-length into the fixed-length sequences, thus enabling the construction of the CNN model. A high-order one-hot encoding is also applied to transform the protein/RNA sequences into image-like inputs for capturing the dependencies among amino acids (or nucleotides). In the end, these encoded protein/RNA sequences are feed into a CNN to predict the lncRNA-protein interactions. Compared with other state-of-the-art methods in 10-fold cross-validation (10CV) test, LPI-CNNCP shows the best performance. Results in the independent test demonstrate that our LPI-CNNCP can effectively predict the potential lncRNA-protein interactions. We also compared the copy-padding trick with two other existing tricks (i.e., zero-padding and cropping), and the results show that our copy-padding rick outperforms the zero-padding and cropping tricks on predicting lncRNA-protein interactions. The source code of LPI-CNNCP and the datasets used in this work are available at https://github.com/NWPU-903PR/LPI-CNNCP for academic users.
Collapse
Affiliation(s)
- Shao-Wu Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China.
| | - Xi-Xi Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
| | - Xiao-Nan Fan
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
| | - Wei-Na Li
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi'an, 710072, China
| |
Collapse
|
96
|
Jia C, Bi Y, Chen J, Leier A, Li F, Song J. PASSION: an ensemble neural network approach for identifying the binding sites of RBPs on circRNAs. Bioinformatics 2020; 36:4276-4282. [DOI: 10.1093/bioinformatics/btaa522] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2019] [Revised: 04/09/2020] [Accepted: 05/13/2020] [Indexed: 12/17/2022] Open
Abstract
AbstractMotivationDifferent from traditional linear RNAs (containing 5′ and 3′ ends), circular RNAs (circRNAs) are a special type of RNAs that have a closed ring structure. Accumulating evidence has indicated that circRNAs can directly bind proteins and participate in a myriad of different biological processes.ResultsFor identifying the interaction of circRNAs with 37 different types of circRNA-binding proteins (RBPs), we develop an ensemble neural network, termed PASSION, which is based on the concatenated artificial neural network (ANN) and hybrid deep neural network frameworks. Specifically, the input of the ANN is the optimal feature subset for each RBP, which has been selected from six types of feature encoding schemes through incremental feature selection and application of the XGBoost algorithm. In turn, the input of the hybrid deep neural network is a stacked codon-based scheme. Benchmarking experiments indicate that the ensemble neural network reaches the average best area under the curve (AUC) of 0.883 across the 37 circRNA datasets when compared with XGBoost, k-nearest neighbor, support vector machine, random forest, logistic regression and Naive Bayes. Moreover, each of the 37 RBP models is extensively tested by performing independent tests, with the varying sequence similarity thresholds of 0.8, 0.7, 0.6 and 0.5, respectively. The corresponding average AUC obtained are 0.883, 0.876, 0.868 and 0.883, respectively, highlighting the effectiveness and robustness of PASSION. Extensive benchmarking experiments demonstrate that PASSION achieves a competitive performance for identifying binding sites between circRNA and RBPs, when compared with several state-of-the-art methods.Availability and implementationA user-friendly web server of PASSION is publicly accessible at http://flagship.erc.monash.edu/PASSION/.Supplementary informationSupplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Cangzhi Jia
- School of Science, Dalian Maritime University, Dalian 116026, China
| | - Yue Bi
- School of Science, Dalian Maritime University, Dalian 116026, China
| | - Jinxiang Chen
- Department of Biochemistry and Molecular Biology, Monash Biomedicine Discovery Institute
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - André Leier
- Department of Genetics, School of Medicine
- Department of Cell, Developmental and Integrative Biology, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, USA
| | - Fuyi Li
- Department of Biochemistry and Molecular Biology, Monash Biomedicine Discovery Institute
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
| | - Jiangning Song
- Department of Biochemistry and Molecular Biology, Monash Biomedicine Discovery Institute
- Monash Centre for Data Science, Faculty of Information Technology, Monash University, Melbourne, VIC 3800, Australia
- ARC Centre of Excellence for Advanced Molecular Imaging, Monash University, Melbourne, VIC 3800, Australia
| |
Collapse
|
97
|
Zeng F, Fang G, Yao L. A Deep Neural Network for Identifying DNA N4-Methylcytosine Sites. Front Genet 2020; 11:209. [PMID: 32211035 PMCID: PMC7067889 DOI: 10.3389/fgene.2020.00209] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/22/2019] [Accepted: 02/21/2020] [Indexed: 11/25/2022] Open
Abstract
Motivation: N4-methylcytosine (4mC) plays an important role in host defense and transcriptional regulation. Accurate identification of 4mc sites provides a more comprehensive understanding of its biological effects. At present, the traditional machine learning algorithms are used in the research on 4mC sites prediction, but the complexity of the algorithms is relatively high, which is not suitable for the processing of large data sets, and the accuracy of prediction needs to be improved. Therefore, it is necessary to develop a new and effective method to accurately identify 4mC sites. Results: In this work, we found a large number of 4mC sites and non 4mC sites of Caenorhabditis elegans (C. elegans) from the latest MethSMRT website, which greatly expanded the dataset of C. elegans, and developed a hybrid deep neural network framework named 4mcDeep-CBI, aiming to identify 4mC sites. In order to obtain the high latitude information of the feature, we input the preliminary extracted features into the Convolutional Neural Network (CNN) and Bidirectional Long Short Term Memory network (BLSTM) to generate advanced features. Taking the advanced features as algorithm input, we have proposed an integrated algorithm to improve feature representation. Experimental results on large new dataset show that the proposed predictor is able to achieve generally better performance in identifying 4mC sites as compared to the state-of-art predictor. Notably, this is the first study of identifying 4mC sites using deep neural network. Moreover, our model runs much faster than the state-of-art predictor.
Collapse
Affiliation(s)
- Feng Zeng
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Guanyun Fang
- School of Computer Science and Engineering, Central South University, Changsha, China
| | - Lan Yao
- College of Mathematics and Econometrics, Hunan University, Changsha, China
| |
Collapse
|
98
|
Zhou G, Chen M, Ju CJT, Wang Z, Jiang JY, Wang W. Mutation effect estimation on protein-protein interactions using deep contextualized representation learning. NAR Genom Bioinform 2020; 2:lqaa015. [PMID: 32166223 PMCID: PMC7059401 DOI: 10.1093/nargab/lqaa015] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/05/2019] [Revised: 01/20/2020] [Accepted: 02/24/2020] [Indexed: 12/14/2022] Open
Abstract
The functional impact of protein mutations is reflected on the alteration of conformation and thermodynamics of protein–protein interactions (PPIs). Quantifying the changes of two interacting proteins upon mutations is commonly carried out by computational approaches. Hence, extensive research efforts have been put to the extraction of energetic or structural features on proteins, followed by statistical learning methods to estimate the effects of mutations on PPI properties. Nonetheless, such features require extensive human labors and expert knowledge to obtain, and have limited abilities to reflect point mutations. We present an end-to-end deep learning framework, MuPIPR (Mutation Effects in Protein–protein Interaction PRediction Using Contextualized Representations), to estimate the effects of mutations on PPIs. MuPIPR incorporates a contextualized representation mechanism of amino acids to propagate the effects of a point mutation to surrounding amino acid representations, therefore amplifying the subtle change in a long protein sequence. On top of that, MuPIPR leverages a Siamese residual recurrent convolutional neural encoder to encode a wild-type protein pair and its mutation pair. Multi-layer perceptron regressors are applied to the protein pair representations to predict the quantifiable changes of PPI properties upon mutations. Experimental evaluations show that, with only sequence information, MuPIPR outperforms various state-of-the-art systems on estimating the changes of binding affinity for SKEMPI v1, and offers comparable performance on SKEMPI v2. Meanwhile, MuPIPR also demonstrates state-of-the-art performance on estimating the changes of buried surface areas. The software implementation is available at https://github.com/guangyu-zhou/MuPIPR.
Collapse
Affiliation(s)
- Guangyu Zhou
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA
| | - Muhao Chen
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA.,Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Chelsea J T Ju
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA
| | - Zheng Wang
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA
| | - Jyun-Yu Jiang
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA
| | - Wei Wang
- Department of Computer Science, University of California, Los Angeles, CA 90095, USA
| |
Collapse
|
99
|
Luo X, Tu X, Ding Y, Gao G, Deng M. Expectation pooling: an effective and interpretable pooling method for predicting DNA-protein binding. Bioinformatics 2020; 36:1405-1412. [PMID: 31598637 PMCID: PMC7703793 DOI: 10.1093/bioinformatics/btz768] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/11/2019] [Revised: 09/21/2019] [Accepted: 10/05/2019] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Convolutional neural networks (CNNs) have outperformed conventional methods in modeling the sequence specificity of DNA-protein binding. While previous studies have built a connection between CNNs and probabilistic models, simple models of CNNs cannot achieve sufficient accuracy on this problem. Recently, some methods of neural networks have increased performance using complex neural networks whose results cannot be directly interpreted. However, it is difficult to combine probabilistic models and CNNs effectively to improve DNA-protein binding predictions. RESULTS In this article, we present a novel global pooling method: expectation pooling for predicting DNA-protein binding. Our pooling method stems naturally from the expectation maximization algorithm, and its benefits can be interpreted both statistically and via deep learning theory. Through experiments, we demonstrate that our pooling method improves the prediction performance DNA-protein binding. Our interpretable pooling method combines probabilistic ideas with global pooling by taking the expectations of inputs without increasing the number of parameters. We also analyze the hyperparameters in our method and propose optional structures to help fit different datasets. We explore how to effectively utilize these novel pooling methods and show that combining statistical methods with deep learning is highly beneficial, which is promising and meaningful for future studies in this field. AVAILABILITY AND IMPLEMENTATION All code is public in https://github.com/gao-lab/ePooling. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Xinming Tu
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and the State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences
| | - Yang Ding
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and the State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences
| | - Ge Gao
- Biomedical Pioneering Innovation Center (BIOPIC), Beijing Advanced Innovation Center for Genomics (ICG), Center for Bioinformatics (CBI), and the State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences
| | - Minghua Deng
- School of Mathematical Sciences
- Center for Quantitative Biology, Peking University, Beijing 100871, China
| |
Collapse
|
100
|
Feng SH, Zhang WX, Yang J, Yang Y, Shen HB. Topology Prediction Improvement of α-helical Transmembrane Proteins Through Helix-tail Modeling and Multiscale Deep Learning Fusion. J Mol Biol 2020; 432:1279-1296. [DOI: 10.1016/j.jmb.2019.12.007] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2019] [Revised: 12/02/2019] [Accepted: 12/04/2019] [Indexed: 12/18/2022]
|