1
Chamberlain AR, Huynh L, Huang W, Taylor DJ, Harris ME. The specificity landscape of bacterial ribonuclease P. J Biol Chem 2024; 300:105498. [PMID: 38013087 PMCID: PMC10731613 DOI: 10.1016/j.jbc.2023.105498] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Revised: 11/14/2023] [Accepted: 11/17/2023] [Indexed: 11/29/2023] Open
Abstract
Developing quantitative models of substrate specificity for RNA processing enzymes is a key step toward understanding their biology and guiding applications in biotechnology and biomedicine. Optimally, models to predict relative rate constants for alternative substrates should integrate an understanding of structures of the enzyme bound to "fast" and "slow" substrates, large datasets of rate constants for alternative substrates, and transcriptomic data identifying in vivo processing sites. Such data are either available or emerging for bacterial ribonucleoprotein RNase P, a widespread and essential tRNA 5' processing endonuclease, thus making it a valuable model system for investigating principles of biological specificity. Indeed, the well-established structure and kinetics of bacterial RNase P enabled the development of high-throughput measurements of rate constants for tRNA variants and provided the necessary framework for quantitative specificity modeling. Several studies document the importance of conformational changes in the precursor tRNA substrate as well as the RNA and protein subunits of bacterial RNase P during binding, although the functional roles and dynamics are still being resolved. Recently, results from cryo-EM studies of E. coli RNase P with alternative precursor tRNAs are revealing prospective mechanistic relationships between conformational changes and substrate specificity. Yet, extensive uncharted territory remains, including leveraging these advances for drug discovery, achieving a complete accounting of RNase P substrates, and understanding how the cellular context contributes to RNA processing specificity in vivo.
Affiliation(s)
- Loc Huynh
  - Department of Chemistry, University of Florida, Gainesville, Florida, USA
- Wei Huang
  - Department of Pharmacology, Case Western Reserve University School of Medicine, Cleveland, Ohio, USA
- Derek J Taylor
  - Department of Pharmacology, Case Western Reserve University School of Medicine, Cleveland, Ohio, USA
- Michael E Harris
  - Department of Chemistry, University of Florida, Gainesville, Florida, USA
2
Horlacher M, Cantini G, Hesse J, Schinke P, Goedert N, Londhe S, Moyon L, Marsico A. A systematic benchmark of machine learning methods for protein-RNA interaction prediction. Brief Bioinform 2023; 24:bbad307. [PMID: 37635383 PMCID: PMC10516373 DOI: 10.1093/bib/bbad307] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/31/2023] [Revised: 06/15/2023] [Accepted: 07/18/2023] [Indexed: 08/29/2023] Open
Abstract
RNA-binding proteins (RBPs) are central actors of RNA post-transcriptional regulation. Experiments to profile binding sites of RBPs in vivo are limited to transcripts expressed in the experimental cell type, creating the need for computational methods to infer missing binding information. While numerous machine learning-based methods have been developed for this task, their use of heterogeneous training and evaluation datasets across different sets of RBPs and CLIP-seq protocols makes a direct comparison of their performance difficult. Here, we compile a set of 37 machine learning (primarily deep learning) methods for in vivo RBP-RNA interaction prediction and systematically benchmark a subset of 11 representative methods across hundreds of CLIP-seq datasets and RBPs. Using homogenized sample pre-processing and two negative-class sample generation strategies, we evaluate methods in terms of predictive performance and assess the impact of neural network architectures and input modalities on model performance. We believe that this study will not only enable researchers to choose the optimal prediction method for their tasks at hand, but also aid method developers in developing novel, high-performing methods by introducing a standardized framework for their evaluation.
Affiliation(s)
- Marc Horlacher
  - Computational Health Center, Helmholtz Center Munich, Germany
  - School of Computation, Information and Technology, Technical University Munich (TUM), Germany
- Giulia Cantini
  - Computational Health Center, Helmholtz Center Munich, Germany
- Julian Hesse
  - Computational Health Center, Helmholtz Center Munich, Germany
- Patrick Schinke
  - Computational Health Center, Helmholtz Center Munich, Germany
- Nicolas Goedert
  - Computational Health Center, Helmholtz Center Munich, Germany
- Lambert Moyon
  - Computational Health Center, Helmholtz Center Munich, Germany
3
Cheng X, Li Z, Shan R, Li Z, Wang S, Zhao W, Zhang H, Chao L, Peng J, Fei T, Li W. Modeling CRISPR-Cas13d on-target and off-target effects using machine learning approaches. Nat Commun 2023; 14:752. [PMID: 36765063 PMCID: PMC9912244 DOI: 10.1038/s41467-023-36316-3] [Citation(s) in RCA: 27] [Impact Index Per Article: 13.5] [Received: 09/13/2021] [Accepted: 01/26/2023] [Indexed: 02/12/2023] Open
Abstract
A major challenge in the application of the CRISPR-Cas13d system is to accurately predict its guide-dependent on-target and off-target effects. Here, we perform CRISPR-Cas13d proliferation screens and design a deep learning model, named DeepCas13, to predict the on-target activity from guide sequences and secondary structures. DeepCas13 outperforms existing methods to predict the efficiency of guides targeting both protein-coding and non-coding RNAs. Guides targeting non-essential genes display off-target viability effects, which are closely related to their on-target efficiencies. Choosing proper negative control guides during normalization mitigates the associated false positives in proliferation screens. We apply DeepCas13 to the guides targeting lncRNAs, and identify lncRNAs that affect cell viability and proliferation in multiple cell lines. The higher prediction accuracy of DeepCas13 over existing methods is extensively confirmed via a secondary CRISPR-Cas13d screen and quantitative RT-PCR experiments. DeepCas13 is freely accessible via http://deepcas13.weililab.org.
Affiliation(s)
- Xiaolong Cheng
  - Center for Genetic Medicine Research, Children's National Hospital, Washington, DC, 20010, USA
  - Department of Genomics and Precision Medicine, George Washington University, Washington, DC, 20010, USA
- Zexu Li
  - National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, Northeastern University, Shenyang, 110819, China
- Ruocheng Shan
  - Center for Genetic Medicine Research, Children's National Hospital, Washington, DC, 20010, USA
  - Department of Computer Science, George Washington University, Washington, DC, 20052, USA
- Zihan Li
  - National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, Northeastern University, Shenyang, 110819, China
- Shengnan Wang
  - National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, Northeastern University, Shenyang, 110819, China
- Wenchang Zhao
  - National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, Northeastern University, Shenyang, 110819, China
- Han Zhang
  - National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, Northeastern University, Shenyang, 110819, China
- Lumen Chao
  - Center for Genetic Medicine Research, Children's National Hospital, Washington, DC, 20010, USA
  - Department of Genomics and Precision Medicine, George Washington University, Washington, DC, 20010, USA
- Jian Peng
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL, 61801, USA
- Teng Fei
  - National Frontiers Science Center for Industrial Intelligence and Systems Optimization, Key Laboratory of Bioresource Research and Development of Liaoning Province, College of Life and Health Sciences, Northeastern University, Shenyang, 110819, China
- Wei Li
  - Center for Genetic Medicine Research, Children's National Hospital, Washington, DC, 20010, USA
  - Department of Genomics and Precision Medicine, George Washington University, Washington, DC, 20010, USA
4
Koo PK, Ploenzke M, Anand P, Paul S, Majdandzic A. ResidualBind: Uncovering Sequence-Structure Preferences of RNA-Binding Proteins with Deep Neural Networks. Methods Mol Biol 2023; 2586:197-215. [PMID: 36705906 DOI: 10.1007/978-1-0716-2768-6_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]
Abstract
Deep neural networks have demonstrated improved performance at predicting sequence specificities of DNA- and RNA-binding proteins. However, it remains unclear why they perform better than previous methods that rely on k-mers and position weight matrices. Here, we highlight a recent deep learning-based software package, called ResidualBind, that analyzes RNA-protein interactions using only RNA sequence as an input feature and performs global importance analysis for model interpretability. We discuss practical considerations for model interpretability to uncover learned sequence motifs and their secondary structure preferences.
Affiliation(s)
- Peter K Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Matt Ploenzke
  - Department of Biostatistics, Harvard University, Cambridge, MA, USA
- Steffan Paul
  - Bioinformatics Program, Harvard Medical School, Boston, MA, USA
- Antonio Majdandzic
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
5
Zhang L, Lu C, Zeng M, Li Y, Wang J. CRMSS: predicting circRNA-RBP binding sites based on multi-scale characterizing sequence and structure features. Brief Bioinform 2023; 24:6889442. [PMID: 36511222 DOI: 10.1093/bib/bbac530] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/21/2022] [Revised: 11/01/2022] [Accepted: 11/07/2022] [Indexed: 12/14/2022] Open
Abstract
Circular RNAs (circRNAs) are reverse-spliced and covalently closed RNAs. Their interactions with RNA-binding proteins (RBPs) have multiple effects on the progress of many diseases. Several computational methods have been proposed to identify RBP binding sites on circRNAs, but they suffer from insufficient accuracy, robustness and explainability. In this study, we first take the characteristics of both RNA and RBP into consideration. We propose a method for discriminating circRNA-RBP binding sites based on multi-scale characterizing sequence and structure features, called CRMSS. For circRNAs, we use sequence $k$-mer embedding and the forming probabilities of local secondary structures as features. For RBPs, we combine sequence and structure frequencies of RNA-binding domain regions to generate features. We capture binding patterns with multi-scale residual blocks. With BiLSTM and attention mechanism, we obtain the contextual information of high-level representation for circRNA-RBP binding. To validate the effectiveness of CRMSS, we compare its predictive performance with other methods on 37 RBPs. Taking the properties of both circRNAs and RBPs into account, CRMSS achieves superior performance over state-of-the-art methods. In the case study, our model provides reliable predictions and correctly identifies experimentally verified circRNA-RBP pairs. The code of CRMSS is freely available at https://github.com/BioinformaticsCSU/CRMSS.
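Annotation: the abstract above mentions sequence k-mer embedding but does not say how sequences are tokenized. As a minimal sketch of the step that usually precedes such an embedding, the following splits a sequence into overlapping k-mers. The function name `kmer_tokens`, the stride of 1, and the T-to-U normalization are assumptions made for illustration, not details taken from CRMSS.

```python
from typing import List


def kmer_tokens(sequence: str, k: int = 3) -> List[str]:
    """Split an RNA sequence into overlapping k-mers (stride 1).

    Hypothetical helper for illustration only; not part of CRMSS.
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")
    # Normalize case and map DNA-style T to RNA-style U (an assumption).
    sequence = sequence.upper().replace("T", "U")
    # A sequence shorter than k yields an empty token list.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


if __name__ == "__main__":
    print(kmer_tokens("ACGUAC", k=3))
```

Each token would then typically be mapped to an integer index or a learned vector before being passed to a model.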
Affiliation(s)
- Lishen Zhang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, China
- Chengqian Lu
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, China
- Min Zeng
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, China
- Yaohang Li
  - Department of Computer Science at Old Dominion University, USA
- Jianxin Wang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, China
6
Wu Z, Basu S, Wu X, Kurgan L. qNABpredict: Quick, accurate, and taxonomy-aware sequence-based prediction of content of nucleic acid binding amino acids. Protein Sci 2023; 32:e4544. [PMID: 36519304 PMCID: PMC9798252 DOI: 10.1002/pro.4544] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 09/16/2022] [Revised: 12/07/2022] [Accepted: 12/08/2022] [Indexed: 12/23/2022]
Abstract
Protein sequence-based predictors of nucleic acid (NA)-binding include methods that predict NA-binding proteins and NA-binding residues. The residue-level tools produce more details but suffer from high computational cost since they must predict every amino acid in the input sequence and rely on multiple sequence alignments. We propose an alternative approach that predicts content (fraction) of the NA-binding residues, offering more information than the protein-level prediction and much shorter runtime than the residue-level tools. Our first-of-its-kind content predictor, qNABpredict, relies on a small, rationally designed and fast-to-compute feature set that represents relevant characteristics extracted from the input sequence and a well-parametrized support vector regression model. We provide two versions of qNABpredict, a taxonomy-agnostic model that can be used for proteins of unknown taxonomic origin and more accurate taxonomy-aware models that are tailored to specific taxonomic kingdoms: archaea, bacteria, eukaryota, and viruses. Empirical tests on a low-similarity test dataset show that qNABpredict is 100 times faster and generates statistically more accurate content predictions when compared to the content extracted from results produced by the residue-level predictors. We also show that qNABpredict's content predictions can be used to improve results generated by the residue-level predictors. We release qNABpredict as a convenient webserver and source code at http://biomine.cs.vcu.edu/servers/qNABpredict/. This new tool should be particularly useful to predict details of protein-NA interactions for large protein families and proteomes.
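Annotation: the abstract defines "content" as the fraction of NA-binding residues in a protein and compares qNABpredict against content extracted from residue-level predictions. The sketch below shows only that baseline extraction, assuming binary per-residue labels (1 = binding, 0 = non-binding). The function name `binding_content` is hypothetical, and this is not the support vector regression model that qNABpredict itself uses.

```python
from typing import Sequence


def binding_content(residue_labels: Sequence[int]) -> float:
    """Return the fraction of residues labeled as nucleic acid binding.

    Assumes one binary label per residue; illustration only.
    """
    if len(residue_labels) == 0:
        raise ValueError("at least one residue label is required")
    n_binding = sum(1 for label in residue_labels if label)
    return n_binding / len(residue_labels)


if __name__ == "__main__":
    # One binding residue out of four.
    print(binding_content([0, 1, 0, 0]))
```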
Affiliation(s)
- Zhonghua Wu
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Sushmita Basu
  - Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, USA
- Xuantai Wu
  - School of Mathematical Sciences and LPMC, Nankai University, Tianjin, China
- Lukasz Kurgan
  - Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, USA
7
Ramstein GP, Buckler ES. Prediction of evolutionary constraint by genomic annotations improves functional prioritization of genomic variants in maize. Genome Biol 2022; 23:183. [PMID: 36050782 PMCID: PMC9438327 DOI: 10.1186/s13059-022-02747-2] [Citation(s) in RCA: 13] [Impact Index Per Article: 4.3] [Received: 01/18/2022] [Accepted: 08/15/2022] [Indexed: 11/10/2022] Open
Abstract
Background: Crop improvement through cross-population genomic prediction and genome editing requires identification of causal variants at high resolution, within fewer than hundreds of base pairs. Most genetic mapping studies have generally lacked such resolution. In contrast, evolutionary approaches can detect genetic effects at high resolution, but they are limited by shifting selection, missing data, and low depth of multiple-sequence alignments. Here we use genomic annotations to accurately predict nucleotide conservation across angiosperms, as a proxy for fitness effect of mutations.
Results: Using only sequence analysis, we annotate nonsynonymous mutations in 25,824 maize gene models, with information from bioinformatics and deep learning. Our predictions are validated by experimental information: within-species conservation, chromatin accessibility, and gene expression. According to gene ontology and pathway enrichment analyses, predicted nucleotide conservation points to genes in central carbon metabolism. Importantly, it improves genomic prediction for fitness-related traits such as grain yield, in elite maize panels, by stringent prioritization of fewer than 1% of single-site variants.
Conclusions: Our results suggest that predicting nucleotide conservation across angiosperms may effectively prioritize sites most likely to impact fitness-related traits in crops, without being limited by shifting selection, missing data, and low depth of multiple-sequence alignments. Our approach, Prediction of mutation Impact by Calibrated Nucleotide Conservation (PICNC), could be useful to select polymorphisms for accurate genomic prediction, and candidate mutations for efficient base editing. The trained PICNC models and predicted nucleotide conservation at protein-coding SNPs in maize are publicly available in CyVerse (10.25739/hybz-2957).
Supplementary Information: The online version contains supplementary material available at 10.1186/s13059-022-02747-2.
Affiliation(s)
- Guillaume P Ramstein
  - Center for Quantitative Genetics and Genomics, Aarhus University, 8000, Aarhus, Denmark
  - Institute for Genomic Diversity, Cornell University, Ithaca, NY, 14853, USA
- Edward S Buckler
  - Institute for Genomic Diversity, Cornell University, Ithaca, NY, 14853, USA
  - USDA-ARS, Ithaca, NY, 14853, USA
8
Laverty KU, Jolma A, Pour SE, Zheng H, Ray D, Morris Q, Hughes TR. PRIESSTESS: interpretable, high-performing models of the sequence and structure preferences of RNA-binding proteins. Nucleic Acids Res 2022; 50:e111. [PMID: 36018788 DOI: 10.1093/nar/gkac694] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 11/17/2021] [Revised: 07/22/2022] [Accepted: 08/03/2022] [Indexed: 12/23/2022] Open
Abstract
Modelling both primary sequence and secondary structure preferences for RNA binding proteins (RBPs) remains an ongoing challenge. Current models use varied RNA structure representations and can be difficult to interpret and evaluate. To address these issues, we present a universal RNA motif-finding/scanning strategy, termed PRIESSTESS (Predictive RBP-RNA InterpretablE Sequence-Structure moTif regrESSion), that can be applied to diverse RNA binding datasets. PRIESSTESS identifies dozens of enriched RNA sequence and/or structure motifs that are subsequently reduced to a set of core motifs by logistic regression with LASSO regularization. Importantly, these core motifs are easily visualized and interpreted, and provide a measure of RBP secondary structure specificity. We used PRIESSTESS to interrogate new HTR-SELEX data for 23 RBPs with diverse RNA binding modes and captured known primary sequence and secondary structure preferences for each. Moreover, when applying PRIESSTESS to 144 RBPs across 202 RNA binding datasets, 75% showed an RNA secondary structure preference but only 10% had a preference besides unpaired bases, suggesting that most RBPs simply recognize the accessibility of primary sequences.
Affiliation(s)
- Kaitlin U Laverty
  - Department of Molecular Genetics, University of Toronto, Toronto, Canada
- Arttu Jolma
  - Department of Molecular Genetics, University of Toronto, Toronto, Canada
  - Donnelly Centre, University of Toronto, Toronto, Canada
- Sara E Pour
  - Department of Molecular Genetics, University of Toronto, Toronto, Canada
- Hong Zheng
  - Donnelly Centre, University of Toronto, Toronto, Canada
- Debashish Ray
  - Donnelly Centre, University of Toronto, Toronto, Canada
- Quaid Morris
  - Department of Molecular Genetics, University of Toronto, Toronto, Canada
  - Computational and Systems Biology, Memorial Sloan Kettering Cancer Center, New York, USA
- Timothy R Hughes
  - Department of Molecular Genetics, University of Toronto, Toronto, Canada
  - Donnelly Centre, University of Toronto, Toronto, Canada
9
Ma H, Wen H, Xue Z, Li G, Zhang Z. RNANetMotif: Identifying sequence-structure RNA network motifs in RNA-protein binding sites. PLoS Comput Biol 2022; 18:e1010293. [PMID: 35819951 PMCID: PMC9275694 DOI: 10.1371/journal.pcbi.1010293] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 09/16/2021] [Accepted: 06/09/2022] [Indexed: 11/19/2022] Open
Abstract
RNA molecules can adopt stable secondary and tertiary structures, which are essential in mediating physical interactions with other partners such as RNA binding proteins (RBPs) and in carrying out their cellular functions. In vitro and in vivo experiments such as RNAcompete and eCLIP have revealed in vitro binding preferences of RBPs to RNA oligomers and in vivo binding sites in cells. Analysis of these binding data showed that the structure properties of the RNAs in these binding sites are important determinants of the binding events; however, it has been a challenge to incorporate the structure information into an interpretable model. Here we describe a new approach, RNANetMotif, which takes predicted secondary structure of thousands of RNA sequences bound by an RBP as input and uses a graph theory approach to recognize enriched subgraphs. These enriched subgraphs are in essence shared sequence-structure elements that are important in RBP-RNA binding. To validate our approach, we performed RNA structure modeling via coarse-grained molecular dynamics folding simulations for 4 selected RBPs, and RNA-protein docking for LIN28B. The simulation results, e.g., solvent accessibility and energetics, further support the biological relevance of the discovered network subgraphs.
RNA binding proteins (RBPs) regulate every aspect of RNA biology, including splicing, translation, transportation, and degradation. High-throughput technologies such as eCLIP have identified thousands of binding sites for a given RBP throughout the genome. It has been shown by earlier studies that, in addition to nucleotide sequences, the structure and conformation of RNAs also play an important role in RBP-RNA interactions. Analogous to protein-protein interactions or protein-DNA interactions, it is likely that there exist intrinsic sequence-structure motifs common to these RNAs that underlie their binding specificity to specific RBPs. It is known that RNAs form energetically favorable secondary structures, which can be represented as graphs, with nucleotides being nodes and backbone covalent bonds and base-pairing hydrogen bonds representing edges. We hypothesize that these graphs can be mined by graph theory approaches to identify sequence-structure motifs as enriched subgraphs. In this article, we describe the details of this approach, termed RNANetMotif, and associated new concepts, namely EKS (Extended K-mer Subgraph) and the GraphK graph algorithm. To test the utility of our approach, we conducted 3D structure modeling of selected RNA sequences through molecular dynamics (MD) folding simulation and evaluated the significance of the discovered RNA motifs by comparing their spatial exposure with other regions on the RNA. We believe that this approach has the novelty of treating the RNA sequence as a graph and RBP binding sites as enriched subgraphs, which has broader applications beyond RBP-RNA interactions.
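Annotation: the graph representation described above (nucleotides as nodes, backbone bonds and base pairs as edges) can be sketched as follows. The sketch assumes the secondary structure is supplied in dot-bracket notation without pseudoknots, which the abstract does not specify. The function name `structure_to_graph` and the edge labels are hypothetical, and the code does not implement the EKS or GraphK concepts.

```python
from typing import List, Tuple

Edge = Tuple[int, int, str]


def structure_to_graph(seq: str, dot_bracket: str) -> Tuple[List[str], List[Edge]]:
    """Build a node list and labeled edge list from a dot-bracket structure.

    Nodes are nucleotides (indexed by position); edges are either
    "backbone" (i, i + 1) or "basepair" (i, j). Illustration only.
    """
    if len(seq) != len(dot_bracket):
        raise ValueError("sequence and structure must have equal length")
    nodes = list(seq)
    # Backbone edges connect consecutive nucleotides.
    edges: List[Edge] = [(i, i + 1, "backbone") for i in range(len(seq) - 1)]
    # Base-pair edges are recovered by matching parentheses with a stack.
    stack: List[int] = []
    for i, symbol in enumerate(dot_bracket):
        if symbol == "(":
            stack.append(i)
        elif symbol == ")":
            if not stack:
                raise ValueError("unbalanced structure: extra ')'")
            edges.append((stack.pop(), i, "basepair"))
        elif symbol != ".":
            raise ValueError(f"unsupported symbol: {symbol!r}")
    if stack:
        raise ValueError("unbalanced structure: extra '('")
    return nodes, edges


if __name__ == "__main__":
    nodes, edges = structure_to_graph("GGGAAACCC", "(((...)))")
    print(len(nodes), len(edges))
```

For a 9-nucleotide hairpin with three base pairs this gives 8 backbone edges and 3 base-pair edges; a subgraph-mining step would operate on such edge lists.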
Affiliation(s)
- Hongli Ma
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
  - School of Mathematics, Shandong University, Jinan, China
- Han Wen
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
- Zhiyuan Xue
  - West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, China
- Guojun Li
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao, China
  - School of Mathematics, Shandong University, Jinan, China
  - School of Mathematical Science, Liaocheng University, Liaocheng, China
- Zhaolei Zhang
  - Terrence Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
10
Du X, Zhao X, Zhang Y. DeepBtoD: Improved RNA-binding proteins prediction via integrated deep learning. J Bioinform Comput Biol 2022; 20:2250006. [PMID: 35451938 DOI: 10.1142/s0219720022500068] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/18/2022]
Abstract
RNA-binding proteins (RBPs) have crucial roles in various cellular processes such as alternative splicing and gene regulation. Therefore, the analysis and identification of RBPs is an essential issue. However, although many computational methods have been developed for predicting RBPs, few studies simultaneously consider local and global information from the perspective of the RNA sequence. Facing this challenge, we present a novel method called DeepBtoD, which predicts RBPs directly from RNA sequences. First, a [Formula: see text]-BtoD encoding is designed, which takes into account the composition of [Formula: see text]-nucleotides and their relative positions and forms a local module. Second, we designed a multi-scale convolutional module embedded with a self-attentive mechanism, the ms-focusCNN, which is used to further learn more effective, diverse, and discriminative high-level features. Finally, global information is considered to supplement local modules with ensemble learning to predict whether the target RNA binds to RBPs. Preliminary results on 24 independent test datasets show that our proposed method can classify RBPs with an area under the curve of 0.933. Remarkably, DeepBtoD shows competitive results against seven state-of-the-art methods, suggesting that RBPs can be recognized with high accuracy by integrating local [Formula: see text]-BtoD and global information only from RNA sequences. Hence, our integrative method may be useful to improve the power of RBP prediction, which might be particularly useful for modeling protein-nucleic acid interactions in systems biology studies. Our DeepBtoD server can be accessed at http://175.27.228.227/DeepBtoD/.
Affiliation(s)
- XiuQuan Du
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, Anhui, P. R. China
  - School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, P. R. China
- XiuJuan Zhao
  - School of Computer Science and Technology, Anhui University, Hefei 230601, Anhui, P. R. China
- YanPing Zhang
  - Key Laboratory of Intelligent Computing and Signal Processing of Ministry of Education, Anhui University, Hefei 230601, Anhui, P. R. China
11
Peng L, Tan J, Tian X, Zhou L. EnANNDeep: An Ensemble-based lncRNA-protein Interaction Prediction Framework with Adaptive k-Nearest Neighbor Classifier and Deep Models. Interdiscip Sci 2022; 14:209-232. [PMID: 35006529 DOI: 10.1007/s12539-021-00483-y] [Citation(s) in RCA: 25] [Impact Index Per Article: 8.3] [Received: 07/06/2021] [Revised: 09/14/2021] [Accepted: 09/15/2021] [Indexed: 01/08/2023]
Abstract
Prediction of lncRNA-protein interactions (LPIs) can deepen the understanding of many important biological processes. Artificial intelligence methods have reported many possible LPIs. However, most computational techniques were evaluated mainly on one dataset, which may produce prediction bias. More importantly, they were validated only under cross validation on lncRNA-protein pairs, and did not consider the performance under cross validations on lncRNAs and proteins, and thus fail to search related proteins/lncRNAs for a new lncRNA/protein. Under an ensemble learning framework (EnANNDeep) composed of adaptive k-nearest neighbor classifier and Deep models, this study focuses on systematically finding underlying linkages between lncRNAs and proteins. First, five LPI-related datasets are arranged. Second, multiple source features are integrated to depict an lncRNA-protein pair. Third, adaptive k-nearest neighbor classifier, deep neural network, and deep forest are designed to score unknown lncRNA-protein pairs, respectively. Finally, interaction probabilities from the three predictors are integrated based on a soft voting technique. Compared with five classical LPI identification models (SFPEL, PMDKN, CatBoost, PLIPCOM, and LPI-SKF) under fivefold cross validations on lncRNAs, proteins, and LPIs, EnANNDeep computes the best average AUCs of 0.8660, 0.8775, and 0.9166, respectively, and the best average AUPRs of 0.8545, 0.8595, and 0.9054, respectively, indicating its superior LPI prediction ability. Case study analyses indicate that SNHG10 may have dense linkage with Q15717. In the ensemble framework, adaptive k-nearest neighbor classifier can separately pick the most appropriate k for each query lncRNA-protein pair. More importantly, deep models including deep neural network and deep forest can effectively learn the representative features of lncRNAs and proteins.
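Annotation: the abstract states that interaction probabilities from three predictors are integrated by soft voting, without giving the weighting. A minimal sketch of soft voting as a (optionally weighted) average of probabilities follows. The function name `soft_vote`, the equal default weights, and the example probabilities are assumptions for illustration, not values or code from EnANNDeep.

```python
from typing import Optional, Sequence


def soft_vote(probabilities: Sequence[float],
              weights: Optional[Sequence[float]] = None) -> float:
    """Combine per-model interaction probabilities by weighted averaging.

    With weights=None every model counts equally (an assumption).
    """
    if len(probabilities) == 0:
        raise ValueError("at least one probability is required")
    if weights is None:
        weights = [1.0] * len(probabilities)
    if len(weights) != len(probabilities):
        raise ValueError("weights and probabilities must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(p * w for p, w in zip(probabilities, weights)) / total_weight


if __name__ == "__main__":
    # Hypothetical scores for one lncRNA-protein pair from three predictors.
    print(soft_vote([0.9, 0.6, 0.3]))
```

The combined score would then be thresholded or ranked to call an interaction; the threshold is likewise not given in the abstract.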
Affiliation(s)
- Lihong Peng
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
  - College of Life Sciences and Chemistry, Hunan University of Technology, Zhuzhou, China
- Jingwei Tan
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
- Xiongfei Tian
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
- Liqian Zhou
  - School of Computer Science, Hunan University of Technology, Zhuzhou, China
12
|
Wei J, Chen S, Zong L, Gao X, Li Y. Protein-RNA interaction prediction with deep learning: structure matters. Brief Bioinform 2022; 23:bbab540. [PMID: 34929730 PMCID: PMC8790951 DOI: 10.1093/bib/bbab540] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 09/29/2021] [Revised: 11/14/2021] [Accepted: 11/22/2021] [Indexed: 12/11/2022] Open
Abstract
Protein-RNA interactions are of vital importance to a variety of cellular activities. Both experimental and computational techniques have been developed to study these interactions. Because of the limitations of previous databases, especially the lack of protein structure data, most existing computational methods rely heavily on sequence data, with only a small portion utilizing structural information. Recently, AlphaFold has revolutionized the entire protein and biology field. Foreseeably, protein-RNA interaction prediction will also advance significantly in the upcoming years. In this work, we give a thorough review of this field, surveying both the binding site and binding preference prediction problems and covering the commonly used datasets, features and models. We also point out the potential challenges and opportunities in this field. This survey summarizes the past development of the RNA-binding protein-RNA interaction field and foresees its future development in the post-AlphaFold era.
Affiliation(s)
- Junkang Wei
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), 999077, Hong Kong SAR, China
- Siyuan Chen
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), 23955-6900, Thuwal, Saudi Arabia
- Licheng Zong
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), 999077, Hong Kong SAR, China
- Xin Gao
  - Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), 23955-6900, Thuwal, Saudi Arabia
- Yu Li
  - Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), 999077, Hong Kong SAR, China
  - The CUHK Shenzhen Research Institute, Hi-Tech Park, 518057, Shenzhen, China
|
13
|
Zhao S, Hamada M. Multi-resBind: a residual network-based multi-label classifier for in vivo RNA binding prediction and preference visualization. BMC Bioinformatics 2021; 22:554. [PMID: 34781902 PMCID: PMC8594109 DOI: 10.1186/s12859-021-04430-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/24/2021] [Accepted: 10/06/2021] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Protein-RNA interactions play key roles in many processes regulating gene expression. To understand the underlying binding preference, ultraviolet cross-linking and immunoprecipitation (CLIP)-based methods have been used to identify the binding sites for hundreds of RNA-binding proteins (RBPs) in vivo. Using these large-scale experimental data to infer RNA binding preference and predict missing binding sites has become a great challenge. Some existing deep-learning models have demonstrated high prediction accuracy for individual RBPs. However, it remains difficult to avoid significant bias due to the experimental protocol. The DeepRiPe method was recently developed to solve this problem by introducing multi-task or multi-label learning into this field. However, this method has not reached an ideal level of prediction power due to its weak neural network architecture. RESULTS Compared to the DeepRiPe approach, our Multi-resBind method demonstrated substantial improvements on the same large-scale PAR-CLIP dataset, with increases in both the area under the receiver operating characteristic curve and average precision. We conducted extensive experiments to evaluate the impact of various types of input data on the final prediction accuracy. The same approach was used to evaluate the effect of loss functions. Finally, a modified integrated gradient was employed to generate attribution maps. The patterns disentangled from relative contributions according to context offer biological insights into the underlying mechanism of protein-RNA interactions. CONCLUSIONS Here, we propose Multi-resBind as a new multi-label deep-learning approach to infer protein-RNA binding preferences and predict novel interactions. The results clearly demonstrate that Multi-resBind is a promising tool to predict unknown binding sites in vivo and gain biological insights into why the neural network makes a given prediction.
Affiliation(s)
- Shitao Zhao
  - Waseda Research Institute for Science and Engineering, Waseda University, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan
- Michiaki Hamada
  - Department of Electrical Engineering and Bioscience, Faculty of Science and Engineering, Waseda University, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), National Institute of Advanced Industrial Science and Technology, 3-4-1 Okubo Shinjuku-ku, Tokyo, 169-8555, Japan
  - Graduate School of Medicine, Nippon Medical School, 1-1-5 Sendagi, Bunkyo-ku, Tokyo, 113-8602, Japan
|
14
|
Niu M, Wu J, Zou Q, Liu Z, Xu L. rBPDL: Predicting RNA-Binding Proteins Using Deep Learning. IEEE J Biomed Health Inform 2021; 25:3668-3676. [PMID: 33780344 DOI: 10.1109/jbhi.2021.3069259] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Indexed: 11/10/2022]
Abstract
RNA-binding proteins (RBPs) are powerful and wide-ranging regulators that play an important role in cell development, differentiation, metabolism, health and disease. The prediction of RBPs provides valuable guidance for biologists. Although experimental methods have made great progress in identifying RBPs, they are time-consuming and inflexible. Therefore, we developed a network model, rBPDL, by combining a convolutional neural network and long short-term memory for multilabel classification of RBPs. Moreover, to achieve better prediction results, we used a voting algorithm for ensemble learning of the model. We compared rBPDL with state-of-the-art methods and found that rBPDL significantly improved identification performance for the RBP68 dataset, with a macro-Area Under Curve (AUC), micro-AUC, and weighted AUC of 0.936, 0.962, and 0.946, respectively. Furthermore, through AUC statistical analysis of the RBP domain, we analyzed the performance of rBPDL and found that the RBP identification performance in the same domain was similar. In addition, we analyzed the performance preferences and physicochemical properties of the binding protein amino acids and explored the characteristics that affect the binding by using the RBP86 dataset.
|
15
|
Koo PK, Majdandzic A, Ploenzke M, Anand P, Paul SB. Global importance analysis: An interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput Biol 2021; 17:e1008925. [PMID: 33983921 PMCID: PMC8118286 DOI: 10.1371/journal.pcbi.1008925] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 08/18/2020] [Accepted: 03/30/2021] [Indexed: 12/15/2022] Open
Abstract
Deep neural networks (DNNs) have demonstrated improved performance at predicting the sequence specificities of DNA- and RNA-binding proteins compared to previous methods that rely on k-mers and position weight matrices. To gain insights into why a DNN makes a given prediction, model interpretability methods, such as attribution methods, can be employed to identify motif-like representations along a given sequence. Because explanations are given on an individual sequence basis and can vary substantially across sequences, deducing generalizable trends across the dataset and quantifying their effect size remains a challenge. Here we introduce global importance analysis (GIA), a model interpretability method that quantifies the population-level effect size that putative patterns have on model predictions. GIA provides an avenue to quantitatively test hypotheses of putative patterns and their interactions with other patterns, as well as map out specific functions the network has learned. As a case study, we demonstrate the utility of GIA on the computational task of predicting RNA-protein interactions from sequence. We first introduce a convolutional network that we call ResidualBind and benchmark its performance against previous methods on RNAcompete data. Using GIA, we then demonstrate that in addition to sequence motifs, ResidualBind learns a model that considers the number of motifs, their spacing, and sequence context, such as RNA secondary structure and GC-bias.
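The idea of a population-level effect size can be illustrated with a minimal sketch. It assumes, as one reading of the approach, that the importance of a pattern is estimated as the mean model prediction over background sequences with the pattern embedded minus the mean prediction over the same backgrounds without it. The names `embed`, `global_importance`, and `toy_model` are hypothetical, and the motif-counting scorer is only a stand-in for a trained network such as ResidualBind.

```python
from typing import Callable, Sequence


def embed(seq: str, pattern: str, pos: int) -> str:
    """Overwrite seq with pattern starting at pos, keeping the length fixed."""
    if pos < 0 or pos + len(pattern) > len(seq):
        raise ValueError("pattern does not fit at this position")
    return seq[:pos] + pattern + seq[pos + len(pattern):]


def global_importance(
    model: Callable[[str], float],
    backgrounds: Sequence[str],
    pattern: str,
    pos: int,
) -> float:
    """Mean prediction with the pattern embedded minus mean without it."""
    if not backgrounds:
        raise ValueError("need at least one background sequence")
    n = len(backgrounds)
    baseline = sum(model(s) for s in backgrounds) / n
    with_pattern = sum(model(embed(s, pattern, pos)) for s in backgrounds) / n
    return with_pattern - baseline


def toy_model(seq: str) -> float:
    """Stand-in scorer (assumption): counts occurrences of one motif."""
    return float(seq.count("UGCAUG"))


# Hypothetical background population; in practice these would be sampled
# so that they resemble the data the model was trained on.
backgrounds = ["A" * 20, "C" * 20, "G" * 20]
effect = global_importance(toy_model, backgrounds, "UGCAUG", pos=5)
```

Because the difference is averaged over many backgrounds, the estimate reflects the effect of the pattern itself rather than the idiosyncrasies of any single sequence.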
Affiliation(s)
- Peter K. Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Antonio Majdandzic
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Matthew Ploenzke
  - Department of Biostatistics, Harvard University, Cambridge, Massachusetts, United States of America
- Praveen Anand
  - Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
- Steffan B. Paul
  - Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, Massachusetts, United States of America
|
16
|
Abstract
Deep neural networks have been revolutionizing the field of machine learning for the past several years. They have been applied with great success in many domains of the biomedical data sciences and are outperforming extant methods by a large margin. The ability of deep neural networks to pick up local image features and model the interactions between them makes them highly applicable to regulatory genomics. Instead of an image, the networks analyze DNA and RNA sequences and additional epigenomic data. In this review, we survey the successes of deep learning in the field of regulatory genomics. We first describe the fundamental building blocks of deep neural networks, popular architectures used in regulatory genomics, and their training process on molecular sequence data. We then review several key methods in different gene regulation domains. We start with the pioneering method DeepBind and its successors, which were developed to predict protein–DNA binding. We then review methods developed to predict and model epigenetic information, such as histone marks and nucleosome occupancy. Following epigenomics, we review methods to predict protein–RNA binding with its unique challenge of incorporating RNA structure information. Finally, we provide our overall view of the strengths and weaknesses of deep neural networks and prospects for future developments.
Affiliation(s)
- Mira Barshai
  - School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
- Eitamar Tripto
  - Department of Biomedical Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
- Yaron Orenstein
  - School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
|
17
|
Wekesa JS, Meng J, Luan Y. A deep learning model for plant lncRNA-protein interaction prediction with graph attention. Mol Genet Genomics 2020; 295:1091-1102. [DOI: 10.1007/s00438-020-01682-w] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 04/04/2020] [Accepted: 05/01/2020] [Indexed: 02/06/2023]
|
18
|
Wekesa JS, Meng J, Luan Y. Multi-feature fusion for deep learning to predict plant lncRNA-protein interaction. Genomics 2020; 112:2928-2936. [PMID: 32437848 DOI: 10.1016/j.ygeno.2020.05.005] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 12/17/2019] [Revised: 04/22/2020] [Accepted: 05/05/2020] [Indexed: 12/28/2022]
Abstract
Long non-coding RNAs (lncRNAs) play key roles in regulating cellular biological processes through diverse molecular mechanisms, including binding to RNA-binding proteins. The majority of plant lncRNAs are functionally uncharacterized; thus, accurate prediction of plant lncRNA-protein interactions is imperative for subsequent functional studies. We present an integrative model, namely DRPLPI. Its distinguishing feature is that it predicts by multi-feature fusion. Structural features and four groups of sequence features are used, including tri-nucleotide composition, gapped k-mer, recursive complement and binary profile. We design a multi-head self-attention long short-term memory encoder-decoder network to extract generative high-level features. To obtain robust results, DRPLPI combines categorical boosting and extra trees into a single meta-learner. Experiments on Zea mays and Arabidopsis thaliana obtained 0.9820 and 0.9652 area under the precision/recall curve (AUPRC), respectively. The proposed method shows significant enhancement in prediction performance compared with existing state-of-the-art methods.
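Of the sequence features listed in this abstract, tri-nucleotide composition is the simplest to illustrate. The sketch below assumes it means the normalized frequency of each of the 64 possible overlapping 3-mers in an RNA sequence; the function name `trinucleotide_composition` and the handling of non-standard characters are assumptions, not details taken from DRPLPI.

```python
from itertools import product
from typing import Dict


def trinucleotide_composition(seq: str, alphabet: str = "ACGU") -> Dict[str, float]:
    """Return the frequency of every overlapping 3-mer in an RNA sequence.

    DNA input is accepted by mapping T to U. Windows containing characters
    outside the alphabet (e.g. N) are skipped, which is an assumption.
    """
    seq = seq.upper().replace("T", "U")
    counts = {"".join(p): 0 for p in product(alphabet, repeat=3)}
    total = 0
    for i in range(len(seq) - 2):
        kmer = seq[i:i + 3]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    if total == 0:
        # Sequence too short or entirely non-standard: all-zero vector.
        return {k: 0.0 for k in counts}
    return {k: c / total for k, c in counts.items()}


# "AUGAUG" has four overlapping windows: AUG, UGA, GAU, AUG.
features = trinucleotide_composition("AUGAUG")
```

The result is a fixed-length 64-dimensional vector regardless of sequence length, which is what makes this kind of composition feature convenient as input to a downstream classifier.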
Affiliation(s)
- Jael Sanyanda Wekesa
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China
  - School of Computing and Information Technology, Jomo Kenyatta University of Agriculture and Technology, Nairobi 62000-00200, Kenya
- Jun Meng
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116023, China
- Yushi Luan
  - School of Bioengineering, Dalian University of Technology, Dalian, Liaoning 116023, China
|
19
|
Su Y, Ko ME, Cheng H, Zhu R, Xue M, Wang J, Lee JW, Frankiw L, Xu A, Wong S, Robert L, Takata K, Yuan D, Lu Y, Huang S, Ribas A, Levine R, Nolan GP, Wei W, Plevritis SK, Li G, Baltimore D, Heath JR. Multi-omic single-cell snapshots reveal multiple independent trajectories to drug tolerance in a melanoma cell line. Nat Commun 2020; 11:2345. [PMID: 32393797 PMCID: PMC7214418 DOI: 10.1038/s41467-020-15956-9] [Citation(s) in RCA: 68] [Impact Index Per Article: 13.6] [Received: 04/03/2019] [Accepted: 04/02/2020] [Indexed: 12/12/2022] Open
Abstract
The determination of individual cell trajectories through a high-dimensional cell-state space is an outstanding challenge for understanding biological changes ranging from cellular differentiation to epigenetic responses of diseased cells upon drugging. We integrate experiments and theory to determine the trajectories that single BRAFV600E mutant melanoma cancer cells take between drug-naive and drug-tolerant states. Although single-cell omics tools can yield snapshots of the cell-state landscape, the determination of individual cell trajectories through that space can be confounded by stochastic cell-state switching. We assayed for a panel of signaling, phenotypic, and metabolic regulators at points across 5 days of drug treatment to uncover a cell-state landscape with two paths connecting drug-naive and drug-tolerant states. The trajectory a given cell takes depends upon the drug-naive level of a lineage-restricted transcription factor. Each trajectory exhibits unique druggable susceptibilities, thus updating the paradigm of adaptive resistance development in an isogenic cell population.
Affiliation(s)
- Yapeng Su
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, USA
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
  - Institute for Systems Biology, Seattle, Washington, USA
- Melissa E Ko
  - Cancer Biology Program, Stanford University School of Medicine, Stanford, California, USA
- Hanjun Cheng
  - Institute for Systems Biology, Seattle, Washington, USA
- Ronghui Zhu
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
- Min Xue
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, USA
  - Department of Chemistry, University of California, Riverside, Riverside, California, USA
- Jessica Wang
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
- Jihoon W Lee
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, USA
- Luke Frankiw
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
- Alexander Xu
  - Institute for Systems Biology, Seattle, Washington, USA
- Stephanie Wong
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
- Lidia Robert
  - Department of Medicine, University of California, Los Angeles, Los Angeles, California, USA
- Kaitlyn Takata
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
- Dan Yuan
  - Institute for Systems Biology, Seattle, Washington, USA
- Yue Lu
  - Institute for Systems Biology, Seattle, Washington, USA
- Sui Huang
  - Institute for Systems Biology, Seattle, Washington, USA
- Antoni Ribas
  - Department of Medicine, University of California, Los Angeles, Los Angeles, California, USA
  - Department of Molecular and Medical Pharmacology, UCLA, Los Angeles, California, USA
  - Department of Surgery, UCLA, Los Angeles, California, USA
  - Jonsson Comprehensive Cancer Center, UCLA, Los Angeles, California, USA
- Raphael Levine
  - Department of Molecular and Medical Pharmacology, UCLA, Los Angeles, California, USA
  - Jonsson Comprehensive Cancer Center, UCLA, Los Angeles, California, USA
  - The Fritz Haber Research Center, The Hebrew University, Jerusalem, Israel
- Garry P Nolan
  - Department of Microbiology and Immunology, Stanford University, Stanford, California, USA
- Wei Wei
  - Institute for Systems Biology, Seattle, Washington, USA
  - Department of Molecular and Medical Pharmacology, UCLA, Los Angeles, California, USA
  - Jonsson Comprehensive Cancer Center, UCLA, Los Angeles, California, USA
- Guideng Li
  - Center of Systems Medicine, Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
  - Suzhou Institute of Systems Medicine, Suzhou, China
- David Baltimore
  - Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, California, USA
- James R Heath
  - Division of Chemistry and Chemical Engineering, California Institute of Technology, Pasadena, California, USA
  - Institute for Systems Biology, Seattle, Washington, USA
  - Department of Molecular and Medical Pharmacology, UCLA, Los Angeles, California, USA
  - Jonsson Comprehensive Cancer Center, UCLA, Los Angeles, California, USA
|