1
Capitanchik C, Wilkins OG, Wagner N, Gagneur J, Ule J. From computational models of the splicing code to regulatory mechanisms and therapeutic implications. Nat Rev Genet 2025; 26:171-190. [PMID: 39358547 DOI: 10.1038/s41576-024-00774-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 08/27/2024] [Indexed: 10/04/2024]
Abstract
Since the discovery of RNA splicing and its role in gene expression, researchers have sought a set of rules, an algorithm or a computational model that could predict the splice isoforms, and their frequencies, produced from any transcribed gene in a specific cellular context. Over the past 30 years, these models have evolved from simple position weight matrices to deep-learning models capable of integrating sequence data across vast genomic distances. Most recently, new model architectures are moving the field closer to context-specific alternative splicing predictions, and advances in sequencing technologies are expanding the type of data that can be used to inform and interpret such models. Together, these developments are driving improved understanding of splicing regulatory mechanisms and emerging applications of the splicing code to the rational design of RNA- and splicing-based therapeutics.
Affiliation(s)
- Charlotte Capitanchik
  - The Francis Crick Institute, London, UK
  - UK Dementia Research Institute at King's College London, London, UK
  - Department of Basic and Clinical Neuroscience, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
- Oscar G Wilkins
  - The Francis Crick Institute, London, UK
  - UCL Queen Square Motor Neuron Disease Centre, Department of Neuromuscular Diseases, UCL Queen Square Institute of Neurology, UCL, London, UK
- Nils Wagner
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Helmholtz Association - Munich School for Data Science (MUDS), Munich, Germany
- Julien Gagneur
  - School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
  - Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany
  - Computational Health Center, Helmholtz Center Munich, Neuherberg, Germany
- Jernej Ule
  - The Francis Crick Institute, London, UK
  - UK Dementia Research Institute at King's College London, London, UK
  - Department of Basic and Clinical Neuroscience, Institute of Psychiatry, Psychology & Neuroscience, King's College London, London, UK
  - National Institute of Chemistry, Ljubljana, Slovenia
2
Wu D, Maus N, Jha A, Yang K, Wales-McGrath BD, Jewell S, Tangiyan A, Choi P, Gardner JR, Barash Y. Generative modeling for RNA splicing predictions and design. bioRxiv 2025:2025.01.20.633986. [PMID: 39896553 PMCID: PMC11785043 DOI: 10.1101/2025.01.20.633986] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/04/2025]
Abstract
Alternative splicing (AS) of pre-mRNA plays a crucial role in tissue-specific gene regulation, with disease implications due to splicing defects. Predicting and manipulating AS can therefore uncover new regulatory mechanisms and aid in therapeutics design. We introduce TrASPr+BOS, a generative AI model with Bayesian Optimization for predicting and designing RNA for tissue-specific splicing outcomes. TrASPr is a multi-transformer model that can handle different types of AS events and generalize to unseen cellular conditions. It then serves as an oracle, generating labeled data to train a Bayesian Optimization for Splicing (BOS) algorithm to design RNA for condition-specific splicing outcomes. We show TrASPr+BOS outperforms existing methods, enhancing tissue-specific AUPRC by up to 2.4 fold and capturing tissue-specific regulatory elements. We validate hundreds of predicted novel tissue-specific splicing variations and confirm new regulatory elements using dCas13. We envision TrASPr+BOS as a light yet accurate method researchers can probe or adopt for specific tasks.
Affiliation(s)
- Di Wu
  - Department of Computer and Information Science, School of Engineering, University of Pennsylvania
- Natalie Maus
  - Department of Computer and Information Science, School of Engineering, University of Pennsylvania
- Anupama Jha
  - Department of Genome Sciences, University of Washington
- Kevin Yang
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania
- San Jewell
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania
- Anna Tangiyan
  - Division of Cancer Pathobiology, The Children’s Hospital of Philadelphia
- Peter Choi
  - Department of Pathology & Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Division of Cancer Pathobiology, The Children’s Hospital of Philadelphia
- Jacob R. Gardner
  - Department of Computer and Information Science, School of Engineering, University of Pennsylvania
- Yoseph Barash
  - Department of Computer and Information Science, School of Engineering, University of Pennsylvania
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania
3
Yang K, Islas N, Jewell S, Wu D, Jha A, Radens C, Pleiss J, Lynch K, Barash Y, Choi P. Machine learning-optimized targeted detection of alternative splicing. Nucleic Acids Res 2025; 53:gkae1260. [PMID: 39727154 PMCID: PMC11797022 DOI: 10.1093/nar/gkae1260] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2024] [Revised: 10/31/2024] [Accepted: 12/10/2024] [Indexed: 12/28/2024] Open
Abstract
RNA sequencing (RNA-seq) is widely adopted for transcriptome analysis but has inherent biases that hinder the comprehensive detection and quantification of alternative splicing. To address this, we present an efficient targeted RNA-seq method that greatly enriches for splicing-informative junction-spanning reads. Local splicing variation sequencing (LSV-seq) utilizes multiplexed reverse transcription from highly scalable pools of primers anchored near splicing events of interest. Primers are designed using Optimal Prime, a novel machine learning algorithm trained on the performance of thousands of primer sequences. In experimental benchmarks, LSV-seq achieves high on-target capture rates and concordance with RNA-seq, while requiring significantly lower sequencing depth. Leveraging deep learning splicing code predictions, we used LSV-seq to target events with low coverage in GTEx RNA-seq data and newly discover hundreds of tissue-specific splicing events. Our results demonstrate the ability of LSV-seq to quantify splicing of events of interest at high-throughput and with exceptional sensitivity.
Affiliation(s)
- Kevin Yang
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA
  - Division of Cancer Pathobiology, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
- Nathaniel Islas
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA
- San Jewell
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Di Wu
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA
- Anupama Jha
  - Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Caleb M Radens
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Jeffrey A Pleiss
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY 14853, USA
- Kristen W Lynch
  - Department of Biochemistry and Biophysics, University of Pennsylvania, Philadelphia, PA 19104, USA
- Yoseph Barash
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA 19104, USA
- Peter S Choi
  - Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA 19104, USA
  - Division of Cancer Pathobiology, Children’s Hospital of Philadelphia, Philadelphia, PA 19104, USA
4
Daoud A, Ben-Hur A. The role of chromatin state in intron retention: A case study in leveraging large scale deep learning models. PLoS Comput Biol 2025; 21:e1012755. [PMID: 39792954 PMCID: PMC11756788 DOI: 10.1371/journal.pcbi.1012755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/24/2024] [Revised: 01/23/2025] [Accepted: 12/30/2024] [Indexed: 01/12/2025] Open
Abstract
Complex deep learning models trained on very large datasets have become key enabling tools for current research in natural language processing and computer vision. By providing pre-trained models that can be fine-tuned for specific applications, they enable researchers to create accurate models with minimal effort and computational resources. Large scale genomics deep learning models come in two flavors: the first are large language models of DNA sequences trained in a self-supervised fashion, similar to the corresponding natural language models; the second are supervised learning models that leverage large scale genomics datasets from ENCODE and other sources. We argue that these models are the equivalent of foundation models in natural language processing in their utility, as they encode within them chromatin state in its different aspects, providing useful representations that allow quick deployment of accurate models of gene regulation. We demonstrate this premise by leveraging the recently created Sei model to develop simple, interpretable models of intron retention, and demonstrate their advantage over models based on the DNA language model DNABERT-2. Our work also demonstrates the impact of chromatin state on the regulation of intron retention. Using representations learned by Sei, our model is able to discover the involvement of transcription factors and chromatin marks in regulating intron retention, providing better accuracy than a recently published custom model developed for this purpose.
Affiliation(s)
- Ahmed Daoud
  - Department of Computer Science, Colorado State University, Fort Collins, Colorado, United States of America
- Asa Ben-Hur
  - Department of Computer Science, Colorado State University, Fort Collins, Colorado, United States of America
5
Yang K, Islas N, Jewell S, Jha A, Radens CM, Pleiss JA, Lynch KW, Barash Y, Choi PS. Machine learning-optimized targeted detection of alternative splicing. bioRxiv 2024:2024.09.20.614162. [PMID: 39386495 PMCID: PMC11463589 DOI: 10.1101/2024.09.20.614162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/12/2024]
Abstract
RNA-sequencing (RNA-seq) is widely adopted for transcriptome analysis but has inherent biases which hinder the comprehensive detection and quantification of alternative splicing. To address this, we present an efficient targeted RNA-seq method that greatly enriches for splicing-informative junction-spanning reads. Local Splicing Variation sequencing (LSV-seq) utilizes multiplexed reverse transcription from highly scalable pools of primers anchored near splicing events of interest. Primers are designed using Optimal Prime, a novel machine learning algorithm trained on the performance of thousands of primer sequences. In experimental benchmarks, LSV-seq achieves high on-target capture rates and concordance with RNA-seq, while requiring significantly lower sequencing depth. Leveraging deep learning splicing code predictions, we used LSV-seq to target events with low coverage in GTEx RNA-seq data and newly discover hundreds of tissue-specific splicing events. Our results demonstrate the ability of LSV-seq to quantify splicing of events of interest at high-throughput and with exceptional sensitivity.
Affiliation(s)
- Kevin Yang
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
  - Division of Cancer Pathobiology, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
- Nathaniel Islas
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
- San Jewell
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
- Anupama Jha
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Caleb M. Radens
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
- Jeffrey A. Pleiss
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
- Kristen W. Lynch
  - Department of Biochemistry and Biophysics, University of Pennsylvania, Philadelphia, PA, USA
- Yoseph Barash
  - Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, USA
- Peter S. Choi
  - Department of Pathology & Laboratory Medicine, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
  - Division of Cancer Pathobiology, The Children’s Hospital of Philadelphia, Philadelphia, PA, USA
6
Wang D, Gazzara MR, Jewell S, Wales-McGrath B, Brown CD, Choi PS, Barash Y. A Deep Dive into Statistical Modeling of RNA Splicing QTLs Reveals New Variants that Explain Neurodegenerative Disease. bioRxiv 2024:2024.09.01.610696. [PMID: 39282456 PMCID: PMC11398334 DOI: 10.1101/2024.09.01.610696] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/22/2024]
Abstract
Genome-wide association studies (GWAS) have identified thousands of putative disease causing variants with unknown regulatory effects. Efforts to connect these variants with splicing quantitative trait loci (sQTLs) have provided functional insights, yet sQTLs reported by existing methods cannot explain many GWAS signals. We show current sQTL modeling approaches can be improved by considering alternative splicing representation, model calibration, and covariate integration. We then introduce MAJIQTL, a new pipeline for sQTL discovery. MAJIQTL includes two new statistical methods: a weighted multiple testing approach for sGene discovery and a model for sQTL effect size inference to improve variant prioritization. By applying MAJIQTL to GTEx, we find significantly more sGenes harboring sQTLs with functional significance. Notably, our analysis implicates the novel variant rs582283 in Alzheimer's disease. Using antisense oligonucleotides, we validate this variant's effect by blocking the implicated YBX3 binding site, leading to exon skipping in the gene MS4A3.
Affiliation(s)
- David Wang
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania
  - Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania
- Matthew R. Gazzara
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania
  - Graduate Group in Genomics and Computational Biology, Perelman School of Medicine, University of Pennsylvania
- San Jewell
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania
- Peter S. Choi
  - Department of Pathology & Laboratory Medicine, Perelman School of Medicine, University of Pennsylvania
  - Division of Cancer Pathobiology, The Children’s Hospital of Philadelphia
- Yoseph Barash
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania
  - Department of Computer and Information Sciences, School of Engineering, University of Pennsylvania
7
Xu C, Bao S, Wang Y, Li W, Chen H, Shen Y, Jiang T, Zhang C. Reference-informed prediction of alternative splicing and splicing-altering mutations from sequences. Genome Res 2024; 34:1052-1065. [PMID: 39060028 PMCID: PMC11368187 DOI: 10.1101/gr.279044.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/28/2024] [Accepted: 07/18/2024] [Indexed: 07/28/2024]
Abstract
Alternative splicing plays a crucial role in protein diversity and gene expression regulation in higher eukaryotes, and mutations causing dysregulated splicing underlie a range of genetic diseases. Computational prediction of alternative splicing from genomic sequences not only provides insight into gene-regulatory mechanisms but also helps identify disease-causing mutations and drug targets. However, the current methods for the quantitative prediction of splice site usage still have limited accuracy. Here, we present DeltaSplice, a deep neural network model optimized to learn the impact of mutations on quantitative changes in alternative splicing from the comparative analysis of homologous genes. The model architecture enables DeltaSplice to perform "reference-informed prediction" by incorporating the known splice site usage of a reference gene sequence to improve its prediction on splicing-altering mutations. We benchmarked DeltaSplice and several other state-of-the-art methods on various prediction tasks, including evolutionary sequence divergence on lineage-specific splicing and splicing-altering mutations in human populations and neurodevelopmental disorders, and demonstrated that DeltaSplice outperformed consistently. DeltaSplice predicted ∼15% of splicing quantitative trait loci (sQTLs) in the human brain as causal splicing-altering variants. It also predicted splicing-altering de novo mutations outside the splice sites in a subset of patients affected by autism and other neurodevelopmental disorders (NDDs), including 19 genes with recurrent splicing-altering mutations. Integration of splicing-altering mutations with other types of de novo mutation burdens allowed the prediction of eight novel NDD-risk genes. Our work expanded the capacity of in silico splicing models with potential applications in genetic diagnosis and the development of splicing-based precision medicine.
Affiliation(s)
- Chencheng Xu
  - Bioinformatics Division, BNRIST, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
- Suying Bao
  - Department of Systems Biology, Columbia University, New York, New York 10032, USA
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
- Ye Wang
  - Department of Systems Biology, Columbia University, New York, New York 10032, USA
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
- Wenxing Li
  - Department of Systems Biology, Columbia University, New York, New York 10032, USA
  - Department of Biomedical Informatics, Columbia University, New York, New York 10032, USA
- Hao Chen
  - Department of Computer Science and Engineering, University of California, Riverside, California 92521, USA
- Yufeng Shen
  - Department of Systems Biology, Columbia University, New York, New York 10032, USA
  - Department of Biomedical Informatics, Columbia University, New York, New York 10032, USA
- Tao Jiang
  - Bioinformatics Division, BNRIST, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  - Department of Computer Science and Engineering, University of California, Riverside, California 92521, USA
- Chaolin Zhang
  - Department of Systems Biology, Columbia University, New York, New York 10032, USA
  - Department of Biochemistry and Molecular Biophysics, Columbia University, New York, New York 10032, USA
8
Tian H, Tang L, Yang Z, Xiang Y, Min Q, Yin M, You H, Xiao Z, Shen J. Current understanding of functional peptides encoded by lncRNA in cancer. Cancer Cell Int 2024; 24:252. [PMID: 39030557 PMCID: PMC11265036 DOI: 10.1186/s12935-024-03446-7] [Citation(s) in RCA: 11] [Impact Index Per Article: 11.0] [Received: 10/20/2023] [Accepted: 07/09/2024] [Indexed: 07/21/2024] Open
Abstract
Dysregulated gene expression and imbalance of transcriptional regulation are typical features of cancer. RNA always plays a key role in these processes. Human transcripts contain many RNAs without long open reading frames (ORF, > 100 aa) and that are more than 200 bp in length. They are usually regarded as long non-coding RNA (lncRNA) which play an important role in cancer regulation, including chromatin remodeling, transcriptional regulation, translational regulation and as miRNA sponges. With the advancement of ribosome profiling and sequencing technologies, increasing research evidence revealed that some ORFs in lncRNA can also encode peptides and participate in the regulation of multiple organ tumors, which undoubtedly opens a new chapter in the field of lncRNA and oncology research. In this review, we discuss the biological function of lncRNA in tumors, the current methods to evaluate their coding potential and the role of functional small peptides encoded by lncRNA in cancers. Investigating the small peptides encoded by lncRNA and understanding the regulatory mechanisms of these functional peptides may contribute to a deeper understanding of cancer and the development of new targeted anticancer therapies.
Affiliation(s)
- Hua Tian
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
  - School of Nursing, Chongqing College of Humanities, Science & Technology, Chongqing, China
- Lu Tang
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
- Zihan Yang
  - Department of Pathology, The Affiliated Hospital of Southwest Medical University, Luzhou, 646000, China
- Yanxi Xiang
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
- Qi Min
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
- Mengshuang Yin
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
- Huili You
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
- Zhangang Xiao
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
  - Gulin Traditional Chinese Medicine Hospital, Luzhou, China
  - Department of Pharmacology, School of Pharmacy, Sichuan College of Traditional Chinese Medicine, Mianyang, China
- Jing Shen
  - Laboratory of Molecular Pharmacology, Department of Pharmacology, School of Pharmacy, Southwest Medical University, Luzhou, 646000, China
  - Cell Therapy and Cell Drugs of Luzhou Key Laboratory, Luzhou, 646000, China
  - South Sichuan Institute of Translational Medicine, Luzhou, 646000, China
9
Apostolides M, Choi B, Navickas A, Saberi A, Soto LM, Goodarzi H, Najafabadi HS. Accurate isoform quantification by joint short- and long-read RNA-sequencing. bioRxiv 2024:2024.07.11.603067. [PMID: 39026819 PMCID: PMC11257535 DOI: 10.1101/2024.07.11.603067] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2024]
Abstract
Accurate quantification of transcript isoforms is crucial for understanding gene regulation, functional diversity, and cellular behavior. Existing RNA sequencing methods have significant limitations: short-read (SR) sequencing provides high depth but struggles with isoform deconvolution, whereas long-read (LR) sequencing offers isoform resolution at the cost of lower depth, higher noise, and technical biases. Addressing this gap, we introduce Multi-Platform Aggregation and Quantification of Transcripts (MPAQT), a generative model that combines the complementary strengths of different sequencing platforms to achieve state-of-the-art isoform-resolved transcript quantification, as demonstrated by extensive simulations and experimental benchmarks. By applying MPAQT to an in vitro model of human embryonic stem cell differentiation into cortical neurons, followed by machine learning-based modeling of transcript abundances, we show that untranslated regions (UTRs) are major determinants of isoform proportion and exon usage; this effect is mediated through isoform-specific sequence features embedded in UTRs, which likely interact with RNA-binding proteins that modulate mRNA stability. These findings highlight MPAQT's potential to enhance our understanding of transcriptomic complexity and underline the role of splicing-independent post-transcriptional mechanisms in shaping the isoform and exon usage landscape of the cell.
Affiliation(s)
- Michael Apostolides
  - Department of Human Genetics, McGill University, Montreal, QC, Canada
  - Victor P. Dahdaleh Institute of Genomic Medicine, Montreal, QC, Canada
- Benedict Choi
  - Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, CA, USA
  - Department of Urology, University of California, San Francisco, San Francisco, CA, USA
  - Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Albertas Navickas
  - Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, CA, USA
  - Department of Urology, University of California, San Francisco, San Francisco, CA, USA
  - Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Present address: Institut Curie, PSL Research University, CNRS UMR3348, INSERM U1278, Orsay, France
- Ali Saberi
  - Victor P. Dahdaleh Institute of Genomic Medicine, Montreal, QC, Canada
  - Department of Electrical and Computer Engineering, McGill University, Montreal, Canada
- Larisa M. Soto
  - Department of Human Genetics, McGill University, Montreal, QC, Canada
  - Victor P. Dahdaleh Institute of Genomic Medicine, Montreal, QC, Canada
- Hani Goodarzi
  - Department of Biochemistry and Biophysics, University of California, San Francisco, San Francisco, CA, USA
  - Department of Urology, University of California, San Francisco, San Francisco, CA, USA
  - Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
  - Arc Institute, 3181 Porter Drive, Palo Alto, CA, USA
- Hamed S. Najafabadi
  - Department of Human Genetics, McGill University, Montreal, QC, Canada
  - Victor P. Dahdaleh Institute of Genomic Medicine, Montreal, QC, Canada
  - McGill Centre for RNA Sciences, McGill University, Montreal, Canada
10
Hwang H, Jeon H, Yeo N, Baek D. Big data and deep learning for RNA biology. Exp Mol Med 2024; 56:1293-1321. [PMID: 38871816 PMCID: PMC11263376 DOI: 10.1038/s12276-024-01243-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 01/31/2024] [Revised: 02/27/2024] [Accepted: 03/05/2024] [Indexed: 06/15/2024] Open
Abstract
The exponential growth of big data in RNA biology (RB) has led to the development of deep learning (DL) models that have driven crucial discoveries. As constantly evidenced by DL studies in other fields, the successful implementation of DL in RB depends heavily on the effective utilization of large-scale datasets from public databases. In achieving this goal, data encoding methods, learning algorithms, and techniques that align well with biological domain knowledge have played pivotal roles. In this review, we provide guiding principles for applying these DL concepts to various problems in RB by demonstrating successful examples and associated methodologies. We also discuss the remaining challenges in developing DL models for RB and suggest strategies to overcome these challenges. Overall, this review aims to illuminate the compelling potential of DL for RB and ways to apply this powerful technology to investigate the intriguing biology of RNA more effectively.
Affiliation(s)
- Hyeonseo Hwang
  - School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Hyeonseong Jeon
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
  - Genome4me Inc., Seoul, Republic of Korea
- Nagyeong Yeo
  - School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
- Daehyun Baek
  - School of Biological Sciences, Seoul National University, Seoul, Republic of Korea
  - Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Republic of Korea
  - Genome4me Inc., Seoul, Republic of Korea
11
Xu C, Bao S, Chen H, Jiang T, Zhang C. Reference-informed prediction of alternative splicing and splicing-altering mutations from sequences. bioRxiv 2024:2024.03.22.586363. [PMID: 38586002 PMCID: PMC10996483 DOI: 10.1101/2024.03.22.586363] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/09/2024]
Abstract
Alternative splicing plays a crucial role in protein diversity and gene expression regulation in higher eukaryotes and mutations causing dysregulated splicing underlie a range of genetic diseases. Computational prediction of alternative splicing from genomic sequences not only provides insight into gene-regulatory mechanisms but also helps identify disease-causing mutations and drug targets. However, the current methods for the quantitative prediction of splice site usage still have limited accuracy. Here, we present DeltaSplice, a deep neural network model optimized to learn the impact of mutations on quantitative changes in alternative splicing from the comparative analysis of homologous genes. The model architecture enables DeltaSplice to perform "reference-informed prediction" by incorporating the known splice site usage of a reference gene sequence to improve its prediction on splicing-altering mutations. We benchmarked DeltaSplice and several other state-of-the-art methods on various prediction tasks, including evolutionary sequence divergence on lineage-specific splicing and splicing-altering mutations in human populations and neurodevelopmental disorders, and demonstrated that DeltaSplice outperformed consistently. DeltaSplice predicted ~15% of splicing quantitative trait loci (sQTLs) in the human brain as causal splicing-altering variants. It also predicted splicing-altering de novo mutations outside the splice sites in a subset of patients affected by autism and other neurodevelopmental disorders, including 19 genes with recurrent splicing-altering mutations. Among the new candidate disease risk genes, MFN1 is involved in mitochondria fusion, which is frequently disrupted in autism patients. Our work expanded the capacity of in silico splicing models with potential applications in genetic diagnosis and the development of splicing-based precision medicine.
Affiliation(s)
- Chencheng Xu
  - Bioinformatics Division, BNRIST, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  - Present address: Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Saudi Arabia
- Suying Bao
  - Department of Systems Biology, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA
  - Present address: Regeneron Pharmaceuticals, Tarrytown, NY 10591, USA
- Hao Chen
  - Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
  - Present address: Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Tao Jiang
  - Bioinformatics Division, BNRIST, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  - Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- Chaolin Zhang
  - Department of Systems Biology, Department of Biochemistry and Molecular Biophysics, Columbia University, New York, NY 10032, USA

12
Zhou Z, Zhang J, Zheng X, Pan Z, Zhao F, Gao Y. CIRI-Deep Enables Single-Cell and Spatial Transcriptomic Analysis of Circular RNAs with Deep Learning. Adv Sci (Weinh) 2024; 11:e2308115. [PMID: 38308181 PMCID: PMC11005702 DOI: 10.1002/advs.202308115] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2023] [Revised: 01/03/2024] [Indexed: 02/04/2024]
Abstract
Circular RNAs (circRNAs) are a crucial yet relatively unexplored class of transcripts known for their tissue- and cell-type-specific expression patterns. Despite the advances in single-cell and spatial transcriptomics, these technologies face difficulties in effectively profiling circRNAs due to inherent limitations in circRNA sequencing efficiency. To address this gap, a deep learning model, CIRI-deep, is presented for comprehensive prediction of circRNA regulation on diverse types of RNA-seq data. CIRI-deep is trained on an extensive dataset of 25 million high-confidence circRNA regulation events and achieved high performances on both test and leave-out data, ensuring its accuracy in inferring differential events from RNA-seq data. It is demonstrated that CIRI-deep and its adapted version enable various circRNA analyses, including cluster- or region-specific circRNA detection, BSJ ratio map visualization, and trans and cis feature importance evaluation. Collectively, CIRI-deep's adaptability extends to all major types of RNA-seq datasets including single-cell and spatial transcriptomic data, which will undoubtedly broaden the horizons of circRNA research.
Affiliation(s)
- Zihan Zhou
  - National Genomics Data Center & CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100101, China
- Jinyang Zhang
  - Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100101, China
- Xin Zheng
  - National Genomics Data Center & CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100101, China
- Zhicheng Pan
  - Center for Computational Biology, Flatiron Institute, New York 10010, USA
- Fangqing Zhao
  - Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100101, China
- Yuan Gao
  - National Genomics Data Center & CAS Key Laboratory of Genome Sciences and Information, Beijing Institute of Genomics, Chinese Academy of Sciences and China National Center for Bioinformation, Beijing 100101, China
  - University of Chinese Academy of Sciences, Beijing 100101, China

13
Yan Y, Li W, Wang S, Huang T. Seq-RBPPred: Predicting RNA-Binding Proteins from Sequence. ACS Omega 2024; 9:12734-12742. [PMID: 38524500 PMCID: PMC10955590 DOI: 10.1021/acsomega.3c08381] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2023] [Revised: 12/18/2023] [Accepted: 12/28/2023] [Indexed: 03/26/2024]
Abstract
RNA-binding proteins (RBPs) can interact with RNAs to regulate RNA translation, modification, splicing, and other important biological processes. The accurate identification of RBPs is of paramount importance for gaining insights into the intricate mechanisms underlying organismal life activities. Traditional experimental methods to identify RBPs require a lot of time and money, so it is important to develop computational methods to predict RBPs. However, the existing approaches for RBP prediction still require further improvement due to unidentified RBPs in many species. In this study, we present Seq-RBPPred (predicting RBPs from sequence), a novel method that utilizes a comprehensive feature representation encompassing both biophysical properties and hidden-state features derived from protein sequences. Comprehensive performance evaluations of Seq-RBPPred demonstrate its superiority compared with state-of-the-art methods, with an overall accuracy of 0.922, a sensitivity of 0.926, a specificity of 0.903, and a Matthews correlation coefficient (MCC) of 0.757 on the testing set. The data and code of Seq-RBPPred are available at https://github.com/yaoyao-11/Seq-RBPPred.
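For readers less familiar with the four metrics quoted in this abstract, the following minimal sketch shows how accuracy, sensitivity, specificity, and MCC are typically derived from a binary confusion matrix. The function name and the counts in the usage line are hypothetical and chosen only for illustration; they are not taken from the Seq-RBPPred test set or code.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Compute standard binary classification metrics from confusion-matrix counts.

    tp/fp/tn/fn are hypothetical counts of true positives, false positives,
    true negatives, and false negatives (e.g. RBP vs non-RBP predictions).
    """
    def safe_div(num, den):
        # Return 0.0 instead of raising when a class is absent.
        return num / den if den else 0.0

    accuracy = safe_div(tp + tn, tp + fp + tn + fn)
    sensitivity = safe_div(tp, tp + fn)   # true positive rate (recall)
    specificity = safe_div(tn, tn + fp)   # true negative rate
    # Matthews correlation coefficient: ranges from -1 to 1, and is
    # less sensitive to class imbalance than accuracy alone.
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = safe_div(tp * tn - fp * fn, mcc_den)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "mcc": mcc,
    }


# Illustrative counts only (not from the paper).
print(binary_metrics(tp=40, fp=10, tn=40, fn=10))
```

Reporting MCC alongside accuracy is a common choice when the positive and negative classes differ in size, because MCC uses all four cells of the confusion matrix.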
Affiliation(s)
- Yuyao Yan
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Wenran Li
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Sijia Wang
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China
- Tao Huang
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, University of Chinese Academy of Sciences, Shanghai 200021, China

14
Arnold B, Riegger RJ, Okuda EK, Slišković I, Keller M, Bakisoglu C, McNicoll F, Zarnack K, Müller-McNicoll M. hGRAD: A versatile "one-fits-all" system to acutely deplete RNA binding proteins from condensates. J Cell Biol 2024; 223:e202304030. [PMID: 38108808 PMCID: PMC10726014 DOI: 10.1083/jcb.202304030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/12/2023] [Revised: 09/18/2023] [Accepted: 11/21/2023] [Indexed: 12/19/2023]
Abstract
Nuclear RNA binding proteins (RBPs) are difficult to study because they often belong to large protein families and form extensive networks of auto- and crossregulation. They are highly abundant and many localize to condensates with a slow turnover, requiring long depletion times or knockouts that cannot distinguish between direct and indirect or compensatory effects. Here, we developed a system that is optimized for the rapid degradation of nuclear RBPs, called hGRAD. It comes as a "one-fits-all" plasmid, and integration into any cell line with endogenously GFP-tagged proteins allows for an inducible, rapid, and complete knockdown. We show that the nuclear RBPs SRSF3, SRSF5, SRRM2, and NONO are completely cleared from nuclear speckles and paraspeckles within 2 h. hGRAD works in various cell types, is more efficient than previous methods, and does not require the expression of exogenous ubiquitin ligases. Combining SRSF5 hGRAD degradation with Nascent-seq uncovered transient transcript changes, compensatory mechanisms, and an effect of SRSF5 on transcript stability.
Affiliation(s)
- Benjamin Arnold
  - Institute of Molecular Biosciences, Goethe University Frankfurt, Frankfurt am Main, Germany
- Ricarda J. Riegger
  - Institute of Molecular Biosciences, Goethe University Frankfurt, Frankfurt am Main, Germany
- Ellen Kazumi Okuda
  - Institute of Molecular Biosciences, Goethe University Frankfurt, Frankfurt am Main, Germany
  - International Max Planck Research School for Cellular Biophysics, Frankfurt am Main, Germany
- Irena Slišković
  - Institute of Molecular Biosciences, Goethe University Frankfurt, Frankfurt am Main, Germany
- Mario Keller
  - Institute of Molecular Biosciences, Goethe University Frankfurt, Frankfurt am Main, Germany
  - Buchmann Institute for Molecular Life Sciences, Goethe University Frankfurt, Frankfurt am Main, Germany
- Cem Bakisoglu
  - Institute of Molecular Biosciences, Goethe University Frankfurt, Frankfurt am Main, Germany
  - Buchmann Institute for Molecular Life Sciences, Goethe University Frankfurt, Frankfurt am Main, Germany
- François McNicoll
  - Institute of Molecular Biosciences, Goethe University Frankfurt, Frankfurt am Main, Germany
- Kathi Zarnack
  - Institute of Molecular Biosciences, Goethe University Frankfurt, Frankfurt am Main, Germany
  - Buchmann Institute for Molecular Life Sciences, Goethe University Frankfurt, Frankfurt am Main, Germany

15
Knudsen JE, Rich JM, Ma R. Artificial Intelligence in Pathomics and Genomics of Renal Cell Carcinoma. Urol Clin North Am 2024; 51:47-62. [PMID: 37945102 DOI: 10.1016/j.ucl.2023.06.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 11/12/2023]
Abstract
The integration of artificial intelligence (AI) with histopathology images and gene expression patterns has led to the emergence of the dynamic fields of pathomics and genomics. These fields have revolutionized renal cell carcinoma (RCC) diagnosis and subtyping and improved survival prediction models. Machine learning has identified unique gene patterns across RCC subtypes and grades, providing insights into RCC origins and potential treatments, such as targeted therapies. The combination of pathomics and genomics using AI opens new avenues in RCC research, promising future breakthroughs and innovations that patients and physicians can anticipate.
Affiliation(s)
- J Everett Knudsen
  - Catherine & Joseph Aresty Department of Urology, USC Institute of Urology, Center for Robotic Simulation & Education, University of Southern California, Los Angeles, CA, USA
- Joseph M Rich
  - Catherine & Joseph Aresty Department of Urology, USC Institute of Urology, Center for Robotic Simulation & Education, University of Southern California, Los Angeles, CA, USA
- Runzhuo Ma
  - Catherine & Joseph Aresty Department of Urology, USC Institute of Urology, Center for Robotic Simulation & Education, University of Southern California, Los Angeles, CA, USA

16
Gupta K, Yang C, McCue K, Bastani O, Sharp PA, Burge CB, Solar-Lezama A. Improved modeling of RNA-binding protein motifs in an interpretable neural model of RNA splicing. Genome Biol 2024; 25:23. [PMID: 38229106 PMCID: PMC10790492 DOI: 10.1186/s13059-023-03162-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/25/2023] [Accepted: 12/28/2023] [Indexed: 01/18/2024]
Abstract
Sequence-specific RNA-binding proteins (RBPs) play central roles in splicing decisions. Here, we describe a modular splicing architecture that leverages in vitro-derived RNA affinity models for 79 human RBPs and the annotated human genome to produce improved models of RBP binding and activity. Binding and activity are modeled by separate Motif and Aggregator components that can be mixed and matched, enforcing sparsity to improve interpretability. Training a new Adjusted Motif (AM) architecture on the splicing task not only yields better splicing predictions but also improves prediction of RBP-binding sites in vivo and of splicing activity, assessed using independent data.
Affiliation(s)
- Kavi Gupta
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Chenxi Yang
  - Department of Computer Science, University of Texas at Austin, Austin, TX, 78712, USA
- Kayla McCue
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Osbert Bastani
  - Department of Computer and Information Science, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Phillip A Sharp
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
  - Koch Institute of Integrative Cancer Research, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Christopher B Burge
  - Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
- Armando Solar-Lezama
  - Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA

17
Dwivedi SL, Quiroz LF, Reddy ASN, Spillane C, Ortiz R. Alternative Splicing Variation: Accessing and Exploiting in Crop Improvement Programs. Int J Mol Sci 2023; 24:15205. [PMID: 37894886 PMCID: PMC10607462 DOI: 10.3390/ijms242015205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Revised: 10/09/2023] [Accepted: 10/10/2023] [Indexed: 10/29/2023]
Abstract
Alternative splicing (AS) is a gene regulatory mechanism modulating gene expression in multiple ways. AS is prevalent in all eukaryotes including plants. AS generates two or more mRNAs from the precursor mRNA (pre-mRNA) to regulate transcriptome complexity and proteome diversity. Advances in next-generation sequencing, omics technology, bioinformatics tools, and computational methods provide new opportunities to quantify and visualize AS-based quantitative trait variation associated with plant growth, development, reproduction, and stress tolerance. Domestication, polyploidization, and environmental perturbation may evolve novel splicing variants associated with agronomically beneficial traits. To date, pre-mRNAs from many genes are spliced into multiple transcripts that cause phenotypic variation for complex traits, both in model plant Arabidopsis and field crops. Cataloguing and exploiting such variation may provide new paths to enhance climate resilience, resource-use efficiency, productivity, and nutritional quality of staple food crops. This review provides insights into AS variation alongside a gene expression analysis to select for novel phenotypic diversity for use in breeding programs. AS contributes to heterosis, enhances plant symbiosis (mycorrhiza and rhizobium), and provides a mechanistic link between the core clock genes and diverse environmental clues.
Affiliation(s)
- Luis Felipe Quiroz
  - Agriculture and Bioeconomy Research Centre, Ryan Institute, University of Galway, University Road, H91 REW4 Galway, Ireland
- Anireddy S N Reddy
  - Department of Biology and Program in Cell and Molecular Biology, Colorado State University, Fort Collins, CO 80523, USA
- Charles Spillane
  - Agriculture and Bioeconomy Research Centre, Ryan Institute, University of Galway, University Road, H91 REW4 Galway, Ireland
- Rodomiro Ortiz
  - Department of Plant Breeding, Swedish University of Agricultural Sciences, 23053 Alnarp, SE, Sweden

18
Zhao F, Yan Y, Wang Y, Liu Y, Yang R. Splicing complexity as a pivotal feature of alternative exons in mammalian species. BMC Genomics 2023; 24:198. [PMID: 37046221 PMCID: PMC10099729 DOI: 10.1186/s12864-023-09247-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Accepted: 03/14/2023] [Indexed: 04/14/2023]
Abstract
BACKGROUND As a significant process of post-transcriptional gene expression regulation in eukaryotic cells, alternative splicing (AS) of exons greatly contributes to the complexity of the transcriptome and indirectly enriches the protein repertoires. A large number of studies have focused on the splicing inclusion of alternative exons and have revealed the roles of AS in organ development and maturation. Notably, AS takes place through a change in the relative abundance of the transcript isoforms produced by a single gene, meaning that exons can have complex splicing patterns. However, the commonly used percent spliced-in (Ψ) values only define the usage rate of exons, but lose information about the complexity of exons' linkage pattern. To date, the extent and functional consequence of splicing complexity of alternative exons in development and evolution is poorly understood.
RESULTS By comparing splicing complexity of exons in six tissues (brain, cerebellum, heart, liver, kidney, and testis) from six mammalian species (human, chimpanzee, gorilla, macaque, mouse, opossum) and an outgroup species (chicken), we revealed that exons with high splicing complexity are prevalent in mammals and are closely related to features of genes. Using traditional machine learning and deep learning methods, we found that the splicing complexity of exons can be moderately predicted with features derived from exons, among which length of flanking exons and splicing strength of downstream/upstream splice sites are top predictors. Comparative analysis among human, chimpanzee, gorilla, macaque, and mouse revealed that alternative exons tend to evolve to an increased level of splicing complexity and higher tissue specificity in splicing complexity. During organ development, not only developmentally regulated exons, but also 10-15% of non-developmentally regulated exons show dynamic splicing complexity.
CONCLUSIONS Our analysis revealed that splicing complexity is an important metric to characterize the splicing dynamics of alternative exons during the development and evolution of mammals.
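The contrast this abstract draws between Ψ and linkage-pattern complexity can be illustrated with a small sketch. Ψ is commonly estimated as inclusion reads divided by inclusion plus exclusion reads. The entropy-based score below is a hypothetical stand-in used only to show how two exons with identical Ψ can differ in how many distinct junctions support inclusion; it is an assumption for illustration and not the splicing-complexity metric defined by the authors.

```python
import math


def percent_spliced_in(inclusion_reads, exclusion_reads):
    """Estimate PSI (Ψ) as the fraction of reads supporting exon inclusion."""
    total = inclusion_reads + exclusion_reads
    if total == 0:
        raise ValueError("no reads cover this exon")
    return inclusion_reads / total


def junction_entropy(junction_counts):
    """Shannon entropy (bits) of read counts across distinct junctions.

    Hypothetical illustration only: higher values mean inclusion reads are
    spread over more linkage patterns. Not the metric used in the paper.
    """
    total = sum(junction_counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in junction_counts if c > 0]
    return sum(-p * math.log2(p) for p in probs)


# Exon A: all 50 inclusion reads use one upstream junction.
# Exon B: 50 inclusion reads split evenly over two upstream junctions.
# Both exons have 50 exclusion (skipping) reads, so Ψ is identical.
psi_a = percent_spliced_in(50, 50)
psi_b = percent_spliced_in(25 + 25, 50)
print(psi_a, psi_b)                                      # same Ψ for both exons
print(junction_entropy([50]), junction_entropy([25, 25]))  # different linkage spread
```

The read counts are invented for the example; with real data they would come from junction-spanning reads in an RNA-seq alignment.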
Affiliation(s)
- Feiyang Zhao
  - College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, China
- Yubin Yan
  - College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, China
- Yaxi Wang
  - College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, China
- Yuan Liu
  - College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, China
- Ruolin Yang
  - College of Life Sciences, Northwest A&F University, Yangling, Shaanxi, China

19
Rogalska ME, Vivori C, Valcárcel J. Regulation of pre-mRNA splicing: roles in physiology and disease, and therapeutic prospects. Nat Rev Genet 2023; 24:251-269. [PMID: 36526860 DOI: 10.1038/s41576-022-00556-8] [Citation(s) in RCA: 109] [Impact Index Per Article: 54.5] [Accepted: 11/10/2022] [Indexed: 12/23/2022]
Abstract
The removal of introns from mRNA precursors and its regulation by alternative splicing are key for eukaryotic gene expression and cellular function, as evidenced by the numerous pathologies induced or modified by splicing alterations. Major recent advances have been made in understanding the structures and functions of the splicing machinery, in the description and classification of physiological and pathological isoforms and in the development of the first therapies for genetic diseases based on modulation of splicing. Here, we review this progress and discuss important remaining challenges, including predicting splice sites from genomic sequences, understanding the variety of molecular mechanisms and logic of splicing regulation, and harnessing this knowledge for probing gene function and disease aetiology and for the design of novel therapeutic approaches.
Affiliation(s)
- Malgorzata Ewa Rogalska
  - Genome Biology Program, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- Claudia Vivori
  - Genome Biology Program, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - The Francis Crick Institute, London, UK
- Juan Valcárcel
  - Genome Biology Program, Centre for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
  - Department of Medicine and Life Sciences, Universitat Pompeu Fabra (UPF), Barcelona, Spain
  - Institució Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain

20
Ullah F, Jabeen S, Salton M, Reddy ASN, Ben-Hur A. Evidence for the role of transcription factors in the co-transcriptional regulation of intron retention. Genome Biol 2023; 24:53. [PMID: 36949544 PMCID: PMC10031921 DOI: 10.1186/s13059-023-02885-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 01/03/2022] [Accepted: 02/16/2023] [Indexed: 03/24/2023]
Abstract
BACKGROUND Alternative splicing is a widespread regulatory phenomenon that enables a single gene to produce multiple transcripts. Among the different types of alternative splicing, intron retention is one of the least explored despite its high prevalence in both plants and animals. The recent discovery that the majority of splicing is co-transcriptional has led to the finding that chromatin state affects alternative splicing. Therefore, it is plausible that transcription factors can regulate splicing outcomes.
RESULTS We provide evidence for the hypothesis that transcription factors are involved in the regulation of intron retention by studying regions of open chromatin in retained and excised introns. Using deep learning models designed to distinguish between regions of open chromatin in retained introns and non-retained introns, we identified motifs enriched in IR events with significant hits to known human transcription factors. Our model predicts that the majority of transcription factors that affect intron retention come from the zinc finger family. We demonstrate the validity of these predictions using ChIP-seq data for multiple zinc finger transcription factors and find strong over-representation for their peaks in intron retention events.
CONCLUSIONS This work opens up opportunities for further studies that elucidate the mechanisms by which transcription factors affect intron retention and other forms of splicing.
AVAILABILITY Source code available at https://github.com/fahadahaf/chromir.
Affiliation(s)
- Fahad Ullah
  - Department of Computer Science, Colorado State University, Fort Collins, CO, USA
- Saira Jabeen
  - Department of Computer Science, Colorado State University, Fort Collins, CO, USA
- Maayan Salton
  - Department of Biology, Colorado State University, Fort Collins, CO, USA
- Anireddy S N Reddy
  - Biochemistry and Molecular Biology Department, The Hebrew University Faculty of Medicine, Jerusalem, Israel
- Asa Ben-Hur
  - Department of Computer Science, Colorado State University, Fort Collins, CO, USA

21
Horn T, Gosliga A, Li C, Enculescu M, Legewie S. Position-dependent effects of RNA-binding proteins in the context of co-transcriptional splicing. NPJ Syst Biol Appl 2023; 9:1. [PMID: 36653378 PMCID: PMC9849329 DOI: 10.1038/s41540-022-00264-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/21/2021] [Accepted: 12/08/2022] [Indexed: 01/19/2023]
Abstract
Alternative splicing is an important step in eukaryotic mRNA pre-processing which increases the complexity of gene expression programs, but is frequently altered in disease. Previous work on the regulation of alternative splicing has demonstrated that splicing is controlled by RNA-binding proteins (RBPs) and by epigenetic DNA/histone modifications which affect splicing by changing the speed of polymerase-mediated pre-mRNA transcription. The interplay of these different layers of splicing regulation is poorly understood. In this paper, we derived mathematical models describing how splicing decisions in a three-exon gene are made by combinatorial spliceosome binding to splice sites during ongoing transcription. We additionally take into account the effect of a regulatory RBP and find that the RBP binding position within the sequence is a key determinant of how RNA polymerase velocity affects splicing. Based on these results, we explain paradoxical observations in the experimental literature and further derive rules explaining why the same RBP can act as inhibitor or activator of cassette exon inclusion depending on its binding position. Finally, we derive a stochastic description of co-transcriptional splicing regulation at the single-cell level and show that splicing outcomes show little noise and follow a binomial distribution despite complex regulation by a multitude of factors. Taken together, our simulations demonstrate the robustness of splicing outcomes and reveal that quantitative insights into kinetic competition of co-transcriptional events are required to fully understand this important mechanism of gene expression diversity.
Affiliation(s)
- Timur Horn
  - Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
- Alison Gosliga
  - Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
  - University of Stuttgart, Department of Systems Biology and Stuttgart Research Center Systems Biology (SRCSB), Allmandring 31, 70569, Stuttgart, Germany
- Congxin Li
  - University of Stuttgart, Department of Systems Biology and Stuttgart Research Center Systems Biology (SRCSB), Allmandring 31, 70569, Stuttgart, Germany
- Mihaela Enculescu
  - Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
- Stefan Legewie
  - Institute of Molecular Biology (IMB), Ackermannweg 4, 55128, Mainz, Germany
  - University of Stuttgart, Department of Systems Biology and Stuttgart Research Center Systems Biology (SRCSB), Allmandring 31, 70569, Stuttgart, Germany

22
Caudai C, Galizia A, Geraci F, Le Pera L, Morea V, Salerno E, Via A, Colombo T. AI applications in functional genomics. Comput Struct Biotechnol J 2021; 19:5762-5790. [PMID: 34765093 PMCID: PMC8566780 DOI: 10.1016/j.csbj.2021.10.009] [Citation(s) in RCA: 42] [Impact Index Per Article: 10.5] [Received: 04/16/2021] [Revised: 10/05/2021] [Accepted: 10/05/2021] [Indexed: 12/13/2022]
Abstract
We review the current applications of artificial intelligence (AI) in functional genomics. The recent explosion of AI follows the remarkable achievements made possible by "deep learning", along with a burst of "big data" that can meet its hunger. Biology is about to overthrow astronomy as the paradigmatic representative of big data producer. This has been made possible by huge advancements in the field of high throughput technologies, applied to determine how the individual components of a biological system work together to accomplish different processes. The disciplines contributing to this bulk of data are collectively known as functional genomics. They consist in studies of: i) the information contained in the DNA (genomics); ii) the modifications that DNA can reversibly undergo (epigenomics); iii) the RNA transcripts originated by a genome (transcriptomics); iv) the ensemble of chemical modifications decorating different types of RNA transcripts (epitranscriptomics); v) the products of protein-coding transcripts (proteomics); and vi) the small molecules produced from cell metabolism (metabolomics) present in an organism or system at a given time, in physiological or pathological conditions. After reviewing main applications of AI in functional genomics, we discuss important accompanying issues, including ethical, legal and economic issues and the importance of explainability.
Collapse
Affiliation(s)
- Claudia Caudai
- CNR, Institute of Information Science and Technologies “A. Faedo” (ISTI), Pisa, Italy
| | - Antonella Galizia
- CNR, Institute of Applied Mathematics and Information Technologies (IMATI), Genoa, Italy
| | - Filippo Geraci
- CNR, Institute for Informatics and Telematics (IIT), Pisa, Italy
| | - Loredana Le Pera
- CNR, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies (IBIOM), Bari, Italy
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| | - Veronica Morea
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| | - Emanuele Salerno
- CNR, Institute of Information Science and Technologies “A. Faedo” (ISTI), Pisa, Italy
| | - Allegra Via
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
| | - Teresa Colombo
- CNR, Institute of Molecular Biology and Pathology (IBPM), Rome, Italy
|
23
|
Huang K, Xiao C, Glass LM, Critchlow CW, Gibson G, Sun J. Machine learning applications for therapeutic tasks with genomics data. Patterns (N Y) 2021; 2:100328. [PMID: 34693370 PMCID: PMC8515011 DOI: 10.1016/j.patter.2021.100328] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Indexed: 12/29/2022]
Abstract
Thanks to the increasing availability of genomics and other biomedical data, many machine learning algorithms have been proposed for a wide range of therapeutic discovery and development tasks. In this survey, we review the literature on machine learning applications for genomics through the lens of therapeutic development. We investigate the interplay among genomics, compounds, proteins, electronic health records, cellular images, and clinical texts. We identify 22 machine learning in genomics applications that span the whole therapeutics pipeline, from discovering novel targets, personalizing medicine, developing gene-editing tools, all the way to facilitating clinical trials and post-market studies. We also pinpoint seven key challenges in this field with potentials for expansion and impact. This survey examines recent research at the intersection of machine learning, genomics, and therapeutic development.
Affiliation(s)
- Kexin Huang
- Department of Computer Science, Stanford University, Stanford, CA 94305, USA
| | - Cao Xiao
- Amplitude, San Francisco, CA 94105, USA
| | - Lucas M. Glass
- Analytics Center of Excellence, IQVIA, Cambridge, MA 02139, USA
| | | | - Greg Gibson
- Center for Integrative Genomics, Georgia Institute of Technology, Atlanta, GA 30332, USA
| | - Jimeng Sun
- Computer Science Department and Carle's Illinois College of Medicine, University of Illinois at Urbana-Champaign, Urbana, IL 61820, USA
|
24
|
Raveh B, Sun L, White KL, Sanyal T, Tempkin J, Zheng D, Bharath K, Singla J, Wang C, Zhao J, Li A, Graham NA, Kesselman C, Stevens RC, Sali A. Bayesian metamodeling of complex biological systems across varying representations. Proc Natl Acad Sci U S A 2021; 118:e2104559118. [PMID: 34453000 PMCID: PMC8536362 DOI: 10.1073/pnas.2104559118] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Indexed: 12/26/2022]
Abstract
Comprehensive modeling of a whole cell requires an integration of vast amounts of information on various aspects of the cell and its parts. To divide and conquer this task, we introduce Bayesian metamodeling, a general approach to modeling complex systems by integrating a collection of heterogeneous input models. Each input model can in principle be based on any type of data and can describe a different aspect of the modeled system using any mathematical representation, scale, and level of granularity. These input models are 1) converted to a standardized statistical representation relying on probabilistic graphical models, 2) coupled by modeling their mutual relations with the physical world, and 3) finally harmonized with respect to each other. To illustrate Bayesian metamodeling, we provide a proof-of-principle metamodel of glucose-stimulated insulin secretion by human pancreatic β-cells. The input models include a coarse-grained spatiotemporal simulation of insulin vesicle trafficking, docking, and exocytosis; a molecular network model of glucose-stimulated insulin secretion signaling; a network model of insulin metabolism; a structural model of glucagon-like peptide-1 receptor activation; a linear model of a pancreatic cell population; and ordinary differential equations for systemic postprandial insulin response. Metamodeling benefits from decentralized computing, while often producing a more accurate, precise, and complete model that contextualizes input models as well as resolves conflicting information. We anticipate Bayesian metamodeling will facilitate collaborative science by providing a framework for sharing expertise, resources, data, and models, as exemplified by the Pancreatic β-Cell Consortium.
Affiliation(s)
- Barak Raveh
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158
- Quantitative Biosciences Institute, University of California, San Francisco, CA 94158
- School of Computer Science and Engineering, The Hebrew University of Jerusalem, Jerusalem 9190416, Israel
| | - Liping Sun
- iHuman Institute, ShanghaiTech University, Shanghai 201210, China
| | - Kate L White
- Department of Biological Sciences, Bridge Institute, University of Southern California, Los Angeles, CA 90089
| | - Tanmoy Sanyal
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158
- Quantitative Biosciences Institute, University of California, San Francisco, CA 94158
| | - Jeremy Tempkin
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158
- Quantitative Biosciences Institute, University of California, San Francisco, CA 94158
| | - Dongqing Zheng
- Mork Family Department of Chemical Engineering and Materials Science, Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90089
| | - Kala Bharath
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158
- Quantitative Biosciences Institute, University of California, San Francisco, CA 94158
| | - Jitin Singla
- Department of Biological Sciences, Bridge Institute, University of Southern California, Los Angeles, CA 90089
- Epstein Department of Industrial and Systems Engineering, The Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90089
- Information Science Institute, The Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90089
| | - Chenxi Wang
- iHuman Institute, ShanghaiTech University, Shanghai 201210, China
- School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
| | - Jihui Zhao
- iHuman Institute, ShanghaiTech University, Shanghai 201210, China
| | - Angdi Li
- iHuman Institute, ShanghaiTech University, Shanghai 201210, China
- School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
| | - Nicholas A Graham
- Mork Family Department of Chemical Engineering and Materials Science, Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90089
| | - Carl Kesselman
- Epstein Department of Industrial and Systems Engineering, The Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90089
- Information Science Institute, The Viterbi School of Engineering, University of Southern California, Los Angeles, CA 90089
| | - Raymond C Stevens
- iHuman Institute, ShanghaiTech University, Shanghai 201210, China
- Department of Biological Sciences, Bridge Institute, University of Southern California, Los Angeles, CA 90089
- School of Life Science and Technology, ShanghaiTech University, Shanghai 201210, China
| | - Andrej Sali
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, CA 94158
- Quantitative Biosciences Institute, University of California, San Francisco, CA 94158
- Department of Pharmaceutical Chemistry, University of California, San Francisco, CA 94158
|
25
|
Abstract
This review surveys the literature on drug discovery using ML tools and techniques, which are applied in every phase of drug development to accelerate research and reduce the risk and cost of clinical trials. Machine learning techniques improve decision-making on pharmaceutical data across applications such as QSAR analysis, hit discovery, and de novo drug design. Target validation, prognostic biomarkers, and digital pathology are considered as problem statements in this review. The main challenge for ML is the limited interpretability of its results, which may restrict its application in drug discovery. In clinical trials, comprehensive and methodical data must be generated to address open problems in validating ML techniques, improving decision-making, promoting awareness of ML approaches, and reducing the risk of failure in drug discovery.
Affiliation(s)
- Suresh Dara
- Department of Computer Science and Engineering, B V Raju Institute of Technology, Narsapur, Medak, 502313 Telangana India
| | - Swetha Dhamercherla
- Department of Computer Science and Engineering, B V Raju Institute of Technology, Narsapur, Medak, 502313 Telangana India
| | - Surender Singh Jadav
- Centre for Molecular Cancer Research (CMCR) and Vishnu Institute of Pharmaceutical Education and Research (VIPER), Narsapur, Medak, 502313 Telangana India
| | - CH Madhu Babu
- Department of Computer Science and Engineering, B V Raju Institute of Technology, Narsapur, Medak, 502313 Telangana India
| | - Mohamed Jawed Ahsan
- Department of Pharmaceutical Chemistry, Maharishi Arvind College of Pharmacy, Jaipur, 302023 Rajasthan India
|
26
|
Zhang Y, Cai Y, Roca X, Kwoh CK, Fullwood MJ. Chromatin loop anchors predict transcript and exon usage. Brief Bioinform 2021; 22:6319936. [PMID: 34263910 PMCID: PMC8575016 DOI: 10.1093/bib/bbab254] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/01/2021] [Revised: 06/16/2021] [Accepted: 05/25/2021] [Indexed: 11/24/2022]
Abstract
Epigenomics and transcriptomics data from high-throughput sequencing techniques such as RNA-seq and ChIP-seq have been successfully applied in predicting gene transcript expression. However, the locations of chromatin loops in the genome identified by techniques such as Chromatin Interaction Analysis with Paired End Tag sequencing (ChIA-PET) have never been used for prediction tasks. Here, we developed machine learning models to investigate whether ChIA-PET could contribute to transcript and exon usage prediction. In doing so, we used a large set of transcription factors as well as ChIA-PET data. We developed different Gradient Boosting Trees models for the different tasks, using integrated datasets from three cell lines: GM12878, HeLaS3 and K562. We validated the models via 10-fold cross validation, chromosome-split validation and cross-cell validation. Our results show that both transcript and splicing-derived exon usage can be effectively predicted, with accuracies of at least 0.7512 and 0.7459, respectively, on all cell lines under all validation schemes. Examining the predictive features, we found that RNA Polymerase II ChIA-PET was one of the most important features in both transcript and exon usage prediction, suggesting that chromatin loop anchors are predictive of both transcript and exon usage.
Affiliation(s)
- Yu Zhang
- School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Singapore
| | - Yichao Cai
- Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Dr, Singapore 117599, Singapore
| | - Xavier Roca
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Dr, Singapore 637551, Singapore
| | - Chee Keong Kwoh
- School of Computer Science and Engineering, Nanyang Technological University, 50 Nanyang Avenue, Singapore, 639798, Singapore
| | - Melissa Jane Fullwood
- Cancer Science Institute of Singapore, National University of Singapore, 14 Medical Dr, Singapore 117599, Singapore
- School of Biological Sciences, Nanyang Technological University, 637551, Singapore
- Institute of Molecular and Cell Biology, Agency for Science, Technology and Research (A*STAR), 61 Biopolis Dr, Singapore 138673, Singapore
|
27
|
Vatansever S, Schlessinger A, Wacker D, Kaniskan HÜ, Jin J, Zhou M, Zhang B. Artificial intelligence and machine learning-aided drug discovery in central nervous system diseases: State-of-the-arts and future directions. Med Res Rev 2021; 41:1427-1473. [PMID: 33295676 PMCID: PMC8043990 DOI: 10.1002/med.21764] [Citation(s) in RCA: 162] [Impact Index Per Article: 40.5] [Received: 07/23/2020] [Revised: 10/30/2020] [Accepted: 11/20/2020] [Indexed: 01/11/2023]
Abstract
Neurological disorders significantly outnumber diseases in other therapeutic areas. However, developing drugs for central nervous system (CNS) disorders remains the most challenging area in drug discovery, with long timelines and high attrition rates. With the rapid growth of biomedical data enabled by advanced experimental technologies, artificial intelligence (AI) and machine learning (ML) have emerged as indispensable tools to draw meaningful insights and improve decision making in drug discovery. Thanks to advances in AI and ML algorithms, AI/ML-driven solutions now have unprecedented potential to accelerate CNS drug discovery with a better success rate. In this review, we comprehensively summarize AI/ML-powered pharmaceutical discovery efforts and their implementations in the CNS area. After introducing the AI/ML models as well as the conceptualization and data preparation, we outline the applications of AI/ML technologies to several key procedures in drug discovery, including target identification, compound screening, hit/lead generation and optimization, drug response and synergy prediction, de novo drug design, and drug repurposing. We review the current state of the art of AI/ML-guided CNS drug discovery, focusing on blood-brain barrier permeability prediction and implementation into therapeutic discovery for neurological diseases. Finally, we discuss the major challenges and limitations of current approaches and possible future directions that may resolve these difficulties.
Affiliation(s)
- Sezen Vatansever
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Mount Sinai Center for Transformative Disease Modeling, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Avner Schlessinger
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Mount Sinai Center for Therapeutics Discovery, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Daniel Wacker
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Mount Sinai Center for Therapeutics Discovery, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Neuroscience, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - H. Ümit Kaniskan
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Mount Sinai Center for Therapeutics Discovery, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Jian Jin
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Mount Sinai Center for Therapeutics Discovery, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Ming‐Ming Zhou
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, New York, USA
| | - Bin Zhang
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Mount Sinai Center for Transformative Disease Modeling, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Icahn Institute for Data Science and Genomic Technology, Icahn School of Medicine at Mount Sinai, New York, New York, USA
- Department of Pharmacological Sciences, Icahn School of Medicine at Mount Sinai, New York, New York, USA
|
28
|
Yang S, Zhu F, Ling X, Liu Q, Zhao P. Intelligent Health Care: Applications of Deep Learning in Computational Medicine. Front Genet 2021; 12:607471. [PMID: 33912213 PMCID: PMC8075004 DOI: 10.3389/fgene.2021.607471] [Citation(s) in RCA: 23] [Impact Index Per Article: 5.8] [Received: 09/17/2020] [Accepted: 03/05/2021] [Indexed: 12/24/2022]
Abstract
With the progress of medical technology, the biomedical field has entered the era of big data, and computational medicine has emerged on this basis, driven by artificial intelligence technology. The useful information contained in these large biomedical datasets must be extracted to promote the development of precision medicine. Traditionally, machine learning methods are used to mine biomedical data for features; these methods generally rely on feature engineering and expert domain knowledge, requiring considerable time and human resources. Unlike traditional approaches, deep learning, a cutting-edge branch of machine learning, can automatically learn complex and robust features from raw data without the need for feature engineering. We review the applications of deep learning in medical imaging, electronic health records, genomics, and drug development, which suggest that deep learning has a clear advantage in making full use of biomedical data and improving the level of medical care. Deep learning plays an increasingly important role in health care and has broad prospects for application. However, problems and challenges remain for deep learning in computational medicine, including insufficient data, interpretability, data privacy, and heterogeneity. Analysis and discussion of these problems provide a reference for improving the application of deep learning in health care.
Affiliation(s)
- Sijie Yang
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Fei Zhu
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Xinghong Ling
- School of Computer Science and Technology, Soochow University, Suzhou, China
- WenZheng College of Soochow University, Suzhou, China
| | - Quan Liu
- School of Computer Science and Technology, Soochow University, Suzhou, China
| | - Peiyao Zhao
- School of Computer Science and Technology, Soochow University, Suzhou, China
|
29
|
Cheng J, Çelik MH, Kundaje A, Gagneur J. MTSplice predicts effects of genetic variants on tissue-specific splicing. Genome Biol 2021; 22:94. [PMID: 33789710 PMCID: PMC8011109 DOI: 10.1186/s13059-021-02273-7] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 06/06/2020] [Accepted: 01/14/2021] [Indexed: 12/20/2022]
Abstract
We develop the free and open-source model Multi-tissue Splicing (MTSplice) to predict the effects of genetic variants on splicing of cassette exons in 56 human tissues. MTSplice combines MMSplice, which models constitutive regulatory sequences, with a new neural network that models tissue-specific regulatory sequences. MTSplice outperforms MMSplice on predicting tissue-specific variations associated with genetic variants in most tissues of the GTEx dataset, with largest improvements on brain tissues. Furthermore, MTSplice predicts that autism-associated de novo mutations are enriched for variants affecting splicing specifically in the brain. We foresee that MTSplice will aid interpreting variants associated with tissue-specific disorders.
Affiliation(s)
- Jun Cheng
- Department of Informatics, Technical University of Munich, Boltzmannstraße, Garching, 85748, Germany.
| | - Muhammed Hasan Çelik
- Department of Informatics, Technical University of Munich, Boltzmannstraße, Garching, 85748, Germany
| | - Anshul Kundaje
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
| | - Julien Gagneur
- Department of Informatics, Technical University of Munich, Boltzmannstraße, Garching, 85748, Germany
- Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany
- Institute of Human Genetics, Klinikum rechts der Isar, Technical University of Munich, Munich, Germany
|
30
|
Application of deep learning in genomics. Sci China Life Sci 2020; 63:1860-1878. [PMID: 33051704 DOI: 10.1007/s11427-020-1804-5] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 06/13/2020] [Accepted: 08/15/2020] [Indexed: 12/19/2022]
Abstract
In recent years, deep learning has been widely used in diverse fields of research, such as speech recognition, image classification, autonomous driving and natural language processing. Deep learning has showcased dramatically improved performance in complex classification and regression problems, where the intricate structure in the high-dimensional data is difficult to discover using conventional machine learning algorithms. In biology, applications of deep learning are gaining increasing popularity in predicting the structure and function of genomic elements, such as promoters, enhancers, or gene expression levels. In this review paper, we described the basic concepts in machine learning and artificial neural network, followed by elaboration on the workflow of using convolutional neural network in genomics. Then we provided a concise introduction of deep learning applications in genomics and synthetic biology at the levels of DNA, RNA and protein. Finally, we discussed the current challenges and future perspectives of deep learning in genomics.
|
31
|
Jha A, Aicher JK, Gazzara MR, Singh D, Barash Y. Enhanced Integrated Gradients: improving interpretability of deep learning models using splicing codes as a case study. Genome Biol 2020; 21:149. [PMID: 32560708 PMCID: PMC7305616 DOI: 10.1186/s13059-020-02055-7] [Citation(s) in RCA: 27] [Impact Index Per Article: 5.4] [Received: 10/25/2019] [Accepted: 05/22/2020] [Indexed: 01/03/2023]
Abstract
Despite the success and fast adaptation of deep learning models in biomedical domains, their lack of interpretability remains an issue. Here, we introduce Enhanced Integrated Gradients (EIG), a method to identify significant features associated with a specific prediction task. Using RNA splicing prediction as well as digit classification as case studies, we demonstrate that EIG improves upon the original Integrated Gradients method and produces sets of informative features. We then apply EIG to identify A1CF as a key regulator of liver-specific alternative splicing, supporting this finding with subsequent analysis of relevant A1CF functional (RNA-seq) and binding data (PAR-CLIP).
Affiliation(s)
- Anupama Jha
- Department of Computer and Information Science, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, USA
| | - Joseph K Aicher
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
| | - Matthew R Gazzara
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
| | - Deependra Singh
- Department of Computer and Information Science, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, USA
| | - Yoseph Barash
- Department of Computer and Information Science, School of Engineering and Applied Science, University of Pennsylvania, Philadelphia, USA
- Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, USA
|
32
|
Mapping RNA splicing variations in clinically accessible and nonaccessible tissues to facilitate Mendelian disease diagnosis using RNA-seq. Genet Med 2020; 22:1181-1190. [PMID: 32225167 DOI: 10.1038/s41436-020-0780-y] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 08/12/2019] [Revised: 03/03/2020] [Accepted: 03/05/2020] [Indexed: 01/14/2023]
Abstract
PURPOSE: RNA-seq is a promising approach to improve diagnoses by detecting pathogenic aberrations in RNA splicing that are missed by DNA sequencing. RNA-seq is typically performed on clinically accessible tissues (CATs) from blood and skin. RNA tissue specificity makes it difficult to identify aberrations in relevant but nonaccessible tissues (non-CATs). We determined how well RNA-seq from CATs represents splicing in and across genes and non-CATs. METHODS: We quantified RNA splicing in 801 RNA-seq samples from 56 different adult and fetal tissues from the Genotype-Tissue Expression Project (GTEx) and ArrayExpress. We identified genes and splicing events in each non-CAT and determined when RNA-seq in each CAT would inadequately represent them. We developed an online resource, MAJIQ-CAT, for exploring our analysis for specific genes and tissues. RESULTS: In non-CATs, 40.2% of genes have splicing that is inadequately represented by at least one CAT; 6.3% of genes have splicing inadequately represented by all CATs. A majority (52.1%) of inadequately represented genes are lowly expressed in CATs (transcripts per million (TPM) < 1), but 5.8% are inadequately represented despite being well expressed (TPM > 10). CONCLUSION: Many splicing events in non-CATs are inadequately evaluated using RNA-seq from CATs. MAJIQ-CAT allows users to explore which accessible tissues, if any, best represent splicing in genes and tissues of interest.
|
33
|
Vamathevan J, Clark D, Czodrowski P, Dunham I, Ferran E, Lee G, Li B, Madabhushi A, Shah P, Spitzer M, Zhao S. Applications of machine learning in drug discovery and development. Nat Rev Drug Discov 2019; 18:463-477. [PMID: 30976107 DOI: 10.1038/s41573-019-0024-5] [Citation(s) in RCA: 1186] [Impact Index Per Article: 197.7] [Indexed: 02/07/2023]
Abstract
Drug discovery and development pipelines are long, complex and depend on numerous factors. Machine learning (ML) approaches provide a set of tools that can improve discovery and decision making for well-specified questions with abundant, high-quality data. Opportunities to apply ML occur in all stages of drug discovery. Examples include target validation, identification of prognostic biomarkers and analysis of digital pathology data in clinical trials. Applications have ranged in context and methodology, with some approaches yielding accurate predictions and insights. The challenges of applying ML lie primarily with the lack of interpretability and repeatability of ML-generated results, which may limit their application. In all areas, systematic and comprehensive high-dimensional data still need to be generated. With ongoing efforts to tackle these issues, as well as increasing awareness of the factors needed to validate ML approaches, the application of ML can promote data-driven decision making and has the potential to speed up the process and reduce failure rates in drug discovery and development.
Affiliation(s)
- Jessica Vamathevan
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK.
| | - Dominic Clark
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | | | - Ian Dunham
- Open Targets and European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Edgardo Ferran
- European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - George Lee
- Bristol-Myers Squibb, Princeton, NJ, USA
| | - Bin Li
- Takeda Pharmaceuticals International Co., Cambridge, MA, USA
| | - Anant Madabhushi
- Case Western Reserve University, Cleveland, OH, USA
- Louis Stokes Cleveland Veterans Affairs Medical Center, Cleveland, OH, USA
| | | | - Michaela Spitzer
- Open Targets and European Molecular Biology Laboratory, European Bioinformatics Institute, Cambridge, UK
| | - Shanrong Zhao
- Pfizer Worldwide Research and Development, Cambridge, MA, USA
|
34
|
Chen D, Jacob L, Mairal J. Biological sequence modeling with convolutional kernel networks. Bioinformatics 2019; 35:3294-3302. [PMID: 30753280 DOI: 10.1093/bioinformatics/btz094] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Received: 10/05/2018] [Revised: 01/29/2019] [Accepted: 02/06/2019] [Indexed: 11/13/2022]
Abstract
MOTIVATION: The growing number of annotated biological sequences available makes it possible to learn genotype-phenotype relationships from data with increasingly high accuracy. When large quantities of labeled samples are available for training a model, convolutional neural networks can be used to predict the phenotype of unannotated sequences with good accuracy. Unfortunately, their performance on medium- or small-scale datasets is limited, which calls for new data-efficient approaches. RESULTS: We introduce a hybrid approach between convolutional neural networks and kernel methods to model biological sequences. Our method retains the ability of convolutional neural networks to learn data representations that are adapted to a specific task, while the kernel point of view yields algorithms that perform significantly better when the amount of training data is small. We illustrate these advantages for transcription factor binding prediction and protein homology detection, and we demonstrate that our model is also simple to interpret, which is crucial for discovering predictive motifs in sequences. AVAILABILITY AND IMPLEMENTATION: Source code is freely available at https://gitlab.inria.fr/dchen/CKN-seq. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dexiong Chen
- Université Grenoble Alpes, INRIA, CNRS, Grenoble INP, LJK, Grenoble, Isère France
| | - Laurent Jacob
- University of Lyon, Université Lyon 1, CNRS, Laboratoire de Biométrie et Biologie Évolutive UMR 5558, Lyon, Rhône France
| | - Julien Mairal
- Université Grenoble Alpes, INRIA, CNRS, Grenoble INP, LJK, Grenoble, Isère France
|
35
|
Deep Splicing Code: Classifying Alternative Splicing Events Using Deep Learning. Genes (Basel) 2019; 10:587. [PMID: 31374967 PMCID: PMC6722613 DOI: 10.3390/genes10080587] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Received: 05/31/2019] [Revised: 07/20/2019] [Accepted: 07/30/2019] [Indexed: 12/11/2022]
Abstract
Alternative splicing (AS) is the process of combining different parts of the pre-mRNA to produce diverse transcripts and eventually different protein products from a single gene. In the computational biology field, researchers try to understand AS behavior and regulation using computational models known as “Splicing Codes”. The final goal of these algorithms is to make an in-silico prediction of AS outcome from genomic sequence. Here, we develop a deep learning approach, called Deep Splicing Code (DSC), for categorizing the well-studied classes of AS, namely alternatively skipped exons, alternative 5’ss, alternative 3’ss, and constitutively spliced exons, based only on the sequence of the exon junctions. The proposed approach significantly improves the prediction and the obtained results reveal that constitutive exons have distinguishable local characteristics from alternatively spliced exons. Using the motif visualization technique, we show that the trained models learned to search for competitive alternative splice sites as well as motifs of important splicing factors with high precision. Thus, the proposed approach greatly expands the opportunities to improve alternative splicing modeling. In addition, a web-server for AS events prediction has been developed based on the proposed method.
|
36
|
Krawczyk PS, Lipinski L, Dziembowski A. PlasFlow: predicting plasmid sequences in metagenomic data using genome signatures. Nucleic Acids Res 2018; 46:e35. [PMID: 29346586 PMCID: PMC5887522 DOI: 10.1093/nar/gkx1321] [Citation(s) in RCA: 352] [Impact Index Per Article: 58.7] [Received: 06/12/2017] [Accepted: 12/28/2017] [Indexed: 12/14/2022] Open
Abstract
Plasmids are mobile genetic elements that play an important role in the environmental adaptation of microorganisms. Although plasmids are usually analyzed in cultured microorganisms, there is a need for methods that allow for the analysis of pools of plasmids (plasmidomes) in environmental samples. To that end, several molecular biology and bioinformatics methods have been developed; however, they are limited to environments with low diversity and cannot recover large plasmids. Here, we present PlasFlow, a novel tool based on genomic signatures that employs a neural network approach for identification of bacterial plasmid sequences in environmental samples. PlasFlow can recover plasmid sequences from assembled metagenomes without any prior knowledge of the taxonomical or functional composition of samples with an accuracy up to 96%. It can also recover sequences of both circular and linear plasmids and can perform initial taxonomical classification of sequences. Compared to other currently available tools, PlasFlow demonstrated significantly better performance on test datasets. Analysis of two samples from heavy metal-contaminated microbial mats revealed that plasmids may constitute an important fraction of their metagenomes and carry genes involved in heavy-metal homeostasis, proving the pivotal role of plasmids in microorganism adaptation to environmental conditions.
Affiliation(s)
- Pawel S Krawczyk
- Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland; Department of Genetics and Biotechnology, Faculty of Biology, University of Warsaw, Pawinskiego 5a, 02-106 Warsaw, Poland
| | - Leszek Lipinski
- Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland
| | - Andrzej Dziembowski
- Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Pawinskiego 5a, 02-106 Warsaw, Poland; Department of Genetics and Biotechnology, Faculty of Biology, University of Warsaw, Pawinskiego 5a, 02-106 Warsaw, Poland
|
37
|
Eraslan G, Avsec Ž, Gagneur J, Theis FJ. Deep learning: new computational modelling techniques for genomics. Nat Rev Genet 2019; 20:389-403. [PMID: 30971806 DOI: 10.1038/s41576-019-0122-6] [Citation(s) in RCA: 588] [Impact Index Per Article: 98.0] [Indexed: 12/17/2022]
Abstract
As a data-driven science, genomics largely utilizes machine learning to capture dependencies in data and derive novel biological hypotheses. However, the ability to extract new insights from the exponentially increasing volume of genomics data requires more expressive machine learning models. By effectively leveraging large data sets, deep learning has transformed fields such as computer vision and natural language processing. Now, it is becoming the method of choice for many genomics modelling tasks, including predicting the impact of genetic variation on gene regulatory mechanisms such as DNA accessibility and splicing.
Affiliation(s)
- Gökcen Eraslan
- Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany; School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany
| | - Žiga Avsec
- Department of Informatics, Technical University of Munich, Garching, Germany
| | - Julien Gagneur
- Department of Informatics, Technical University of Munich, Garching, Germany.
| | - Fabian J Theis
- Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany; School of Life Sciences Weihenstephan, Technical University of Munich, Freising, Germany; Department of Mathematics, Technical University of Munich, Garching, Germany.
|
38
|
Yi X, Yang Y, Wu P, Xu X, Li W. Alternative splicing events during adipogenesis from hMSCs. J Cell Physiol 2019; 235:304-316. [PMID: 31206189 DOI: 10.1002/jcp.28970] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 05/01/2019] [Revised: 05/28/2019] [Accepted: 05/29/2019] [Indexed: 12/22/2022]
Abstract
Adipogenesis, the developmental process of progenitor cells differentiating into adipocytes, leads to fat metabolic disorders. Alternative splicing (AS), a ubiquitous regulatory mechanism of gene expression, allows the generation of more than one unique messenger RNA (mRNA) species from a single gene. To date, alternative splicing events during adipogenesis from human mesenchymal stem cells (hMSCs) have not been fully elucidated. We performed RNA-Seq coupled with bioinformatics analysis to identify the differentially expressed AS genes and events during adipogenesis from hMSCs. A global survey separately identified 1262, 1181, 1167, and 1227 ASE involved in the most common types of AS, including cassette exon, alt3, and alt5, with cassette exon the most prevalent, at 7, 14, 21, and 28 days during adipogenesis. Interestingly, 122 differentially expressed ASE referred to 118 genes, and three genes, ACTN1 (alt3 and cassette), LRP1 (alt3 and alt5), and LTBP4 (cassette, cassette_multi, and unknown), appeared in multiple AS types of ASE during adipogenesis. Except for all the identified ASE of LRP1 occurred in the extracellular topological domain, alt3 (84) in transmembrane domain significantly differentially expressed was the potential key event during adipogenesis. Overall, we have, for the first time, conducted the global transcriptional profiling during adipogenesis of hMSCs to identify differentially expressed ASE and ASE-related genes. This finding would provide extensive ASE as the regulator of adipogenesis and the potential targets for future molecular research into adipogenesis-related metabolic disorders.
Affiliation(s)
- Xia Yi
- Jiangxi Provincial Key Laboratory of Systems Biomedicine, Jiujiang University, Jiujiang, China
| | - Yunzhong Yang
- Beijing Yuanchuangzhilian Technology Development Co., Ltd, Beijing, China
| | - Ping Wu
- Jiangxi Provincial Key Laboratory of Systems Biomedicine, Jiujiang University, Jiujiang, China
| | - Xiaoyuan Xu
- Jiangxi Provincial Key Laboratory of Systems Biomedicine, Jiujiang University, Jiujiang, China
| | - Weidong Li
- Jiangxi Provincial Key Laboratory of Systems Biomedicine, Jiujiang University, Jiujiang, China
|
39
|
Zhao S. Alternative splicing, RNA-seq and drug discovery. Drug Discov Today 2019; 24:1258-1267. [PMID: 30953866 DOI: 10.1016/j.drudis.2019.03.030] [Citation(s) in RCA: 52] [Impact Index Per Article: 8.7] [Received: 01/08/2019] [Revised: 02/14/2019] [Accepted: 03/28/2019] [Indexed: 12/27/2022]
Abstract
Alternative splicing, hereafter referred to as AS, is an essential component of gene expression regulation that contributes to the diversity of proteomes. Recent developments in RNA sequencing (RNA-seq) technologies, combined with the advent of computational tools, have enabled transcriptome-wide studies of AS at an unprecedented scale and resolution. RNA mis-splicing can cause human disease, and targeting alternative splicing has led to the development of novel therapeutics. Splice variants diversify the repertoire of biomarkers and functionally contribute to drug resistance. Our expanding knowledge of AS variation in human populations holds great promise for improving disease diagnoses and ultimately patient care in the era of sequencing and precision medicine.
Affiliation(s)
- Shanrong Zhao
- Pfizer Worldwide Research and Development, Cambridge, MA 02139, USA.
|
40
|
Cheng J, Nguyen TYD, Cygan KJ, Çelik MH, Fairbrother WG, Avsec Ž, Gagneur J. MMSplice: modular modeling improves the predictions of genetic variant effects on splicing. Genome Biol 2019; 20:48. [PMID: 30823901 PMCID: PMC6396468 DOI: 10.1186/s13059-019-1653-z] [Citation(s) in RCA: 139] [Impact Index Per Article: 23.2] [Received: 10/02/2018] [Accepted: 02/12/2019] [Indexed: 12/15/2022] Open
Abstract
Predicting the effects of genetic variants on splicing is highly relevant for human genetics. We describe the framework MMSplice (modular modeling of splicing) with which we built the winning model of the CAGI5 exon skipping prediction challenge. The MMSplice modules are neural networks scoring exon, intron, and splice sites, trained on distinct large-scale genomics datasets. These modules are combined to predict effects of variants on exon skipping, splice site choice, splicing efficiency, and pathogenicity, with matched or higher performance than state-of-the-art. Our models, available in the repository Kipoi, apply to variants including indels directly from VCF files.
Affiliation(s)
- Jun Cheng
- Department of Informatics, Technical University of Munich, Boltzmannstraße, Garching, 85748 Germany
- Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, München, Germany
| | - Thi Yen Duong Nguyen
- Department of Informatics, Technical University of Munich, Boltzmannstraße, Garching, 85748 Germany
| | - Kamil J. Cygan
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island USA
- Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, Rhode Island USA
| | - Muhammed Hasan Çelik
- Department of Informatics, Technical University of Munich, Boltzmannstraße, Garching, 85748 Germany
| | - William G. Fairbrother
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island USA
- Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, Rhode Island USA
| | - Žiga Avsec
- Department of Informatics, Technical University of Munich, Boltzmannstraße, Garching, 85748 Germany
- Graduate School of Quantitative Biosciences (QBM), Ludwig-Maximilians-Universität München, München, Germany
| | - Julien Gagneur
- Department of Informatics, Technical University of Munich, Boltzmannstraße, Garching, 85748 Germany
|
41
|
Zou J, Huss M, Abid A, Mohammadi P, Torkamani A, Telenti A. A primer on deep learning in genomics. Nat Genet 2019; 51:12-18. [PMID: 30478442 PMCID: PMC11180539 DOI: 10.1038/s41588-018-0295-5] [Citation(s) in RCA: 441] [Impact Index Per Article: 73.5] [Received: 05/14/2018] [Accepted: 09/26/2018] [Indexed: 12/13/2022]
Abstract
Deep learning methods are a class of machine learning techniques capable of identifying highly complex patterns in large datasets. Here, we provide a perspective and primer on deep learning applications for genome analysis. We discuss successful applications in the fields of regulatory genomics, variant calling and pathogenicity scores. We include general guidance for how to effectively use deep learning methods as well as a practical guide to tools and resources. This primer is accompanied by an interactive online tutorial.
Affiliation(s)
- James Zou
- Department of Biomedical Data Science, Stanford University, Palo Alto, CA, USA.
- Chan-Zuckerberg Biohub, San Francisco, CA, USA.
- Department of Electrical Engineering, Stanford University, Palo Alto, CA, USA.
| | - Mikael Huss
- Peltarion, Stockholm, Sweden
- Department of Learning, Informatics, Management and Ethics, Karolinska Institutet, Stockholm, Sweden
| | - Abubakar Abid
- Department of Electrical Engineering, Stanford University, Palo Alto, CA, USA
| | - Pejman Mohammadi
- Scripps Research Translational Institute, La Jolla, CA, USA
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Ali Torkamani
- Scripps Research Translational Institute, La Jolla, CA, USA
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA
| | - Amalio Telenti
- Scripps Research Translational Institute, La Jolla, CA, USA.
- Department of Integrative Structural and Computational Biology, The Scripps Research Institute, La Jolla, CA, USA.
|
42
|
Wai H, Douglas AGL, Baralle D. RNA splicing analysis in genomic medicine. Int J Biochem Cell Biol 2018; 108:61-71. [PMID: 30594648 DOI: 10.1016/j.biocel.2018.12.009] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 10/07/2018] [Revised: 12/03/2018] [Accepted: 12/14/2018] [Indexed: 12/13/2022]
Abstract
High-throughput next-generation sequencing technologies have led to a rapid increase in the number of sequence variants identified in clinical practice via diagnostic genetic tests. Current bioinformatic analysis pipelines fail to take adequate account of the possible splicing effects of such variants, particularly where variants fall outwith canonical splice site sequences, and consequently the pathogenicity of such variants may often be missed. The regulation of splicing is highly complex and as a result, in silico prediction tools lack sufficient sensitivity and specificity for reliable use. Variants of all kinds can be linked to aberrant splicing in disease and the need for correct identification and diagnosis grows ever more crucial as novel splice-switching antisense oligonucleotide therapies start to enter clinical usage. RT-PCR provides a useful targeted assay of the splicing effects of identified variants, while minigene assays, massive parallel reporter assays and animal models can also be used for more detailed study of a particular splicing system, given enough time and resources. However, RNA-sequencing (RNA-seq) has the potential to be used as a rapid diagnostic tool in genomic medicine. By utilising data science approaches and machine learning, it may prove possible to finally understand and interpret the 'splicing code' and apply this knowledge in human disease diagnostics.
Affiliation(s)
- Htoo Wai
- Human Development and Health, Faculty of Medicine, University of Southampton, UK
| | - Andrew G L Douglas
- Human Development and Health, Faculty of Medicine, University of Southampton, UK; Wessex Clinical Genetics Service, University Hospital Southampton NHS Foundation Trust, Southampton, UK
| | - Diana Baralle
- Human Development and Health, Faculty of Medicine, University of Southampton, UK; Wessex Clinical Genetics Service, University Hospital Southampton NHS Foundation Trust, Southampton, UK.
|
43
|
Ashraf U, Benoit-Pilven C, Lacroix V, Navratil V, Naffakh N. Advances in Analyzing Virus-Induced Alterations of Host Cell Splicing. Trends Microbiol 2018; 27:268-281. [PMID: 30577974 DOI: 10.1016/j.tim.2018.11.004] [Citation(s) in RCA: 46] [Impact Index Per Article: 6.6] [Received: 07/17/2018] [Revised: 10/19/2018] [Accepted: 11/09/2018] [Indexed: 12/14/2022]
Abstract
Alteration of host cell splicing is a common feature of many viral infections which is underappreciated because of the complexity and technical difficulty of studying alternative splicing (AS) regulation. Recent advances in RNA sequencing technologies revealed that up to several hundreds of host genes can show altered mRNA splicing upon viral infection. The observed changes in AS events can be either a direct consequence of viral manipulation of the host splicing machinery or result indirectly from the virus-induced innate immune response or cellular damage. Analysis at a higher resolution with single-cell RNAseq, and at a higher scale with the integration of multiple omics data sets in a systems biology perspective, will be needed to further comprehend this complex facet of virus-host interactions.
Affiliation(s)
- Usama Ashraf
- Institut Pasteur, Unité de Génétique Moléculaire des Virus à ARN, Département de Virologie, F-75015 Paris, France; CNRS UMR3569, F-75015 Paris, France; Université Paris Diderot, Sorbonne Paris Cité EA302, F-75015 Paris, France
| | - Clara Benoit-Pilven
- INSERM U1028; CNRS UMR5292, Lyon Neuroscience Research Center, Genetic of Neuro-development Anomalies Team, F-69000 Lyon, France; Université Claude Bernard Lyon 1, CNRS UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France; EPI ERABLE, INRIA Grenoble Rhône-Alpes, F-38330 Montbonnot Saint-Martin, France
| | - Vincent Lacroix
- Université Claude Bernard Lyon 1, CNRS UMR5558, Laboratoire de Biométrie et Biologie Evolutive, F-69622 Villeurbanne, France; EPI ERABLE, INRIA Grenoble Rhône-Alpes, F-38330 Montbonnot Saint-Martin, France
| | - Vincent Navratil
- PRABI, Rhône Alpes Bioinformatics Center, UCBL, Université Claude Bernard Lyon 1, F-69000 Lyon, France; European Virus Bioinformatics Center, Leutragraben 1, D-07743 Jena, Germany
| | - Nadia Naffakh
- Institut Pasteur, Unité de Génétique Moléculaire des Virus à ARN, Département de Virologie, F-75015 Paris, France; CNRS UMR3569, F-75015 Paris, France; Université Paris Diderot, Sorbonne Paris Cité EA302, F-75015 Paris, France.
|
44
|
Park E, Pan Z, Zhang Z, Lin L, Xing Y. The Expanding Landscape of Alternative Splicing Variation in Human Populations. Am J Hum Genet 2018; 102:11-26. [PMID: 29304370 PMCID: PMC5777382 DOI: 10.1016/j.ajhg.2017.11.002] [Citation(s) in RCA: 247] [Impact Index Per Article: 35.3] [Received: 08/30/2017] [Accepted: 11/03/2017] [Indexed: 12/16/2022] Open
Abstract
Alternative splicing is a tightly regulated biological process by which the number of gene products for any given gene can be greatly expanded. Genomic variants in splicing regulatory sequences can disrupt splicing and cause disease. Recent developments in sequencing technologies and computational biology have allowed researchers to investigate alternative splicing at an unprecedented scale and resolution. Population-scale transcriptome studies have revealed many naturally occurring genetic variants that modulate alternative splicing and consequently influence phenotypic variability and disease susceptibility in human populations. Innovations in experimental and computational tools such as massively parallel reporter assays and deep learning have enabled the rapid screening of genomic variants for their causal impacts on splicing. In this review, we describe technological advances that have greatly increased the speed and scale at which discoveries are made about the genetic variation of alternative splicing. We summarize major findings from population transcriptomic studies of alternative splicing and discuss the implications of these findings for human genetics and medicine.
Affiliation(s)
- Eddie Park
- Department of Microbiology, Immunology, & Molecular Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Zhicheng Pan
- Bioinformatics Interdepartmental Graduate Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Zijun Zhang
- Bioinformatics Interdepartmental Graduate Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Lan Lin
- Department of Microbiology, Immunology, & Molecular Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
| | - Yi Xing
- Department of Microbiology, Immunology, & Molecular Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA; Bioinformatics Interdepartmental Graduate Program, University of California, Los Angeles, Los Angeles, CA 90095, USA.
|
45
|
Therapeutic Applications of Targeted Alternative Splicing to Cancer Treatment. Int J Mol Sci 2017; 19:ijms19010075. [PMID: 29283381 PMCID: PMC5796025 DOI: 10.3390/ijms19010075] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 12/11/2017] [Revised: 12/22/2017] [Accepted: 12/24/2017] [Indexed: 12/16/2022] Open
Abstract
A growing body of studies has documented the pathological influence of impaired alternative splicing (AS) events on numerous diseases, including cancer. In addition, the generation of alternatively spliced isoforms is frequently noted to result in drug resistance in many cancer therapies. To gain comprehensive insights into the impacts of AS events on cancer biology and therapeutic developments, this paper highlights recent findings regarding the therapeutic routes of targeting alternative-spliced isoforms and splicing regulators to treatment strategies for distinct cancers.
|
46
|
Korotcov A, Tkachenko V, Russo DP, Ekins S. Comparison of Deep Learning With Multiple Machine Learning Methods and Metrics Using Diverse Drug Discovery Data Sets. Mol Pharm 2017; 14:4462-4475. [PMID: 29096442 PMCID: PMC5741413 DOI: 10.1021/acs.molpharmaceut.7b00578] [Citation(s) in RCA: 195] [Impact Index Per Article: 24.4] [Indexed: 11/30/2022]
Abstract
Machine learning methods have been applied to many data sets in pharmaceutical research for several decades. The relative ease and availability of fingerprint type molecular descriptors paired with Bayesian methods resulted in the widespread use of this approach for a diverse array of end points relevant to drug discovery. Deep learning is the latest machine learning algorithm attracting attention for many pharmaceutical applications from docking to virtual screening. Deep learning is based on an artificial neural network with multiple hidden layers and has found considerable traction for many artificial intelligence applications. We have previously suggested the need for a comparison of different machine learning methods with deep learning across an array of varying data sets that is applicable to pharmaceutical research. End points relevant to pharmaceutical research include absorption, distribution, metabolism, excretion, and toxicity (ADME/Tox) properties, as well as activity against pathogens and drug discovery data sets. In this study, we have used data sets for solubility, probe-likeness, hERG, KCNQ1, bubonic plague, Chagas, tuberculosis, and malaria to compare different machine learning methods using FCFP6 fingerprints. These data sets represent whole cell screens, individual proteins, physicochemical properties as well as a data set with a complex end point. Our aim was to assess whether deep learning offered any improvement in testing when assessed using an array of metrics including AUC, F1 score, Cohen's kappa, Matthews correlation coefficient and others. Based on ranked normalized scores for the metrics or data sets, Deep Neural Networks (DNN) ranked higher than SVM, which in turn was ranked higher than all the other machine learning methods. Visualizing these properties for training and test sets using radar type plots indicates when models are inferior or perhaps over trained. These results also suggest the need for assessing deep learning further using multiple metrics with much larger scale comparisons, prospective testing as well as assessment of different fingerprints and DNN architectures beyond those used.
Affiliation(s)
- Alexandru Korotcov
- Science Data Software, LLC, 14914 Bradwill Court, Rockville, MD 20850, USA
| | - Valery Tkachenko
- Science Data Software, LLC, 14914 Bradwill Court, Rockville, MD 20850, USA
| | - Daniel P Russo
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
- The Rutgers Center for Computational and Integrative Biology, Camden, NJ, 08102, USA
| | - Sean Ekins
- Collaborations Pharmaceuticals, Inc., 840 Main Campus Drive, Lab 3510, Raleigh, NC 27606, USA
|