1
Smaruj PN, Xiao Y, Fudenberg G. Recipes and ingredients for deep learning models of 3D genome folding. Curr Opin Genet Dev 2025; 91:102308. [PMID: 39862604 PMCID: PMC11867851 DOI: 10.1016/j.gde.2024.102308] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2024] [Revised: 12/19/2024] [Accepted: 12/31/2024] [Indexed: 01/27/2025]
Abstract
Three-dimensional genome folding plays roles in gene regulation and disease. In this review, we compare and contrast recent deep learning models for predicting genome contact maps. We survey preprocessing, architecture, training, evaluation, and interpretation methods, highlighting the capabilities and limitations of different models. In each area, we highlight challenges, opportunities, and potential future directions for genome-folding models.
Affiliation(s)
- Paulina N Smaruj
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Yao Xiao
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
- Geoffrey Fudenberg
  - Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
2
Shahid, Hayat M, Alghamdi W, Akbar S, Raza A, Kadir RA, Sarker MR. pACP-HybDeep: predicting anticancer peptides using binary tree growth based transformer and structural feature encoding with deep-hybrid learning. Sci Rep 2025; 15:565. [PMID: 39747941 PMCID: PMC11695694 DOI: 10.1038/s41598-024-84146-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2024] [Accepted: 12/20/2024] [Indexed: 01/04/2025] Open
Abstract
Worldwide, cancer remains a significant health concern due to its high mortality rates. Although numerous traditional therapies and wet-laboratory methods exist for treating cancer-affected cells, these approaches often face limitations, including high costs and substantial side effects. Recently, the high selectivity of peptides has garnered significant attention from scientists due to their reliable targeted actions and minimal adverse effects. Building on the outcomes of existing computational models, we propose a reliable and effective model, pACP-HybDeep, for the accurate prediction of anticancer peptides. In this model, training peptides are numerically encoded using an attention-based ProtBERT-BFD encoder to extract semantic features along with CTDT-based structural information. A k-nearest neighbor-based binary tree growth (BTG) algorithm is then employed to select an optimal feature set from the multi-perspective vector. The selected feature vector is subsequently used to train a CNN + RNN-based deep learning model. The proposed pACP-HybDeep model achieved a training accuracy of 95.33% and an AUC of 0.97. To assess its generalization capability, we evaluated the model on the independent datasets Ind-S1, Ind-S2, and Ind-S3, where it achieved accuracies of 94.92%, 92.26%, and 91.16%, respectively. The demonstrated efficacy and reliability of pACP-HybDeep on test datasets establish it as a valuable tool for researchers in academia and pharmaceutical drug design.
Affiliation(s)
- Shahid
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
- Maqsood Hayat
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
- Wajdi Alghamdi
  - Department of Information Technology, Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
- Shahid Akbar
  - Department of Computer Science, Abdul Wali Khan University Mardan, Mardan, 23200, KP, Pakistan
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, 610054, China
- Ali Raza
  - Department of Computer Science, MY University, Islamabad, 45750, Pakistan
- Rabiah Abdul Kadir
  - Institute of Visual Informatics, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
- Mahidur R Sarker
  - Institute of Visual Informatics, Universiti Kebangsaan Malaysia, 43600, Bangi, Selangor, Malaysia
  - Universidad de Diseño, Innovación y Tecnología, UDIT, Av. Alfonso XIII, 97, 28016, Madrid, Spain
3
Wang Z, Yuan H, Yan J, Liu J. Identification, characterization, and design of plant genome sequences using deep learning. Plant J 2025; 121:e17190. [PMID: 39666835 DOI: 10.1111/tpj.17190] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/03/2024] [Revised: 11/11/2024] [Accepted: 11/23/2024] [Indexed: 12/14/2024]
Abstract
Due to its excellent performance in processing large amounts of data and capturing complex non-linear relationships, deep learning has been widely applied in many fields of plant biology. Here we first review the application of deep learning in analyzing genome sequences to predict gene expression, chromatin interactions, and epigenetic features (open chromatin, transcription factor binding sites, and methylation sites) in plants. Then, current motif mining and functional component design and synthesis based on generative adversarial networks, large models, and attention mechanisms are elaborated in detail. The progress of protein structure and function prediction, genomic prediction, and large model applications based on deep learning is also discussed. Finally, this work provides prospects for the future development of deep learning in plants with regard to multiple omics data, algorithm optimization, large language models, sequence design, and intelligent breeding.
Affiliation(s)
- Zhenye Wang
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, China
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Hao Yuan
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, China
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Jianbing Yan
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
  - Hubei Hongshan Laboratory, Wuhan, 430070, China
- Jianxiao Liu
  - National Key Laboratory of Crop Genetic Improvement, Huazhong Agricultural University, Wuhan, 430070, China
  - Hubei Key Laboratory of Agricultural Bioinformatics, Huazhong Agricultural University, Wuhan, 430070, China
  - College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
  - Hubei Hongshan Laboratory, Wuhan, 430070, China
4
Bréhélin L. Advancing Regulatory Genomics With Machine Learning. Bioinform Biol Insights 2024; 18:11779322241249562. [PMID: 39735654 PMCID: PMC11672376 DOI: 10.1177/11779322241249562] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Accepted: 04/09/2024] [Indexed: 12/31/2024] Open
Abstract
In recent years, several machine learning (ML) approaches have been proposed to predict gene expression signal and chromatin features from the DNA sequence alone. These models are often used to deduce and, to some extent, assess putative new biological insights about gene regulation, and they have led to very interesting advances in regulatory genomics. This article reviews a selection of these methods, ranging from linear models to random forests, kernel methods, and more advanced deep learning models. Specifically, we detail the different techniques and strategies that can be used to extract new gene-regulation hypotheses from these models. Furthermore, because these putative insights need to be validated with wet-lab experiments, we emphasize that it is important to have a measure of confidence associated with the extracted hypotheses. We review the procedures that have been proposed to measure this confidence for the different types of ML models, and we discuss the fact that they do not provide the same kind of information.
5
Horvath J, Jedlicka P, Kratka M, Kubat Z, Kejnovsky E, Lexa M. Detection and classification of long terminal repeat sequences in plant LTR-retrotransposons and their analysis using explainable machine learning. BioData Min 2024; 17:57. [PMID: 39696434 DOI: 10.1186/s13040-024-00410-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2024] [Accepted: 11/22/2024] [Indexed: 12/20/2024] Open
Abstract
BACKGROUND Long terminal repeats (LTRs) represent important parts of LTR retrotransposons and retroviruses found in high copy numbers in a majority of eukaryotic genomes. LTRs contain regulatory sequences essential for the life cycle of the retrotransposon. Previous experimental and sequence studies have provided only limited information about LTR structure and composition, mostly from model systems. To enhance our understanding of these key sequence modules, we focused on the contrasts between LTRs of various retrotransposon families and other genomic regions. Furthermore, this approach can be utilized for the classification and prediction of LTRs.
RESULTS We used machine learning methods suitable for DNA sequence classification and applied them to a large dataset of plant LTR retrotransposon sequences. We trained three machine learning models using (i) traditional model ensembles (Gradient Boosting), (ii) hybrid convolutional/long short-term memory network models, and (iii) a DNA pre-trained transformer-based model using k-mer sequence representation. All three approaches were successful in classifying and isolating LTRs in this data, as well as providing valuable insights into LTR sequence composition. The best classification (expressed as F1 score) achieved for LTR detection was 0.85 using the hybrid network model. The most accurate classification task was superfamily classification (F1=0.89) while the least accurate was family classification (F1=0.74). The trained models were subjected to explainability analysis. Positional analysis identified a mixture of interesting features, many of which had a preferred absolute position within the LTR and/or were biologically relevant, such as a centrally positioned TATA-box regulatory sequence, and TG..CA nucleotide patterns around both LTR edges.
CONCLUSIONS Our results show that the models used here recognized biologically relevant motifs, such as core promoter elements in the LTR detection task, and a development and stress-related subclass of transcription factor binding sites in the family classification task. Explainability analysis also highlighted the importance of 5'- and 3'-edges in LTR identity and revealed the need to analyze more than just dinucleotides at these ends. Our work shows the applicability of machine learning models to regulatory sequence analysis and classification, and demonstrates the important role of the identified motifs in LTR detection.
Affiliation(s)
- Jakub Horvath
  - Faculty of Informatics, Masaryk University, Botanicka 68a, Brno, 60200, Czech Republic
- Pavel Jedlicka
  - Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, Kralovopolska 135, Brno, 61200, Czech Republic
- Marie Kratka
  - Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, Kralovopolska 135, Brno, 61200, Czech Republic
  - National Centre for Biomolecular Research, Faculty of Science, Masaryk University, Kamenice 5, Brno, 62500, Czech Republic
- Zdenek Kubat
  - Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, Kralovopolska 135, Brno, 61200, Czech Republic
- Eduard Kejnovsky
  - Department of Plant Developmental Genetics, Institute of Biophysics of the Czech Academy of Sciences, Kralovopolska 135, Brno, 61200, Czech Republic
- Matej Lexa
  - Faculty of Informatics, Masaryk University, Botanicka 68a, Brno, 60200, Czech Republic
6
Gage JL, Romay MC, Buckler ES. Maize inbreds show allelic variation for diel transcription patterns. bioRxiv 2024:2024.12.16.628400. [PMID: 39763849 PMCID: PMC11702552 DOI: 10.1101/2024.12.16.628400] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/19/2025]
Abstract
Circadian entrainment and external cues can cause gene transcript abundance to oscillate throughout the day, and these patterns of diel transcript oscillation vary across genes and plant species. Less is known about within-species allelic variation for diel patterns of transcript oscillation, or about how regulatory sequence variation influences diel transcription patterns. In this study, we evaluated diel transcript abundance for 24 diverse maize inbred lines. We observed extensive natural variation in diel transcription patterns, with two-fold variation in the number of genes that oscillate over the course of the day. A convolutional neural network trained to predict oscillation from promoter sequence identified sequences previously reported as binding motifs for known circadian clock genes in other plant systems. Genes showing diel transcription patterns that cosegregate with promoter sequence haplotypes are enriched for associations with photoperiod sensitivity and may have been indirect targets of selection as maize was adapted to longer day lengths at higher latitudes. These findings support the idea that cis-regulatory sequence variation influences patterns of gene expression, which in turn can have effects on phenotypic plasticity and local adaptation.
Affiliation(s)
- Joseph L. Gage
  - Department of Crop and Soil Sciences, North Carolina State University, Raleigh, NC 27695
  - NC Plant Sciences Initiative, North Carolina State University, Raleigh, NC, 27606
- M. Cinta Romay
  - Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
- Edward S. Buckler
  - Institute for Genomic Diversity, Cornell University, Ithaca, NY 14853
  - USDA-ARS, Ithaca, NY 14850
  - School of Integrative Plant Science, Plant Breeding and Genetics Section, Cornell University, Ithaca NY 14853
7
McKeever PM, Sababi AM, Sharma R, Xu Z, Xiao S, McGoldrick P, Ketela T, Sato C, Moreno D, Visanji N, Kovacs GG, Keith J, Zinman L, Rogaeva E, Goodarzi H, Bader GD, Robertson J. Single-nucleus transcriptome atlas of orbitofrontal cortex in amyotrophic lateral sclerosis with a deep learning-based decoding of alternative polyadenylation mechanisms. bioRxiv 2024:2023.12.22.573083. [PMID: 38187588 PMCID: PMC10769403 DOI: 10.1101/2023.12.22.573083] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 01/09/2024]
Abstract
Amyotrophic lateral sclerosis (ALS) and frontotemporal lobar degeneration (FTLD) are two age-related and fatal neurodegenerative disorders that lie on a shared disease spectrum. While both disorders involve complex interactions between neuronal and glial cells, the specific cell-type alterations and their contributions to disease pathophysiology remain incompletely understood. Here, we applied single-nucleus RNA sequencing of the orbitofrontal cortex, a region affected in ALS-FTLD, to map cell-type specific transcriptional signatures in C9orf72-related ALS (with and without FTLD) and sporadic ALS cases. Our findings reveal disease- and cell-type-specific transcriptional changes, with neurons exhibiting the most pronounced alterations, primarily affecting mitochondrial function, protein homeostasis, and chromatin remodeling. A comparison with independent datasets from different cortical regions of C9orf72 and sporadic ALS cases showed concordance in several pathways, with neuronal STMN2 and NEFL showing consistent up-regulation between brain regions and disease subtypes. We also interrogated alternative polyadenylation (APA) as an additional layer of transcriptional regulation, demonstrating that APA events are not correlated with identified gene expression changes. To interpret these events, we developed APA-Net, a deep learning model that integrates transcript sequences with RNA-binding protein expression profiles, revealing cell type-specific patterns of APA regulation. Our atlas illuminates cell type-specific pathomechanisms of ALS/FTLD, providing a valuable resource for further investigation.
8
Zhong G, Liu H, Deng L. Ensemble Machine Learning and Predicted Properties Promote Antimicrobial Peptide Identification. Interdiscip Sci 2024; 16:951-965. [PMID: 38972032 DOI: 10.1007/s12539-024-00640-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/22/2024] [Revised: 06/04/2024] [Accepted: 06/07/2024] [Indexed: 07/08/2024]
Abstract
The emergence of antibiotic-resistant microbes raises a pressing demand for novel alternative treatments. One promising alternative is antimicrobial peptides (AMPs), a class of innate immunity mediators within the therapeutic peptide realm. AMPs offer salient advantages such as high specificity, cost-effective synthesis, and reduced toxicity. Although computational methods for identifying potential AMPs have been proposed alongside the rapid development of artificial intelligence techniques, there is still ample room to improve their performance. This study proposes a predictive framework that ensembles deep learning and statistical learning methods to screen peptides for antimicrobial activity. We integrate multiple LightGBM classifiers and convolutional neural networks that leverage various sequential, structural and physicochemical properties predicted from the residue sequences by diverse machine learning paradigms. Comparative experiments show that our method outperforms other state-of-the-art approaches on an independent test dataset in terms of representative performance measures. We also analyse discrimination quality under different kinds of attribute information, which reveals that combining multiple features can improve prediction. In addition, a case study illustrates the favorable identification performance. We establish a web application at http://amp.denglab.org to provide convenient usage of our proposal and make the predictive framework, source code, and datasets publicly accessible at https://github.com/researchprotein/amp.
Affiliation(s)
- Guolun Zhong
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
- Hui Liu
  - College of Computer and Information Engineering, Nanjing Tech University, Nanjing, 211816, China
- Lei Deng
  - School of Computer Science and Engineering, Central South University, Changsha, 410083, China
9
La Fleur A, Shi Y, Seelig G. Decoding biology with massively parallel reporter assays and machine learning. Genes Dev 2024; 38:843-865. [PMID: 39362779 PMCID: PMC11535156 DOI: 10.1101/gad.351800.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/05/2024]
Abstract
Massively parallel reporter assays (MPRAs) are powerful tools for quantifying the impacts of sequence variation on gene expression. Reading out molecular phenotypes with sequencing enables interrogating the impact of sequence variation beyond genome scale. Machine learning models integrate and codify information learned from MPRAs and enable generalization by predicting sequences outside the training data set. Models can provide a quantitative understanding of cis-regulatory codes controlling gene expression, enable variant stratification, and guide the design of synthetic regulatory elements for applications from synthetic biology to mRNA and gene therapy. This review focuses on cis-regulatory MPRAs, particularly those that interrogate cotranscriptional and post-transcriptional processes: alternative splicing, cleavage and polyadenylation, translation, and mRNA decay.
Affiliation(s)
- Alyssa La Fleur
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
- Yongsheng Shi
  - Department of Microbiology and Molecular Genetics, School of Medicine, University of California, Irvine, Irvine, California 92697, USA
- Georg Seelig
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, Washington 98195, USA
  - Department of Electrical & Computer Engineering, University of Washington, Seattle, Washington 98195, USA
10
John C, Sahoo J, Sajan IK, Madhavan M, Mathew OK. CNN-BLSTM based deep learning framework for eukaryotic kinome classification: An explainability based approach. Comput Biol Chem 2024; 112:108169. [PMID: 39137619 DOI: 10.1016/j.compbiolchem.2024.108169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2024] [Revised: 05/08/2024] [Accepted: 08/03/2024] [Indexed: 08/15/2024]
Abstract
Classification of protein families from their sequences is an enduring task in proteomics and related studies. Numerous deep learning models have been developed to tackle this challenge, but owing to their black-box character they still fall short in reliability. Here, we present a novel explainability pipeline that explains the pivotal decisions of the deep learning model on the classification of the eukaryotic kinome. Based on a comparative and experimental analysis of the most cutting-edge deep learning algorithms, the best deep learning model, CNN-BLSTM, was chosen to classify the eight eukaryotic kinase sequences into their corresponding families. As a substitute for the conventional class activation map-based interpretation of CNN-based models in the domain, we have cascaded the Grad-CAM and Integrated Gradients (IG) explainability methods for improved and responsible results. To ensure the trustworthiness of the classifier, we masked the kinase domain traces identified by the explainability pipeline and observed a class-specific drop in F1-score from 0.96 to 0.76. In compliance with the Explainable AI paradigm, our results are promising and contribute to enhancing the trustworthiness of deep learning models for biological sequence-associated studies.
Affiliation(s)
- Chinju John
  - Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
- Jayakrushna Sahoo
  - Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
- Irish K Sajan
  - Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
- Manu Madhavan
  - Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
- Oommen K Mathew
  - Department of Computer Science and Engineering, Indian Institute of Information Technology Kottayam, Kottayam, 686635, Kerala, India
11
Tang Z, Somia N, Yu Y, Koo PK. Evaluating the representational power of pre-trained DNA language models for regulatory genomics. bioRxiv 2024:2024.02.29.582810. [PMID: 38464101 PMCID: PMC10925287 DOI: 10.1101/2024.02.29.582810] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
The emergence of genomic language models (gLMs) offers an unsupervised approach to learning a wide diversity of cis-regulatory patterns in the non-coding genome without requiring labels of functional activity generated by wet-lab experiments. Previous evaluations have shown that pre-trained gLMs can be leveraged to improve predictive performance across a broad range of regulatory genomics tasks, albeit using relatively simple benchmark datasets and baseline models. Since the gLMs in these studies were tested upon fine-tuning their weights for each downstream task, determining whether gLM representations embody a foundational understanding of cis-regulatory biology remains an open question. Here we evaluate the representational power of pre-trained gLMs to predict and interpret cell-type-specific functional genomics data that span DNA and RNA regulation. Our findings suggest that probing the representations of pre-trained gLMs does not offer substantial advantages over conventional machine learning approaches that use one-hot encoded sequences. This work highlights a major gap with current gLMs, raising potential issues in conventional pre-training strategies for the non-coding genome.
Affiliation(s)
- Ziqi Tang
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
- Nirali Somia
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
- Yiyang Yu
  - The Fu Foundation School of Engineering and Applied Science, Columbia University, New York, NY, USA
- Peter K Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, NY, USA
12
Romero R, Menichelli C, Vroland C, Marin JM, Lèbre S, Lecellier CH, Bréhélin L. TFscope: systematic analysis of the sequence features involved in the binding preferences of transcription factors. Genome Biol 2024; 25:187. [PMID: 38987807 PMCID: PMC11514967 DOI: 10.1186/s13059-024-03321-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2022] [Accepted: 06/24/2024] [Indexed: 07/12/2024] Open
Abstract
Characterizing the binding preferences of transcription factors (TFs) in different cell types and conditions is key to understanding how they orchestrate gene expression. Here, we develop TFscope, a machine learning approach that identifies sequence features explaining the binding differences observed between two ChIP-seq experiments targeting either the same TF in two conditions or two TFs with similar motifs (paralogous TFs). TFscope systematically investigates differences in the core motif, nucleotide environment and co-factor motifs, and provides the contribution of each key feature in the two experiments. TFscope was applied to more than 305 ChIP-seq pairs, and several examples are discussed.
Affiliation(s)
- Raphaël Romero
  - LIRMM, Univ Montpellier, CNRS, Montpellier, France
  - IMAG, Univ Montpellier, CNRS, Montpellier, France
- Christophe Vroland
  - LIRMM, Univ Montpellier, CNRS, Montpellier, France
  - Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
- Sophie Lèbre
  - IMAG, Univ Montpellier, CNRS, Montpellier, France
  - AMIS, Université Paul-Valéry-Montpellier 3, Montpellier, France
- Charles-Henri Lecellier
  - LIRMM, Univ Montpellier, CNRS, Montpellier, France
  - Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
13
Zheng W, Fong JHC, Wan YK, Chu AHY, Huang Y, Wong ASL, Ho JWK. Discovery of regulatory motifs in 5' untranslated regions using interpretable multi-task learning models. Cell Syst 2023; 14:1103-1112.e6. [PMID: 38016465 DOI: 10.1016/j.cels.2023.10.011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2023] [Revised: 09/18/2023] [Accepted: 10/31/2023] [Indexed: 11/30/2023]
Abstract
The sequence in the 5' untranslated regions (UTRs) is known to affect mRNA translation rates. However, the underlying regulatory grammar remains elusive. Here, we propose MTtrans, a multi-task translation rate predictor capable of learning common sequence patterns from datasets across various experimental techniques. The core premise is that common motifs are more likely to be genuinely involved in translation control. MTtrans outperforms existing methods in both accuracy and the ability to capture transferable motifs across species, highlighting its strength in identifying evolutionarily conserved sequence motifs. Our independent fluorescence-activated cell sorting coupled with deep sequencing (FACS-seq) experiment validates the impact of most motifs identified by MTtrans. Additionally, we introduce "GRU-rewiring," a technique to interpret the hidden states of the recurrent units. Gated recurrent unit (GRU)-rewiring allows us to identify regulatory element-enriched positions and examine the local effects of 5' UTR mutations. MTtrans is a powerful tool for deciphering the translation regulatory motifs.
Affiliation(s)
- Weizhong Zheng
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- John H C Fong
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Yuk Kei Wan
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
- Athena H Y Chu
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
  - Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
- Yuanhua Huang
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
  - Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China
  - Center for Translational Stem Cell Biology, Hong Kong Science and Technology Park, Hong Kong SAR, China
- Alan S L Wong
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
  - Centre for Oncology and Immunology, Hong Kong Science Park, Hong Kong SAR, China
  - Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong SAR, China
- Joshua W K Ho
  - School of Biomedical Sciences, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Hong Kong SAR, China
  - Laboratory of Data Discovery for Health (D24H) Limited, Hong Kong Science Park, Hong Kong SAR, China
14
Klie A, Laub D, Talwar JV, Stites H, Jores T, Solvason JJ, Farley EK, Carter H. Predictive analyses of regulatory sequences with EUGENe. Nat Comput Sci 2023; 3:946-956. [PMID: 38177592 PMCID: PMC10768637 DOI: 10.1038/s43588-023-00544-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Accepted: 09/27/2023] [Indexed: 01/06/2024]
Abstract
Deep learning has become a popular tool to study cis-regulatory function. Yet efforts to design software for deep-learning analyses in regulatory genomics that are findable, accessible, interoperable and reusable (FAIR) have fallen short of fully meeting these criteria. Here we present elucidating the utility of genomic elements with neural nets (EUGENe), a FAIR toolkit for the analysis of genomic sequences with deep learning. EUGENe consists of a set of modules and subpackages for executing the key functionality of a genomics deep learning workflow: (1) extracting, transforming and loading sequence data from many common file formats; (2) instantiating, initializing and training diverse model architectures; and (3) evaluating and interpreting model behavior. We designed EUGENe as a simple, flexible and extensible interface for streamlining and customizing end-to-end deep-learning sequence analyses, and illustrate these principles through application of the toolkit to three predictive modeling tasks. We hope that EUGENe represents a springboard towards a collaborative ecosystem for deep-learning applications in genomics research.
Affiliation(s)
- Adam Klie
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
| | - David Laub
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
| | - James V Talwar
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
| | | | - Tobias Jores
- Department of Genome Sciences, University of Washington, Seattle, WA, USA
| | - Joe J Solvason
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
- Department of Molecular Biology, University of California San Diego, La Jolla, CA, USA
| | - Emma K Farley
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
- Department of Molecular Biology, University of California San Diego, La Jolla, CA, USA
| | - Hannah Carter
- Department of Medicine, University of California San Diego, La Jolla, CA, USA.
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA.
| |
|
15
|
Zhang J, Liu B, Wu J, Wang Z, Li J. DeepCAC: a deep learning approach on DNA transcription factors classification based on multi-head self-attention and concatenate convolutional neural network. BMC Bioinformatics 2023; 24:345. [PMID: 37723425 PMCID: PMC10506269 DOI: 10.1186/s12859-023-05469-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/20/2023] [Accepted: 09/06/2023] [Indexed: 09/20/2023] Open
Abstract
Understanding gene expression processes necessitates the accurate classification and identification of transcription factors, which is supported by high-throughput sequencing technologies. However, these techniques suffer from inherent limitations such as time consumption and high costs. To address these challenges, the field of bioinformatics has increasingly turned to deep learning technologies for analyzing gene sequences. Nevertheless, the pursuit of improved experimental results has led to the inclusion of numerous complex analysis function modules, resulting in models with a growing number of parameters. To overcome these limitations, we propose a novel approach for analyzing DNA transcription factor sequences, named DeepCAC. This method leverages deep convolutional neural networks with a multi-head self-attention mechanism. By employing convolutional neural networks, it can effectively capture local hidden features in the sequences. Simultaneously, the multi-head self-attention mechanism enhances the identification of hidden features with long-range dependencies. This approach reduces the overall number of parameters in the model while harnessing the computational power of sequence data from multi-head self-attention. Through training with labeled data, experiments demonstrate that this approach significantly improves performance while requiring fewer parameters compared to existing methods. Additionally, the effectiveness of our approach is validated in accurately predicting DNA transcription factor sequences.
Affiliation(s)
- Jidong Zhang
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Bo Liu
- School of Mathematical and Computational Sciences, Massey University, Auckland, 0745, New Zealand.
| | - Jiahui Wu
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Zhihan Wang
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| | - Jianqiang Li
- Faculty of Information Technology, Beijing University of Technology, Beijing, 100124, China
| |
|
16
|
Hepkema J, Lee NK, Stewart BJ, Ruangroengkulrith S, Charoensawan V, Clatworthy MR, Hemberg M. Predicting the impact of sequence motifs on gene regulation using single-cell data. Genome Biol 2023; 24:189. [PMID: 37582793 PMCID: PMC10426127 DOI: 10.1186/s13059-023-03021-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2022] [Accepted: 07/21/2023] [Indexed: 08/17/2023] Open
Abstract
The binding of transcription factors at proximal promoters and distal enhancers is central to gene regulation. Identifying regulatory motifs and quantifying their impact on expression remains challenging. Using a convolutional neural network trained on single-cell data, we infer putative regulatory motifs and cell type-specific importance. Our model, scover, explains 29% of the variance in gene expression in multiple mouse tissues. Applying scover to distal enhancers identified using scATAC-seq from the developing human brain, we identify cell type-specific motif activities in distal enhancers. Scover can identify regulatory motifs and their importance from single-cell data where all parameters and outputs are easily interpretable.
Affiliation(s)
- Jacob Hepkema
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Nicholas Keone Lee
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- The Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
| | - Benjamin J Stewart
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Molecular Immunity Unit, Department of Medicine, University of Cambridge, Cambridge, CB2 0QQ, UK
- Cambridge University Hospitals NHS Foundation Trust and NIHR Cambridge Biomedical Research Centre, Cambridge, CB2 0QQ, UK
| | - Siwat Ruangroengkulrith
- Department of Biochemistry, Faculty of Science, Mahidol University, Bangkok, 10400, Thailand
| | - Varodom Charoensawan
- Department of Biochemistry, Faculty of Science, Mahidol University, Bangkok, 10400, Thailand
- Integrative Computational BioScience (ICBS) Center, Mahidol University, Nakhon Pathom, 7310, Thailand
- Systems Biology of Diseases Research Unit, Faculty of Science, Mahidol University, Bangkok, 10400, Thailand
| | - Menna R Clatworthy
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Molecular Immunity Unit, Department of Medicine, University of Cambridge, Cambridge, CB2 0QQ, UK
- Cambridge University Hospitals NHS Foundation Trust and NIHR Cambridge Biomedical Research Centre, Cambridge, CB2 0QQ, UK
| | - Martin Hemberg
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK.
- The Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK.
- Gene Lay Institute of Immunology and Inflammation, Brigham and Women's Hospital, Massachusetts General Hospital, and Harvard Medical School, Boston, MA, 02115, USA.
| |
|
17
|
Nowling RJ, Njoya K, Peters JG, Riehle MM. Prediction accuracy of regulatory elements from sequence varies by functional sequencing technique. Front Cell Infect Microbiol 2023; 13:1182567. [PMID: 37600946 PMCID: PMC10433755 DOI: 10.3389/fcimb.2023.1182567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 07/10/2023] [Indexed: 08/22/2023] Open
Abstract
Introduction Various sequencing based approaches are used to identify and characterize the activities of cis-regulatory elements in a genome-wide fashion. Some of these techniques rely on indirect markers such as histone modifications (ChIP-seq with histone antibodies) or chromatin accessibility (ATAC-seq, DNase-seq, FAIRE-seq), while other techniques use direct measures such as episomal assays measuring the enhancer properties of DNA sequences (STARR-seq) and direct measurement of the binding of transcription factors (ChIP-seq with transcription factor-specific antibodies). The activities of cis-regulatory elements such as enhancers, promoters, and repressors are determined by their sequence and secondary processes such as chromatin accessibility, DNA methylation, and bound histone markers. Methods Here, machine learning models are employed to evaluate the accuracy with which cis-regulatory elements identified by various commonly used sequencing techniques can be predicted by their underlying sequence alone to distinguish between cis-regulatory activity that is reflective of sequence content versus secondary processes. Results and discussion Models trained and evaluated on D. melanogaster sequences identified through DNase-seq and STARR-seq are significantly more accurate than models trained on sequences identified by H3K4me1, H3K4me3, and H3K27ac ChIP-seq, FAIRE-seq, and ATAC-seq. These results suggest that the activity detected by DNase-seq and STARR-seq can be largely explained by underlying DNA sequence, independent of secondary processes. Experimentally, a subset of DNase-seq and H3K4me1 ChIP-seq sequences were tested for enhancer activity using luciferase assays and compared with previous tests performed on STARR-seq sequences. The experimental data indicated that STARR-seq sequences are substantially enriched for enhancer-specific activity, while the DNase-seq and H3K4me1 ChIP-seq sequences are not. 
Taken together, these results indicate that the DNase-seq approach identifies a broad class of regulatory elements of which enhancers are a subset and the associated data are appropriate for training models for detecting regulatory activity from sequence alone, STARR-seq data are best for training enhancer-specific sequence models, and H3K4me1 ChIP-seq data are not well suited for training and evaluating sequence-based models for cis-regulatory element prediction.
Affiliation(s)
- Ronald J. Nowling
- Electrical Engineering and Computer Science, Milwaukee School of Engineering, Milwaukee, WI, United States
| | - Kimani Njoya
- Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States
| | - John G. Peters
- Electrical Engineering and Computer Science, Milwaukee School of Engineering, Milwaukee, WI, United States
| | - Michelle M. Riehle
- Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States
| |
|
18
|
Balcı AT, Ebeid MM, Benos PV, Kostka D, Chikina M. An intrinsically interpretable neural network architecture for sequence-to-function learning. Bioinformatics 2023; 39:i413-i422. [PMID: 37387140 PMCID: PMC10311317 DOI: 10.1093/bioinformatics/btad271] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Sequence-based deep learning approaches have been shown to predict a multitude of functional genomic readouts, including regions of open chromatin and RNA expression of genes. However, a major limitation of current methods is that model interpretation relies on computationally demanding post hoc analyses, and even then, one can often not explain the internal mechanics of highly parameterized models. Here, we introduce a deep learning architecture called totally interpretable sequence-to-function model (tiSFM). tiSFM improves upon the performance of standard multilayer convolutional models while using fewer parameters. Additionally, while tiSFM is itself technically a multilayer neural network, internal model parameters are intrinsically interpretable in terms of relevant sequence motifs. RESULTS We analyze published open chromatin measurements across hematopoietic lineage cell-types and demonstrate that tiSFM outperforms a state-of-the-art convolutional neural network model custom-tailored to this dataset. We also show that it correctly identifies context-specific activities of transcription factors with known roles in hematopoietic differentiation, including Pax5 and Ebf1 for B-cells, and Rorc for innate lymphoid cells. tiSFM's model parameters have biologically meaningful interpretations, and we show the utility of our approach on a complex task of predicting the change in epigenetic state as a function of developmental transition. AVAILABILITY AND IMPLEMENTATION The source code, including scripts for the analysis of key findings, can be found at https://github.com/boooooogey/ATAConv, implemented in Python.
Affiliation(s)
- Ali Tuğrul Balcı
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, PA 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
| | - Mark Maher Ebeid
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, PA 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
| | - Panayiotis V Benos
- Department of Epidemiology, University of Florida, Gainesville, FL 32610, United States
| | - Dennis Kostka
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, PA 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
- Department of Developmental Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
| | - Maria Chikina
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, PA 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA 15213, United States
| |
|
19
|
Novakovsky G, Fornes O, Saraswat M, Mostafavi S, Wasserman WW. ExplaiNN: interpretable and transparent neural networks for genomics. Genome Biol 2023; 24:154. [PMID: 37370113 DOI: 10.1186/s13059-023-02985-y] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 11/20/2022] [Accepted: 06/12/2023] [Indexed: 06/29/2023] Open
Abstract
Deep learning models such as convolutional neural networks (CNNs) excel in genomic tasks but lack interpretability. We introduce ExplaiNN, which combines the expressiveness of CNNs with the interpretability of linear models. ExplaiNN can predict TF binding, chromatin accessibility, and de novo motifs, achieving performance comparable to state-of-the-art methods. Its predictions are transparent, providing global (cell state level) as well as local (individual sequence level) biological insights into the data. ExplaiNN can serve as a plug-and-play platform for pretrained models and annotated position weight matrices. ExplaiNN aims to accelerate the adoption of deep learning in genomic sequence analysis by domain experts.
Affiliation(s)
- Gherman Novakovsky
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
| | - Oriol Fornes
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
| | - Manu Saraswat
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
- Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
| | - Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington (UW), Seattle, USA
| | - Wyeth W Wasserman
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada.
| |
|
20
|
Smith GD, Ching WH, Cornejo-Páramo P, Wong ES. Decoding enhancer complexity with machine learning and high-throughput discovery. Genome Biol 2023; 24:116. [PMID: 37173718 PMCID: PMC10176946 DOI: 10.1186/s13059-023-02955-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 07/18/2022] [Accepted: 04/28/2023] [Indexed: 05/15/2023] Open
Abstract
Enhancers are genomic DNA elements controlling spatiotemporal gene expression. Their flexible organization and functional redundancies make deciphering their sequence-function relationships challenging. This article provides an overview of the current understanding of enhancer organization and evolution, with an emphasis on factors that influence these relationships. Technological advancements, particularly in machine learning and synthetic biology, are discussed in light of how they provide new ways to understand this complexity. Exciting opportunities lie ahead as we continue to unravel the intricacies of enhancer function.
Affiliation(s)
- Gabrielle D Smith
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
| | - Wan Hern Ching
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
| | - Paola Cornejo-Páramo
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
| | - Emily S Wong
- Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia.
- School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia.
| |
|
21
|
Majdandzic A, Rajesh C, Koo PK. Correcting gradient-based interpretations of deep neural networks for genomics. Genome Biol 2023; 24:109. [PMID: 37161475 PMCID: PMC10169356 DOI: 10.1186/s13059-023-02956-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 11/04/2022] [Accepted: 04/28/2023] [Indexed: 05/11/2023] Open
Abstract
Post hoc attribution methods can provide insights into the learned patterns from deep neural networks (DNNs) trained on high-throughput functional genomics data. However, in practice, their resultant attribution maps can be challenging to interpret due to spurious importance scores for seemingly arbitrary nucleotides. Here, we identify a previously overlooked attribution noise source that arises from how DNNs handle one-hot encoded DNA. We demonstrate this noise is pervasive across various genomic DNNs and introduce a statistical correction that effectively reduces it, leading to more reliable attribution maps. Our approach represents a promising step towards gaining meaningful insights from DNNs in regulatory genomics.
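The abstract above does not spell out the statistical correction. As an illustration only, the sketch below assumes it amounts to subtracting, at each sequence position, the mean of the gradient across the four nucleotide channels, so that the component pointing off the one-hot simplex is removed. The function name `correct_gradients` and the example values are hypothetical and are not taken from the authors' code.

```python
def correct_gradients(grads):
    """Subtract the per-position mean across the nucleotide channels.

    grads: list of L positions, each a list of 4 gradient values (A, C, G, T).
    Assumption: this mean-subtraction stands in for the correction the
    abstract refers to; it is a sketch, not the published implementation.
    """
    corrected = []
    for position in grads:
        mean = sum(position) / len(position)
        corrected.append([g - mean for g in position])
    return corrected


# Hypothetical gradients for a 3-nucleotide input (values are made up).
raw = [
    [0.9, 0.5, 0.5, 0.5],   # signal on A plus a shared offset
    [0.2, -0.2, 0.0, 0.0],  # already centered
    [1.0, 1.0, 1.0, 1.0],   # pure shared offset, no nucleotide preference
]
corrected = correct_gradients(raw)
for before, after in zip(raw, corrected):
    print(before, "->", after)
```

Under this assumption, a position whose four gradient values are identical carries no nucleotide preference and is mapped to zeros, while differences between nucleotides at a position are left unchanged.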
Affiliation(s)
- Antonio Majdandzic
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
| | - Chandana Rajesh
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA
| | - Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, 1 Bungtown Road, Cold Spring Harbor, NY, USA.
| |
|
22
|
Balcı AT, Ebeid MM, Benos PV, Kostka D, Chikina M. An intrinsically interpretable neural network architecture for sequence to function learning. bioRxiv 2023:2023.01.25.525572. [PMID: 36747873 PMCID: PMC9900791 DOI: 10.1101/2023.01.25.525572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]
Abstract
MOTIVATION Sequence-based deep learning approaches have been shown to predict a multitude of functional genomic readouts, including regions of open chromatin and RNA expression of genes. However, a major limitation of current methods is that model interpretation relies on computationally demanding post hoc analyses, and even then, one can often not explain the internal mechanics of highly parameterized models. Here, we introduce a deep learning architecture called tiSFM (totally interpretable sequence to function model). tiSFM improves upon the performance of standard multi-layer convolutional models while using fewer parameters. Additionally, while tiSFM is itself technically a multi-layer neural network, internal model parameters are intrinsically interpretable in terms of relevant sequence motifs. RESULTS We analyze published open chromatin measurements across hematopoietic lineage cell-types and demonstrate that tiSFM outperforms a state-of-the-art convolutional neural network model custom-tailored to this dataset. We also show that it correctly identifies context specific activities of transcription factors with known roles in hematopoietic differentiation, including Pax5 and Ebf1 for B-cells, and Rorc for innate lymphoid cells. tiSFM's model parameters have biologically meaningful interpretations, and we show the utility of our approach on a complex task of predicting the change in epigenetic state as a function of developmental transition. AVAILABILITY AND IMPLEMENTATION The source code, including scripts for the analysis of key findings, can be found at https://github.com/boooooogey/ATAConv, implemented in Python.
Affiliation(s)
- Ali Tuğrul Balcı
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, 15213, United States
| | - Mark Maher Ebeid
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, 15213, United States
| | - Panayiotis V Benos
- Department of Epidemiology, University of Florida, Gainesville, 32610, United States
| | - Dennis Kostka
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, 15213, United States
| | - Maria Chikina
- Joint Carnegie Mellon University-University of Pittsburgh Program in Computational Biology, Pittsburgh, 15213, United States
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, 15213, United States
| |
|
23
|
Koo PK, Ploenzke M, Anand P, Paul S, Majdandzic A. ResidualBind: Uncovering Sequence-Structure Preferences of RNA-Binding Proteins with Deep Neural Networks. Methods Mol Biol 2023; 2586:197-215. [PMID: 36705906 DOI: 10.1007/978-1-0716-2768-6_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/28/2023]
Abstract
Deep neural networks have demonstrated improved performance at predicting sequence specificities of DNA- and RNA-binding proteins. However, it remains unclear why they perform better than previous methods that rely on k-mers and position weight matrices. Here, we highlight a recent deep learning-based software package, called ResidualBind, that analyzes RNA-protein interactions using only RNA sequence as an input feature and performs global importance analysis for model interpretability. We discuss practical considerations for model interpretability to uncover learned sequence motifs and their secondary structure preferences.
Affiliation(s)
- Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA.
| | - Matt Ploenzke
- Department of Biostatistics, Harvard University, Cambridge, MA, USA
| | | | - Steffan Paul
- Bioinformatics Program, Harvard Medical School, Boston, MA, USA
| | - Antonio Majdandzic
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| |
|
24
|
Genome-wide identification and characterization of DNA enhancers with a stacked multivariate fusion framework. PLoS Comput Biol 2022; 18:e1010779. [PMID: 36520922 PMCID: PMC9836277 DOI: 10.1371/journal.pcbi.1010779] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/27/2022] [Revised: 01/12/2023] [Accepted: 11/29/2022] [Indexed: 12/23/2022] Open
Abstract
Enhancers are short non-coding DNA sequences outside of the target promoter regions that can be bound by specific proteins to increase a gene's transcriptional activity, which has a crucial role in the spatiotemporal and quantitative regulation of gene expression. However, enhancers do not have specific sequence motifs or structures, and their scattered distribution in the genome makes the identification of enhancers from human cell lines particularly challenging. Here we present a novel, stacked multivariate fusion framework called SMFM, which enables a comprehensive identification and analysis of enhancers from regulatory DNA sequences as well as their interpretation. Specifically, to characterize the hierarchical relationships of enhancer sequences, multi-source biological information and dynamic semantic information are fused to represent regulatory DNA enhancer sequences. Then, we implement a deep learning-based sequence network to learn the feature representation of the enhancer sequences comprehensively and to extract the implicit relationships in the dynamic semantic information. Ultimately, an ensemble machine learning classifier is trained based on the refined multi-source features and dynamic implicit relations obtained from the deep learning-based sequence network. Benchmarking experiments demonstrated that SMFM significantly outperforms other existing methods using several evaluation metrics. In addition, an independent test set was used to validate the generalization performance of SMFM by comparing it to other state-of-the-art enhancer identification methods. Moreover, we performed motif analysis based on the contribution scores of different bases of enhancer sequences to the final identification results.
Besides, we conducted interpretability analysis of the identified enhancer sequences based on attention weights of EnhancerBERT, a fine-tuned BERT model that provides new insights into exploring the gene semantic information likely to underlie the discovered enhancers in an interpretable manner. Finally, in a human placenta study with 4,562 active distal gene regulatory enhancers, SMFM successfully exposed tissue-related placental development and the differential mechanism, demonstrating the generalizability and stability of our proposed framework.
|
25
|
Toneyan S, Tang Z, Koo PK. Evaluating deep learning for predicting epigenomic profiles. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00570-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
|
26
|
Khanal J, Kandel J, Tayara H, Chong KT. CapsNh-Kcr: Capsule network-based prediction of lysine crotonylation sites in human non-histone proteins. Comput Struct Biotechnol J 2022; 21:120-127. [PMID: 36544479 PMCID: PMC9735261 DOI: 10.1016/j.csbj.2022.11.056] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 07/20/2022] [Revised: 11/10/2022] [Accepted: 11/26/2022] [Indexed: 12/04/2022] Open
Abstract
Lysine crotonylation (Kcr) is one of the most important post-translational modifications (PTMs) that is widely detected in both histone and non-histone proteins. In fact, Kcr is reported to be involved in various biological processes, such as metabolism and cell differentiation. However, the available experimental methods for Kcr site identification are laborious and costly. To effectively replace existing experimental approaches, some computational methods have been developed in the last few years. The available computational methods still lack some important aspects, as they can only identify Kcr sites on either histone-only or combined histone and non-histone proteins. Although a tool was developed to identify Kcr sites on non-histone proteins only, its performance is inadequate and the exploration of hidden Kcr patterns (motifs) has been completely ignored, which might be significant for detailed Kcr studies. Therefore, algorithms that can more effectively predict Kcr sites on non-histone proteins with their biological meaning need to be designed. Accordingly, we developed a novel deep learning (capsule network)-based model, named CapsNh-Kcr, for Kcr site prediction, particularly focusing on non-histone proteins. Based on the independent results, the proposed model achieves an AUC of 0.9120, which is approximately 6% higher than that of the previous nhKcr model in the prediction of Kcr sites on non-histone proteins. Further, we revealed, for the first time, that the proposed model can represent obvious motif distribution across Kcr sites in non-histone proteins. The source code (in Python) is publicly available at https://github.com/Jhabindra-bioinfo/CapsNh-Kcr.
Affiliation(s)
- Jhabindra Khanal
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea
| | - Jeevan Kandel
- Graduate School of Integrated Energy-AI, Jeonbuk National University, Jeonju 54896, South Korea
| | - Hilal Tayara
- School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea. Corresponding authors at: School of International Engineering and Science, Jeonbuk National University, Jeonju 54896, South Korea (H. Tayara); Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea (K.T. Chong).
| | - Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Jeonju 54896, South Korea; Advanced Electronics and Information Research Center, Jeonbuk National University, Jeonju 54896, South Korea
|
27
|
Yoon CH, Torrance R, Scheinerman N. Machine learning in medicine: should the pursuit of enhanced interpretability be abandoned? J Med Ethics 2022; 48:581-585. [PMID: 34006600 PMCID: PMC9411871 DOI: 10.1136/medethics-2020-107102] [Citation(s) in RCA: 42] [Impact Index Per Article: 14.0] [Received: 11/25/2020] [Revised: 03/21/2021] [Accepted: 04/08/2021] [Indexed: 06/01/2023]
Abstract
We argue that interpretability should have primacy alongside empiricism, for several reasons: first, if machine learning (ML) models are beginning to render some of the high-risk healthcare decisions instead of clinicians, these models pose a novel medicolegal and ethical frontier that is incompletely addressed by current methods of appraising medical interventions like pharmacological therapies; second, a number of judicial precedents underpinning medical liability and negligence are compromised when 'autonomous' ML recommendations are considered to be on par with human instruction in specific contexts; third, explainable algorithms may be more amenable to the ascertainment and minimisation of biases, with repercussions for racial equity as well as scientific reproducibility and generalisability. We conclude with some reasons for the ineludible importance of interpretability, such as the establishment of trust, in overcoming perhaps the most difficult challenge ML will face in a high-stakes environment like healthcare: professional and public acceptance.
Affiliation(s)
- Chang Ho Yoon
- Big Data Institute, Oxford University, Oxford, UK
- Medical Sciences Doctoral Training Centre, Oxford University, Oxford, UK
- Nuffield Department of Population Health, University of Oxford Richard Doll Building, Oxford, UK
| | - Robert Torrance
- Faculty of Public Health and Policy, London School of Hygiene and Tropical Medicine, London, UK
| | - Naomi Scheinerman
- Department of Medical Ethics and Health Policy, Perelman School of Medicine at the University of Pennsylvania, Philadelphia, Pennsylvania, USA
|
28
|
Asim MN, Ibrahim MA, Zehe C, Trygg J, Dengel A, Ahmed S. BoT-Net: a lightweight bag of tricks-based neural network for efficient LncRNA–miRNA interaction prediction. Interdiscip Sci 2022; 14:841-862. [PMID: 35947255 PMCID: PMC9581873 DOI: 10.1007/s12539-022-00535-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/17/2021] [Revised: 06/16/2022] [Accepted: 07/12/2022] [Indexed: 11/30/2022]
Abstract
Background and objective: Interactions of long non-coding ribonucleic acids (lncRNAs) with micro-ribonucleic acids (miRNAs) play an essential role in gene regulation and in cellular metabolic and pathological processes. Existing purely sequence-based computational approaches lack robustness and efficiency, mainly due to the high length variability of lncRNA sequences. Hence, the prime focus of the current study is to find optimal length trade-offs for lncRNA sequences of highly variable length. Method: The paper performs an in-depth exploration of diverse copy padding and sequence truncation approaches, and presents a novel idea of utilizing only subregions of lncRNA sequences to generate fixed-length lncRNA sequences. Furthermore, it presents a novel bag of tricks-based deep learning approach, "BoT-Net", which leverages a single-layer long short-term memory network regularized through DropConnect to capture higher-order residue dependencies, pooling to retain the most salient features, normalization to prevent exploding and vanishing gradient issues, and learning rate decay and dropout to regularize a precise neural network for lncRNA–miRNA interaction prediction. Results: BoT-Net outperforms the state-of-the-art lncRNA–miRNA interaction prediction approach by 2%, 8%, and 4% in terms of accuracy, specificity, and Matthews correlation coefficient, respectively. Furthermore, a case study analysis indicates that BoT-Net also outperforms the state-of-the-art lncRNA–protein interaction predictor on a benchmark dataset by 10% in accuracy, 19% in sensitivity, 6% in specificity, 14% in precision, and 26% in Matthews correlation coefficient. Conclusion: In the benchmark lncRNA–miRNA interaction prediction dataset, the length of the lncRNA sequences varies from 213 residues to 22,743 residues, and in the benchmark lncRNA–protein interaction prediction dataset, lncRNA sequences vary from 15 residues to 1504 residues. For sequences of such highly variable length, fixed-length generation using copy padding introduces a significant level of bias, which makes a large number of lncRNA sequences nearly identical to each other and eventually derails classifier generalizability. Empirical evaluation reveals that the first 50 residues of long lncRNA sequences alone contain a highly informative distribution for lncRNA–miRNA interaction prediction, a crucial finding exploited by the proposed BoT-Net approach to optimize the lncRNA fixed-length generation process. Availability: The BoT-Net web server can be accessed at https://sds_genetic_analysis.opendfki.de/lncmiRNA/.
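The two fixed-length strategies contrasted in this abstract, copy padding and keeping only the starting subregion, can be illustrated with a minimal Python sketch. This is not the BoT-Net code: the function names `copy_pad` and `head_subregion` are hypothetical, and falling back to copy padding for sequences shorter than the window is an assumption made only so the sketch always returns a fixed length.

```python
def copy_pad(seq: str, length: int) -> str:
    """Repeat the sequence until it reaches `length`, then cut to size."""
    if not seq:
        raise ValueError("sequence must be non-empty")
    repeats = -(-length // len(seq))  # ceiling division
    return (seq * repeats)[:length]


def head_subregion(seq: str, length: int = 50) -> str:
    """Keep only the first `length` residues of the sequence.

    Assumption: sequences shorter than `length` are copy-padded so the
    output always has exactly `length` residues.
    """
    if len(seq) >= length:
        return seq[:length]
    return copy_pad(seq, length)


if __name__ == "__main__":
    short = "ACGU"
    long_seq = "ACGUUGCA" * 100  # 800 residues, purely illustrative
    print(copy_pad(short, 10))
    print(len(head_subregion(long_seq)))
```

Copy padding a short sequence up to a very long target length fills most of the output with repeats, which is one way to picture the bias the abstract describes.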
Affiliation(s)
- Muhammad Nabeel Asim
- Department of Computer Science, Technical University of Kaiserslautern, 67663, Kaiserslautern, Rhineland-Palatinate, Germany.
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany.
| | - Muhammad Ali Ibrahim
- Department of Computer Science, Technical University of Kaiserslautern, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
| | - Christoph Zehe
- Sartorius Stedim Cellca GmbH, 88471, Laupheim, Baden-Wurttemberg, Germany
| | - Johan Trygg
- Sartorius Stedim Cellca GmbH, 88471, Laupheim, Baden-Wurttemberg, Germany
- Computational Life Science Cluster (CLiC), Umea University, 90187, Umea, Sweden
| | - Andreas Dengel
- Department of Computer Science, Technical University of Kaiserslautern, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
| | - Sheraz Ahmed
- German Research Center for Artificial Intelligence GmbH, 67663, Kaiserslautern, Rhineland-Palatinate, Germany
- Computational Life Science Cluster (CLiC), Umea University, 90187, Umea, Sweden
|
29
|
Green AG, Yoon CH, Chen ML, Ektefaie Y, Fina M, Freschi L, Gröschel MI, Kohane I, Beam A, Farhat M. A convolutional neural network highlights mutations relevant to antimicrobial resistance in Mycobacterium tuberculosis. Nat Commun 2022; 13:3817. [PMID: 35780211 PMCID: PMC9250494 DOI: 10.1038/s41467-022-31236-0] [Citation(s) in RCA: 28] [Impact Index Per Article: 9.3] [Received: 12/13/2021] [Accepted: 06/10/2022] [Indexed: 11/30/2022] Open
Abstract
Long diagnostic wait times hinder international efforts to address antibiotic resistance in M. tuberculosis. Pathogen whole genome sequencing, coupled with statistical and machine learning models, offers a promising solution. However, generalizability and clinical adoption have been limited by a lack of interpretability, especially in deep learning methods. Here, we present two deep convolutional neural networks that predict antibiotic resistance phenotypes of M. tuberculosis isolates: a multi-drug CNN (MD-CNN) that predicts resistance to 13 antibiotics based on 18 genomic loci, with AUCs of 82.6-99.5% and higher sensitivity than state-of-the-art methods; and a set of 13 single-drug CNNs (SD-CNN) with AUCs of 80.1-97.1% and higher specificity than the previous state of the art. Using saliency methods to evaluate the contribution of input sequence features to the SD-CNN predictions, we identify 18 sites in the genome not previously associated with resistance. The CNN models permit functional variant discovery, biologically meaningful interpretation, and clinical applicability.
Affiliation(s)
- Anna G Green
- Department of Biomedical Informatics, Harvard Medical School, 25 Shattuck St, Boston, MA, 02115, USA
| | - Chang Ho Yoon
- Department of Biomedical Informatics, Harvard Medical School, 25 Shattuck St, Boston, MA, 02115, USA
- Big Data Institute, Nuffield Department of Population Health, University of Oxford, Oxford, OX37LF, UK
| | - Michael L Chen
- Department of Biomedical Informatics, Harvard Medical School, 25 Shattuck St, Boston, MA, 02115, USA
- Stanford University School of Medicine, 291 Campus Dr, Stanford, CA, 94305, USA
| | - Yasha Ektefaie
- Department of Biomedical Informatics, Harvard Medical School, 25 Shattuck St, Boston, MA, 02115, USA
| | - Mack Fina
- Harvard College, Cambridge, MA, 02138, USA
| | - Luca Freschi
- Department of Biomedical Informatics, Harvard Medical School, 25 Shattuck St, Boston, MA, 02115, USA
| | - Matthias I Gröschel
- Department of Biomedical Informatics, Harvard Medical School, 25 Shattuck St, Boston, MA, 02115, USA
| | - Isaac Kohane
- Department of Biomedical Informatics, Harvard Medical School, 25 Shattuck St, Boston, MA, 02115, USA
| | - Andrew Beam
- Department of Biomedical Informatics, Harvard Medical School, 25 Shattuck St, Boston, MA, 02115, USA.
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, 677 Huntington Ave, Boston, MA, 02115, USA.
| | - Maha Farhat
- Department of Biomedical Informatics, Harvard Medical School, 25 Shattuck St, Boston, MA, 02115, USA.
- Division of Pulmonary & Critical Care, Massachusetts General Hospital, 55 Fruit St, Boston, MA, 02114, USA.
|
30
|
Dsouza KB, Maslova A, Al-Jibury E, Merkenschlager M, Bhargava VK, Libbrecht MW. Learning representations of chromatin contacts using a recurrent neural network identifies genomic drivers of conformation. Nat Commun 2022; 13:3704. [PMID: 35764630 PMCID: PMC9240038 DOI: 10.1038/s41467-022-31337-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/05/2021] [Accepted: 06/15/2022] [Indexed: 11/28/2022] Open
Abstract
Despite the availability of chromatin conformation capture experiments, discerning the relationship between the 1D genome and 3D conformation remains a challenge, which limits our understanding of their effect on gene expression and disease. We propose Hi-C-LSTM, a method that produces low-dimensional latent representations that summarize intra-chromosomal Hi-C contacts via a recurrent long short-term memory neural network model. We find that these representations contain all the information needed to recreate the observed Hi-C matrix with high accuracy, outperforming existing methods. These representations enable the identification of a variety of conformation-defining genomic elements, including nuclear compartments and conformation-related transcription factors. They furthermore enable in-silico perturbation experiments that measure the influence of cis-regulatory elements on conformation.
Affiliation(s)
- Kevin B Dsouza
- Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, Canada.
| | - Alexandra Maslova
- School of Computing Science, Simon Fraser University, Burnaby, Canada
| | - Ediem Al-Jibury
- MRC, London Institute of Medical Sciences, Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London, UK
- Department of Computing, Imperial College London, London, UK
| | - Matthias Merkenschlager
- MRC, London Institute of Medical Sciences, Institute of Clinical Sciences, Faculty of Medicine, Imperial College London, London, UK
| | - Vijay K Bhargava
- Department of Electrical and Computer Engineering, University of British Columbia, Vancouver, Canada
|
31
|
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
|
32
|
Lu AX, Lu AX, Pritišanac I, Zarin T, Forman-Kay JD, Moses AM. Discovering molecular features of intrinsically disordered regions by using evolution for contrastive learning. PLoS Comput Biol 2022; 18:e1010238. [PMID: 35767567 PMCID: PMC9275697 DOI: 10.1371/journal.pcbi.1010238] [Citation(s) in RCA: 21] [Impact Index Per Article: 7.0] [Received: 08/04/2021] [Revised: 07/12/2022] [Accepted: 05/23/2022] [Indexed: 02/07/2023] Open
Abstract
A major challenge to the characterization of intrinsically disordered regions (IDRs), which are widespread in the proteome, but relatively poorly understood, is the identification of molecular features that mediate functions of these regions, such as short motifs, amino acid repeats and physicochemical properties. Here, we introduce a proteome-scale feature discovery approach for IDRs. Our approach, which we call "reverse homology", exploits the principle that important functional features are conserved over evolution. We use this as a contrastive learning signal for deep learning: given a set of homologous IDRs, the neural network has to correctly choose a held-out homolog from another set of IDRs sampled randomly from the proteome. We pair reverse homology with a simple architecture and standard interpretation techniques, and show that the network learns conserved features of IDRs that can be interpreted as motifs, repeats, or bulk features like charge or amino acid propensities. We also show that our model can be used to produce visualizations of what residues and regions are most important to IDR function, generating hypotheses for uncharacterized IDRs. Our results suggest that feature discovery using unsupervised neural networks is a promising avenue to gain systematic insight into poorly understood protein sequences.
Affiliation(s)
- Alex X. Lu
- Department of Computer Science, University of Toronto, Toronto, Canada
| | - Amy X. Lu
- Department of Computer Science, University of Toronto, Toronto, Canada
| | - Iva Pritišanac
- Department of Cell and Systems Biology, University of Toronto, Toronto, Canada
- Program in Molecular Medicine, Hospital for Sick Children, Toronto, Canada
| | - Taraneh Zarin
- Department of Cell and Systems Biology, University of Toronto, Toronto, Canada
| | - Julie D. Forman-Kay
- Program in Molecular Medicine, Hospital for Sick Children, Toronto, Canada
- Department of Biochemistry, University of Toronto, Toronto, Canada
| | - Alan M. Moses
- Department of Computer Science, University of Toronto, Toronto, Canada
- Department of Cell and Systems Biology, University of Toronto, Toronto, Canada
|
33
|
Liu S, Cheng H, Ashraf J, Zhang Y, Wang Q, Lv L, He M, Song G, Zuo D. Interpretation of convolutional neural networks reveals crucial sequence features involving in transcription during fiber development. BMC Bioinformatics 2022; 23:91. [PMID: 35291940 PMCID: PMC8922751 DOI: 10.1186/s12859-022-04619-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/16/2021] [Accepted: 02/22/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Upland cotton provides the most natural fiber in the world. During fiber development, the quality and yield of fiber are influenced by gene transcription. Revealing sequence features related to transcription has a profound impact on cotton molecular breeding. We applied convolutional neural networks to predict gene expression status based on the sequences of gene transcription start regions. After that, a gradient-based interpretation and an N-adjusted kernel transformation were implemented to extract sequence features contributing to transcription. RESULTS: Our models had accuracies of approximately 80%, and the area under the receiver operating characteristic curve reached over 0.85. Gradient-based interpretation revealed that the 5' untranslated region contributed to gene transcription. Furthermore, 6 DOF binding motifs and 4 transcription activator binding motifs were obtained by N-adjusted kernel-motif transformation from models in three developmental stages. Apart from 10 general motifs, 3 DOF5.1 genes were also detected. In silico analysis of these motifs' binding proteins implied their potential functions in fiber formation. In addition, we found some novel motifs in plants that are important sequence features for transcription. CONCLUSIONS: The N-adjusted kernel transformation method could interpret convolutional neural networks and reveal important sequence features related to transcription during fiber development. Potential functions of motifs interpreted from convolutional neural networks could be validated by further wet-lab experiments and applied in cotton molecular breeding.
Affiliation(s)
- Shang Liu
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, 455000, China; Zhengzhou Research Base, State Key Laboratory of Cotton Biology, Zhengzhou University, Zhengzhou, 450001, China
| | - Hailiang Cheng
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, 455000, China; Zhengzhou Research Base, State Key Laboratory of Cotton Biology, Zhengzhou University, Zhengzhou, 450001, China
| | - Javaria Ashraf
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, 455000, China; Department of Plant Breeding and Genetics, University College of Agriculture and Environmental Sciences, The Islamia University of Bahawalpur, Punjab, 63100, Pakistan
| | - Youping Zhang
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, 455000, China; Zhengzhou Research Base, State Key Laboratory of Cotton Biology, Zhengzhou University, Zhengzhou, 450001, China
| | - Qiaolian Wang
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, 455000, China; Zhengzhou Research Base, State Key Laboratory of Cotton Biology, Zhengzhou University, Zhengzhou, 450001, China
| | - Limin Lv
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, 455000, China; Zhengzhou Research Base, State Key Laboratory of Cotton Biology, Zhengzhou University, Zhengzhou, 450001, China
| | - Man He
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, 455000, China
| | - Guoli Song
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, 455000, China; Zhengzhou Research Base, State Key Laboratory of Cotton Biology, Zhengzhou University, Zhengzhou, 450001, China.
| | - Dongyun Zuo
- Institute of Cotton Research of Chinese Academy of Agricultural Sciences, Anyang, 455000, China; Zhengzhou Research Base, State Key Laboratory of Cotton Biology, Zhengzhou University, Zhengzhou, 450001, China.
|
34
|
Zhang S, Ma A, Zhao J, Xu D, Ma Q, Wang Y. Assessing deep learning methods in cis-regulatory motif finding based on genomic sequencing data. Brief Bioinform 2022; 23:bbab374. [PMID: 34607350 PMCID: PMC8769700 DOI: 10.1093/bib/bbab374] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 05/19/2021] [Revised: 08/22/2021] [Accepted: 08/23/2021] [Indexed: 12/28/2022] Open
Abstract
Identifying cis-regulatory motifs from genomic sequencing data (e.g. ChIP-seq and CLIP-seq) is crucial in identifying transcription factor (TF) binding sites and inferring gene regulatory mechanisms for any organism. Since 2015, deep learning (DL) methods have been widely applied to identify TF binding sites and predict motif patterns, with the strengths of offering a scalable, flexible and unified computational approach for highly accurate predictions. As far as we know, 20 DL methods have been developed. However, without a clear and systematic assessment, users will struggle to choose the most appropriate tool for their specific studies. In this manuscript, we evaluated 20 DL methods for cis-regulatory motif prediction using 690 ENCODE ChIP-seq, 126 cancer ChIP-seq and 55 RNA CLIP-seq datasets. Four metrics were investigated: the accuracy of motif finding, the performance of DNA/RNA sequence classification, algorithm scalability and tool usability. The assessment results demonstrated the high complementarity of the existing DL methods. It was determined that the choice of the most suitable model should primarily depend on the data size and type and on the method's outputs.
Affiliation(s)
- Shuangquan Zhang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
| | - Anjun Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
| | - Jing Zhao
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
| | - Dong Xu
- Department of Electrical Engineering and Computer Science, and Christopher S. Bond Life Science Center, University of Missouri, MO, 65211, USA
| | - Qin Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, 43210, USA
| | - Yan Wang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, 130012, China
- School of Artificial Intelligence, Jilin University, Changchun, 130012, China
|
35
|
Ahmed S, Muhammod R, Khan ZH, Adilina S, Sharma A, Shatabda S, Dehzangi A. ACP-MHCNN: an accurate multi-headed deep-convolutional neural network to predict anticancer peptides. Sci Rep 2021; 11:23676. [PMID: 34880291 PMCID: PMC8654959 DOI: 10.1038/s41598-021-02703-3] [Citation(s) in RCA: 46] [Impact Index Per Article: 11.5] [Received: 01/21/2021] [Accepted: 11/17/2021] [Indexed: 01/10/2023] Open
Abstract
Although advancing the therapeutic alternatives for treating deadly cancers has gained much attention globally, primary methods such as chemotherapy still have significant downsides and low specificity. Most recently, anticancer peptides (ACPs) have emerged as a potential alternative to existing therapies, with far fewer negative side effects. However, the identification of ACPs through wet-lab experiments is expensive and time-consuming. Hence, computational methods have emerged as viable alternatives. During the past few years, several computational ACP identification techniques using hand-engineered features have been proposed to solve this problem. In this study, we propose a new multi-headed deep convolutional neural network model, called ACP-MHCNN, for extracting and combining discriminative features from different information sources in an interactive way. Our model extracts sequence, physicochemical, and evolutionary features for ACP identification using different numerical peptide representations while restraining parameter overhead. Rigorous experiments using cross-validation and independent datasets show that ACP-MHCNN outperforms other models for anticancer peptide identification by a substantial margin on our employed benchmarks. ACP-MHCNN outperforms the state-of-the-art model by 6.3%, 8.6%, 3.7%, 4.0%, and 0.20 in terms of accuracy, sensitivity, specificity, precision, and MCC, respectively. ACP-MHCNN and its relevant code and datasets are publicly available at: https://github.com/mrzResearchArena/Anticancer-Peptides-CNN. ACP-MHCNN is also publicly available as an online predictor at: https://anticancer.pythonanywhere.com/.
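The five metrics named in this abstract (accuracy, sensitivity, specificity, precision, and MCC) all follow from the four confusion-matrix counts. A minimal Python sketch using the standard textbook definitions is given here for reference; the function name `classification_metrics` and the example counts are hypothetical and are not taken from the paper. It assumes both classes are present and at least one positive prediction is made, so none of the ratio denominators is zero.

```python
import math


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Compute common binary-classification metrics from confusion-matrix counts."""
    total = tp + fp + tn + fn
    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),  # true positive rate (recall)
        "specificity": tn / (tn + fp),  # true negative rate
        "precision": tp / (tp + fp),
        # MCC is conventionally set to 0 when its denominator is zero.
        "mcc": (tp * tn - fp * fn) / mcc_denominator if mcc_denominator else 0.0,
    }


if __name__ == "__main__":
    # Illustrative counts only.
    for name, value in classification_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```

Note that MCC ranges from -1 to 1 rather than 0 to 1, which is why the abstract reports its improvement as an absolute difference (0.20) rather than a percentage.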
Affiliation(s)
- Sajid Ahmed
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Rafsanjani Muhammod
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Zahid Hossain Khan
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Sheikh Adilina
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh
| | - Alok Sharma
- Laboratory for Medical Science Mathematics, RIKEN Center for Integrative Medical Sciences, Yokohama, 230-0045, Japan
- Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, QLD, 4111, Australia
| | - Swakkhar Shatabda
- Department of Computer Science and Engineering, United International University, Dhaka, Bangladesh.
| | - Abdollah Dehzangi
- Department of Computer Science, Rutgers University, Camden, NJ, 08102, USA.
- Center for Computational and Integrative Biology, Rutgers University, Camden, NJ, 08102, USA.
|
36
|
Ullah F, Ben-Hur A. A self-attention model for inferring cooperativity between regulatory features. Nucleic Acids Res 2021; 49:e77. [PMID: 33950192 PMCID: PMC8287919 DOI: 10.1093/nar/gkab349] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 10/07/2020] [Revised: 04/15/2021] [Accepted: 04/20/2021] [Indexed: 11/14/2022] Open
Abstract
Deep learning has demonstrated its predictive power in modeling complex biological phenomena such as gene expression. The value of these models hinges not only on their accuracy, but also on the ability to extract biologically relevant information from the trained models. While there has been much recent work on developing feature attribution methods that discover the most important features for a given sequence, inferring cooperativity between regulatory elements, which is the hallmark of phenomena such as gene expression, remains an open problem. We present SATORI, a Self-ATtentiOn based model to detect Regulatory element Interactions. Our approach combines convolutional layers with a self-attention mechanism that helps us capture a global view of the landscape of interactions between regulatory elements in a sequence. A comprehensive evaluation demonstrates the ability of SATORI to identify numerous statistically significant TF-TF interactions, many of which have been previously reported. Our method is able to detect higher numbers of experimentally verified TF-TF interactions than existing methods, and has the advantage of not requiring a computationally expensive post-processing step. Finally, SATORI can be used for detection of any type of feature interaction in models that use a similar attention mechanism, and is not limited to the detection of TF-TF interactions.
Affiliation(s)
- Fahad Ullah
- Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
| | - Asa Ben-Hur
- Department of Computer Science, Colorado State University, Fort Collins, CO 80523, USA
|
37
|
Hammelman J, Gifford DK. Discovering differential genome sequence activity with interpretable and efficient deep learning. PLoS Comput Biol 2021; 17:e1009282. [PMID: 34370721 PMCID: PMC8376110 DOI: 10.1371/journal.pcbi.1009282] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 03/06/2021] [Revised: 08/19/2021] [Accepted: 07/16/2021] [Indexed: 11/23/2022] Open
Abstract
Discovering sequence features that differentially direct cells to alternate fates is key to understanding both cellular development and the consequences of disease-related mutations. We introduce Expected Pattern Effect and Differential Expected Pattern Effect, two black-box methods that can interpret genome regulatory sequences for cell type-specific or condition-specific patterns. We show that these methods identify relevant transcription factor motifs and spacings that are predictive of cell state-specific chromatin accessibility. Finally, we integrate these methods into a framework that is readily accessible to non-experts and available for download as a binary or installed via PyPI or bioconda at https://cgs.csail.mit.edu/deepaccess-package/.
Affiliation(s)
- Jennifer Hammelman
- Computational and Systems Biology, MIT, Cambridge, Massachusetts, United States of America
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts, United States of America
| | - David K. Gifford
- Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, Massachusetts, United States of America
- Department of Electrical Engineering & Computer Science, MIT, Cambridge, Massachusetts, United States of America
- Department of Biological Engineering, MIT, Cambridge, Massachusetts, United States of America
|
38
|
Li JY, Jin S, Tu XM, Ding Y, Gao G. Identifying complex motifs in massive omics data with a variable-convolutional layer in deep neural network. Brief Bioinform 2021; 22:6312656. [PMID: 34219140 DOI: 10.1093/bib/bbab233] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 02/25/2021] [Revised: 05/25/2021] [Accepted: 05/28/2021] [Indexed: 01/10/2023] Open
Abstract
Motif identification is among the most common and essential computational tasks in bioinformatics and genomics. Here we propose a novel convolutional layer for deep neural networks, named the variable convolutional (vConv) layer, for effective motif identification in high-throughput omics data by learning kernel length adaptively from data. Empirical evaluations on DNA-protein binding and DNase footprinting cases demonstrated that vConv-based networks have superior performance to their convolutional counterparts regardless of model complexity. Meanwhile, vConv can be readily integrated into multi-layer neural networks as an 'in-place replacement' of the canonical convolutional layer. All source code is freely available on GitHub for academic use.
Affiliation(s)
- Jing-Yi Li
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
| | - Shen Jin
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Xin-Ming Tu
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
| | - Yang Ding
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
| | - Ge Gao
- Biomedical Pioneering Innovation Center & Beijing Advanced Innovation Center for Genomics, Center for Bioinformatics, and State Key Laboratory of Protein and Plant Gene Research at School of Life Sciences, Peking University, Beijing 100871, China
|
39
|
Zeng X, Park SJ, Nakai K. Characterizing Promoter and Enhancer Sequences by a Deep Learning Method. Front Genet 2021; 12:681259. [PMID: 34211503 PMCID: PMC8239401 DOI: 10.3389/fgene.2021.681259] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 03/16/2021] [Accepted: 05/20/2021] [Indexed: 11/21/2022] Open
Abstract
Promoters and enhancers are well-known regulatory elements modulating gene expression. As confirmed by high-throughput sequencing technologies, these regulatory elements are bidirectionally transcribed. That is, promoters produce stable mRNA in the sense direction and unstable RNA in the antisense direction, while enhancers transcribe unstable RNA in both directions. Although it is thought that enhancers and promoters share a similar architecture of transcription start sites (TSSs), how the transcriptional machinery distinctly uses these genomic regions as promoters or enhancers remains unclear. To address this issue, we developed a deep learning (DL) method by utilizing a convolutional neural network (CNN) and the saliency algorithm. In comparison with other classifiers, our CNN presented higher predictive performance, suggesting the overarching importance of the high-order sequence features, captured by the CNN. Moreover, our method revealed that there are substantial sequence differences between the enhancers and promoters. Remarkably, the 20-120 bp downstream regions from the center of bidirectional TSSs seemed to contribute to the RNA stability. These regions in promoters tend to have a larger number of guanines and cytosines compared to those in enhancers, and this feature contributed to the classification of the regulatory elements. Our CNN-based method can capture the complex TSS architectures. We found that the genomic regions around TSSs for promoters and enhancers contribute to RNA stability and show GC-biased characteristics as a critical determinant for promoter TSSs.
Affiliation(s)
- Xin Zeng
- Department of Computational Biology and Medical Science, The University of Tokyo, Kashiwa, Japan
| | - Sung-Joon Park
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
| | - Kenta Nakai
- Department of Computational Biology and Medical Science, The University of Tokyo, Kashiwa, Japan
- Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
|
40
|
Zrimec J, Buric F, Kokina M, Garcia V, Zelezniak A. Learning the Regulatory Code of Gene Expression. Front Mol Biosci 2021; 8:673363. [PMID: 34179082 PMCID: PMC8223075 DOI: 10.3389/fmolb.2021.673363] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/27/2021] [Accepted: 05/24/2021] [Indexed: 11/13/2022] Open
Abstract
Data-driven machine learning is the method of choice for predicting molecular phenotypes from nucleotide sequence, modeling gene expression events including protein-DNA binding, chromatin states as well as mRNA and protein levels. Deep neural networks automatically learn informative sequence representations and interpreting them enables us to improve our understanding of the regulatory code governing gene expression. Here, we review the latest developments that apply shallow or deep learning to quantify molecular phenotypes and decode the cis-regulatory grammar from prokaryotic and eukaryotic sequencing data. Our approach is to build from the ground up, first focusing on the initiating protein-DNA interactions, then specific coding and non-coding regions, and finally on advances that combine multiple parts of the gene and mRNA regulatory structures, achieving unprecedented performance. We thus provide a quantitative view of gene expression regulation from nucleotide sequence, concluding with an information-centric overview of the central dogma of molecular biology.
Affiliation(s)
- Jan Zrimec
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Filip Buric
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
| | - Mariia Kokina
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Novo Nordisk Foundation Center for Biosustainability, Technical University of Denmark, Kongens Lyngby, Denmark
| | - Victor Garcia
- School of Life Sciences and Facility Management, Zurich University of Applied Sciences, Wädenswil, Switzerland
| | - Aleksej Zelezniak
- Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Science for Life Laboratory, Stockholm, Sweden
|
41
|
Discovery of widespread transcription initiation at microsatellites predictable by sequence-based deep neural network. Nat Commun 2021; 12:3297. [PMID: 34078885 PMCID: PMC8172540 DOI: 10.1038/s41467-021-23143-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 07/15/2020] [Accepted: 04/13/2021] [Indexed: 02/04/2023] Open
Abstract
Using the Cap Analysis of Gene Expression (CAGE) technology, the FANTOM5 consortium provided one of the most comprehensive maps of transcription start sites (TSSs) in several species. Strikingly, ~72% of them could not be assigned to a specific gene and initiate at unconventional regions, outside promoters or enhancers. Here, we probe these unassigned TSSs and show that, in all species studied, a significant fraction of CAGE peaks initiate at microsatellites, also called short tandem repeats (STRs). To confirm this transcription, we develop Cap Trap RNA-seq, a technology which combines cap trapping and long read MinION sequencing. We train sequence-based deep learning models able to predict CAGE signal at STRs with high accuracy. These models unveil the importance of STR surrounding sequences not only to distinguish STR classes, but also to predict the level of transcription initiation. Importantly, genetic variants linked to human diseases are preferentially found at STRs with high transcription initiation level, supporting the biological and clinical relevance of transcription initiation at STRs. Together, our results extend the repertoire of non-coding transcription associated with DNA tandem repeats and complexify STR polymorphism.
|
42
|
Koo PK, Majdandzic A, Ploenzke M, Anand P, Paul SB. Global importance analysis: An interpretability method to quantify importance of genomic features in deep neural networks. PLoS Comput Biol 2021; 17:e1008925. [PMID: 33983921 PMCID: PMC8118286 DOI: 10.1371/journal.pcbi.1008925] [Citation(s) in RCA: 51] [Impact Index Per Article: 12.8] [Received: 08/18/2020] [Accepted: 03/30/2021] [Indexed: 12/15/2022] Open
Abstract
Deep neural networks have demonstrated improved performance at predicting the sequence specificities of DNA- and RNA-binding proteins compared to previous methods that rely on k-mers and position weight matrices. To gain insights into why a DNN makes a given prediction, model interpretability methods, such as attribution methods, can be employed to identify motif-like representations along a given sequence. Because explanations are given on an individual sequence basis and can vary substantially across sequences, deducing generalizable trends across the dataset and quantifying their effect size remains a challenge. Here we introduce global importance analysis (GIA), a model interpretability method that quantifies the population-level effect size that putative patterns have on model predictions. GIA provides an avenue to quantitatively test hypotheses of putative patterns and their interactions with other patterns, as well as map out specific functions the network has learned. As a case study, we demonstrate the utility of GIA on the computational task of predicting RNA-protein interactions from sequence. We first introduce a convolutional network that we call ResidualBind and benchmark its performance against previous methods on RNAcompete data. Using GIA, we then demonstrate that in addition to sequence motifs, ResidualBind learns a model that considers the number of motifs, their spacing, and sequence context, such as RNA secondary structure and GC-bias.
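The population-level effect size described in this abstract can be illustrated with a short sketch. It assumes one common reading of the idea: embed a putative pattern at a fixed position in many background sequences and compare the mean model prediction with and without it. The names `toy_model`, `sample_background`, `embed`, and `global_importance`, the uniform background, and the motif are all hypothetical and are not taken from the paper or its code.

```python
import numpy as np

rng = np.random.default_rng(0)

# Integer encoding of RNA: A=0, C=1, G=2, U=3.
MOTIF = np.array([3, 2, 1, 0, 3, 2])  # "UGCAUG", chosen only for illustration


def sample_background(n, length):
    """Draw n random sequences with uniform nucleotide frequencies (an assumption)."""
    return rng.integers(0, 4, size=(n, length))


def embed(seqs, pattern, position):
    """Return a copy of seqs with the pattern written in at the given position."""
    out = seqs.copy()
    out[:, position:position + len(pattern)] = pattern
    return out


def toy_model(seqs):
    """Hypothetical stand-in for a trained network.

    Scores each sequence by the best fraction of MOTIF positions matched
    in any window, so the score lies in [0, 1].
    """
    windows = np.lib.stride_tricks.sliding_window_view(seqs, len(MOTIF), axis=1)
    return (windows == MOTIF).mean(axis=2).max(axis=1)


def global_importance(model, background, pattern, position):
    """Mean prediction with the pattern embedded minus mean prediction without it."""
    with_pattern = model(embed(background, pattern, position))
    return with_pattern.mean() - model(background).mean()


background = sample_background(n=1000, length=40)
gia = global_importance(toy_model, background, MOTIF, position=17)
print(f"global importance of the embedded pattern: {gia:.3f}")
```

Replacing `toy_model` with a real predictor, and the uniform background with sequences that better match the data distribution, would be the natural next steps; both choices strongly affect the estimate.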
Affiliation(s)
- Peter K. Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| | - Antonio Majdandzic
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
| | - Matthew Ploenzke
- Department of Biostatistics, Harvard University, Cambridge, Massachusetts, United States of America
| | - Praveen Anand
- Dana-Farber Cancer Institute, Boston, Massachusetts, United States of America
| | - Steffan B. Paul
- Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, Massachusetts, United States of America
|
43
|
Menichelli C, Guitard V, Martins RM, Lèbre S, Lopez-Rubio JJ, Lecellier CH, Bréhélin L. Identification of long regulatory elements in the genome of Plasmodium falciparum and other eukaryotes. PLoS Comput Biol 2021; 17:e1008909. [PMID: 33861755 PMCID: PMC8081344 DOI: 10.1371/journal.pcbi.1008909] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/07/2020] [Revised: 04/28/2021] [Accepted: 03/24/2021] [Indexed: 01/15/2023] Open
Abstract
Long regulatory elements (LREs), such as CpG islands, polydA:dT tracts or AU-rich elements, are thought to play key roles in gene regulation but, as opposed to conventional binding sites of transcription factors, few methods have been proposed to formally and automatically characterize them. We present here a computational approach named DExTER (Domain Exploration To Explain gene Regulation) dedicated to the identification of candidate LREs (cLREs) and apply it to the analysis of the genomes of P. falciparum and other eukaryotes. Our analyses show that all tested genomes contain several cLREs that are somewhat conserved along evolution, and that gene expression can be predicted with surprising accuracy on the basis of these long regions only. Regulation by cLREs exhibits very different behaviours depending on species and conditions. In P. falciparum and other Apicomplexan organisms as well as in Dictyostelium discoideum, the process appears highly dynamic, with different cLREs involved at different phases of the life cycle. For multicellular organisms, the same cLREs are involved in all tissues, but a dynamic behavior is observed along embryonic development stages. In P. falciparum, whose genome is known to be strongly depleted of transcription factors, cLREs are predictive of expression with an accuracy above 70%, and our analyses show that they are associated with both transcriptional and post-transcriptional regulation signals. Moreover, we assessed the biological relevance of one LRE discovered by DExTER in P. falciparum using an in vivo reporter assay. The source code (python) of DExTER is available at https://gite.lirmm.fr/menichelli/DExTER.
Affiliation(s)
| | - Vincent Guitard
- Laboratory of Pathogen-Host Interactions (LPHI), UMR5235, CNRS, Montpellier University, INSERM, Montpellier, France
| | - Rafael M. Martins
- Laboratory of Pathogen-Host Interactions (LPHI), UMR5235, CNRS, Montpellier University, INSERM, Montpellier, France
| | - Sophie Lèbre
- IMAG, Univ. Montpellier, CNRS, Montpellier, France
- Univ. Paul-Valéry-Montpellier 3, Montpellier, France
| | - Jose-Juan Lopez-Rubio
- Laboratory of Pathogen-Host Interactions (LPHI), UMR5235, CNRS, Montpellier University, INSERM, Montpellier, France
| | - Charles-Henri Lecellier
- LIRMM, Univ Montpellier, CNRS, Montpellier, France
- Institut de Génétique Moléculaire de Montpellier, University of Montpellier, CNRS, Montpellier, France
|
44
|
Koo PK, Ploenzke M. Improving representations of genomic sequence motifs in convolutional networks with exponential activations. Nat Mach Intell 2021; 3:258-266. [PMID: 34322657 DOI: 10.1038/s42256-020-00291-x] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Indexed: 02/07/2023]
Abstract
Deep convolutional neural networks (CNNs) trained on regulatory genomic sequences tend to build representations in a distributed manner, making it a challenge to extract learned features that are biologically meaningful, such as sequence motifs. Here we perform a comprehensive analysis on synthetic sequences to investigate the role that CNN activations have on model interpretability. We show that employing an exponential activation to first layer filters consistently leads to interpretable and robust representations of motifs compared to other commonly used activations. Strikingly, we demonstrate that CNNs with better test performance do not necessarily imply more interpretable representations with attribution methods. We find that CNNs with exponential activations significantly improve the efficacy of recovering biologically meaningful representations with attribution methods. We demonstrate these results generalise to real DNA sequences across several in vivo datasets. Together, this work demonstrates how a small modification to existing CNNs, i.e. setting exponential activations in the first layer, can significantly improve the robustness and interpretability of learned representations directly in convolutional filters and indirectly with attribution methods.
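The modification named in this abstract, an exponential activation on the first convolutional layer, can be sketched in a few lines of NumPy. This is a minimal illustration only: the hand-built filter, the motif, the sequence, and the helper names `one_hot` and `first_layer` are hypothetical, and nothing here reproduces the paper's models or results. The one property shown is arithmetic: under `exp`, a difference of d in pre-activation becomes a multiplicative factor of e^d, whereas ReLU clips negatives to zero and keeps positive differences linear.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def one_hot(seq, alphabet="ACGT"):
    """Encode a DNA string as an (L, 4) one-hot array."""
    idx = np.array([alphabet.index(c) for c in seq])
    return np.eye(len(alphabet))[idx]


def first_layer(x, filters, activation):
    """Scan each (k, 4) filter along an (L, 4) sequence, then apply the activation.

    filters has shape (n_filters, k, 4); the output has shape (L - k + 1, n_filters).
    """
    k = filters.shape[1]
    windows = sliding_window_view(x, (k, x.shape[1]))[:, 0]   # (L-k+1, k, 4)
    pre = np.einsum("lka,fka->lf", windows, filters)          # pre-activations
    return activation(pre)


# Hypothetical filter: +1 for the motif nucleotide at each position, -1 otherwise,
# so the pre-activation equals (matches - mismatches) for each window.
motif = "TGACTCA"
filters = (2.0 * one_hot(motif) - 1.0)[None, :, :]

seq = "ACGTTGACTCAGGCCAT"   # contains the motif once, starting at index 4
x = one_hot(seq)

relu_out = first_layer(x, filters, lambda z: np.maximum(z, 0.0))
exp_out = first_layer(x, filters, np.exp)

print("ReLU activations:", np.round(relu_out[:, 0], 2))
print("exp activations: ", np.round(exp_out[:, 0], 4))
```

In a deep learning framework the same change would amount to swapping the activation function applied to the first convolutional layer's output while leaving the rest of the network unchanged.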
Affiliation(s)
- Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Matt Ploenzke
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
|
45
|
Zrimec J, Börlin CS, Buric F, Muhammad AS, Chen R, Siewers V, Verendel V, Nielsen J, Töpel M, Zelezniak A. Deep learning suggests that gene expression is encoded in all parts of a co-evolving interacting gene regulatory structure. Nat Commun 2020; 11:6141. [PMID: 33262328 PMCID: PMC7708451 DOI: 10.1038/s41467-020-19921-4] [Citation(s) in RCA: 88] [Impact Index Per Article: 17.6] [Received: 12/18/2019] [Accepted: 11/02/2020] [Indexed: 12/31/2022] Open
Abstract
Understanding the genetic regulatory code governing gene expression is an important challenge in molecular biology. However, how individual coding and non-coding regions of the gene regulatory structure interact and contribute to mRNA expression levels remains unclear. Here we apply deep learning to over 20,000 mRNA datasets to examine the genetic regulatory code controlling mRNA abundance in 7 model organisms ranging from bacteria to human. In all organisms, we can predict mRNA abundance directly from DNA sequence, with up to 82% of the variation of transcript levels encoded in the gene regulatory structure. By searching for DNA regulatory motifs across the gene regulatory structure, we discover that motif interactions could explain the whole dynamic range of mRNA levels. Co-evolution across coding and non-coding regions suggests that it is not single motifs or regions, but the entire gene regulatory structure and specific combination of regulatory elements that define gene expression levels.
Affiliation(s)
- Jan Zrimec
- Department of Biology and Biological Engineering, Chalmers University of Technology, Kemivägen 10, SE-412 96, Gothenburg, Sweden
| | - Christoph S Börlin
- Department of Biology and Biological Engineering, Chalmers University of Technology, Kemivägen 10, SE-412 96, Gothenburg, Sweden
- Novo Nordisk Foundation Center for Biosustainability, Chalmers University of Technology, Kemivägen 10, SE-412 96, Gothenburg, Sweden
| | - Filip Buric
- Department of Biology and Biological Engineering, Chalmers University of Technology, Kemivägen 10, SE-412 96, Gothenburg, Sweden
| | - Azam Sheikh Muhammad
- Computer Science and Engineering, Chalmers University of Technology, Kemivägen 10, SE-412 96, Gothenburg, Sweden
| | - Rhongzen Chen
- Computer Science and Engineering, Chalmers University of Technology, Kemivägen 10, SE-412 96, Gothenburg, Sweden
| | - Verena Siewers
- Department of Biology and Biological Engineering, Chalmers University of Technology, Kemivägen 10, SE-412 96, Gothenburg, Sweden
- Novo Nordisk Foundation Center for Biosustainability, Chalmers University of Technology, Kemivägen 10, SE-412 96, Gothenburg, Sweden
| | - Vilhelm Verendel
- Computer Science and Engineering, Chalmers University of Technology, Kemivägen 10, SE-412 96, Gothenburg, Sweden
| | - Jens Nielsen
- Department of Biology and Biological Engineering, Chalmers University of Technology, Kemivägen 10, SE-412 96, Gothenburg, Sweden
- Novo Nordisk Foundation Center for Biosustainability, Chalmers University of Technology, Kemivägen 10, SE-412 96, Gothenburg, Sweden
| | - Mats Töpel
- Department of Marine Sciences, University of Gothenburg, Box 461, SE-405 30, Gothenburg, Sweden
- Gothenburg Global Biodiversity Center (GGBC), Box 461, 40530, Gothenburg, Sweden
| | - Aleksej Zelezniak
- Department of Biology and Biological Engineering, Chalmers University of Technology, Kemivägen 10, SE-412 96, Gothenburg, Sweden.
- Science for Life Laboratory, Tomtebodavägen 23a, SE-171 65, Stockholm, Sweden.
|
46
|
Valeri JA, Collins KM, Ramesh P, Alcantar MA, Lepe BA, Lu TK, Camacho DM. Sequence-to-function deep learning frameworks for engineered riboregulators. Nat Commun 2020; 11:5058. [PMID: 33028819 PMCID: PMC7541510 DOI: 10.1038/s41467-020-18676-2] [Citation(s) in RCA: 63] [Impact Index Per Article: 12.6] [Received: 12/03/2019] [Accepted: 09/02/2020] [Indexed: 12/26/2022] Open
Abstract
While synthetic biology has revolutionized our approaches to medicine, agriculture, and energy, the design of completely novel biological circuit components beyond naturally-derived templates remains challenging due to poorly understood design rules. Toehold switches, which are programmable nucleic acid sensors, face an analogous design bottleneck; our limited understanding of how sequence impacts functionality often necessitates expensive, time-consuming screens to identify effective switches. Here, we introduce Sequence-based Toehold Optimization and Redesign Model (STORM) and Nucleic-Acid Speech (NuSpeak), two orthogonal and synergistic deep learning architectures to characterize and optimize toeholds. Applying techniques from computer vision and natural language processing, we 'un-box' our models using convolutional filters, attention maps, and in silico mutagenesis. Through transfer-learning, we redesign sub-optimal toehold sensors, even with sparse training data, experimentally validating their improved performance. This work provides sequence-to-function deep learning frameworks for toehold selection and design, augmenting our ability to construct potent biological circuit components and precision diagnostics.
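In silico mutagenesis, one of the interpretation techniques this abstract mentions, can be sketched generically: score every single-nucleotide substitution of a sequence and record the change relative to the unmutated sequence. The sketch below is not the STORM or NuSpeak code. The scorer `toy_model`, its random `WEIGHTS` matrix, and the example sequence are hypothetical placeholders for a trained sequence-to-function model.

```python
import numpy as np

ALPHABET = "ACGU"
rng = np.random.default_rng(0)

# Hypothetical per-position, per-nucleotide scores standing in for a trained model.
WEIGHTS = rng.normal(size=(12, 4))


def toy_model(seq):
    """Additive toy scorer: sum of the weight of the nucleotide at each position."""
    return sum(WEIGHTS[i, ALPHABET.index(nt)] for i, nt in enumerate(seq))


def in_silico_mutagenesis(model, seq):
    """Return an (L, 4) array of score changes for every single substitution.

    Entry [i, j] is model(mutant) - model(seq), where the mutant carries
    ALPHABET[j] at position i. Entries for the original nucleotide stay 0.
    """
    base = model(seq)
    effects = np.zeros((len(seq), len(ALPHABET)))
    for i in range(len(seq)):
        for j, nt in enumerate(ALPHABET):
            if nt == seq[i]:
                continue
            mutant = seq[:i] + nt + seq[i + 1:]
            effects[i, j] = model(mutant) - base
    return effects


wild_type = "GGGAGACUUAGC"  # arbitrary 12-nt example, not a real toehold sequence
effects = in_silico_mutagenesis(toy_model, wild_type)

# Positions where substitutions change the score most are the most sensitive.
sensitivity = np.abs(effects).max(axis=1)
print("most sensitive position:", int(sensitivity.argmax()))
```

With a real network, the L x 3 mutant sequences would usually be scored in one batch instead of a Python loop, but the resulting effect matrix has the same meaning.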
Affiliation(s)
- Jacqueline A Valeri
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
- Institute for Medical Engineering and Science and Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Katherine M Collins
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
- Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Pradeep Ramesh
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
| | - Miguel A Alcantar
- Institute for Medical Engineering and Science and Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Bianca A Lepe
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
- Institute for Medical Engineering and Science and Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
| | - Timothy K Lu
- Institute for Medical Engineering and Science and Department of Biological Engineering, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
- Synthetic Biology Group, Research Laboratory of Electronics, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA.
| | - Diogo M Camacho
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA.
|
47
|
Angenent-Mari NM, Garruss AS, Soenksen LR, Church G, Collins JJ. A deep learning approach to programmable RNA switches. Nat Commun 2020; 11:5057. [PMID: 33028812 PMCID: PMC7541447 DOI: 10.1038/s41467-020-18677-1] [Citation(s) in RCA: 82] [Impact Index Per Article: 16.4] [Received: 12/03/2019] [Accepted: 07/31/2020] [Indexed: 12/21/2022] Open
Abstract
Engineered RNA elements are programmable tools capable of detecting small molecules, proteins, and nucleic acids. Predicting the behavior of these synthetic biology components remains a challenge, a situation that could be addressed through enhanced pattern recognition from deep learning. Here, we investigate Deep Neural Networks (DNN) to predict toehold switch function as a canonical riboswitch model in synthetic biology. To facilitate DNN training, we synthesize and characterize in vivo a dataset of 91,534 toehold switches spanning 23 viral genomes and 906 human transcription factors. DNNs trained on nucleotide sequences outperform (R² = 0.43-0.70) previous state-of-the-art thermodynamic and kinetic models (R² = 0.04-0.15) and allow for human-understandable attention-visualizations (VIS4Map) to identify success and failure modes. This work shows that deep learning approaches can be used for functionality predictions and insight generation in RNA synthetic biology.
Affiliation(s)
- Nicolaas M Angenent-Mari
- Department of Biological Engineering, Massachusetts Institute of Technology (MIT), Cambridge, MA, 02139, USA
- Institute for Medical Engineering and Science (IMES), MIT, Cambridge, MA, 02139, USA
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
| | - Alexander S Garruss
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
- Program in Bioinformatics and Integrative Genomics, Harvard University, Cambridge, MA, 02138, USA
- Department of Genetics, Harvard Medical School, Boston, MA, 02115, USA
| | - Luis R Soenksen
- Department of Biological Engineering, Massachusetts Institute of Technology (MIT), Cambridge, MA, 02139, USA
- Institute for Medical Engineering and Science (IMES), MIT, Cambridge, MA, 02139, USA
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
- Department of Mechanical Engineering, MIT, Cambridge, MA, 02139, USA
| | - George Church
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA
- Department of Genetics, Harvard Medical School, Boston, MA, 02115, USA
- Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA, 02139, USA
| | - James J Collins
- Department of Biological Engineering, Massachusetts Institute of Technology (MIT), Cambridge, MA, 02139, USA.
- Institute for Medical Engineering and Science (IMES), MIT, Cambridge, MA, 02139, USA.
- Wyss Institute for Biologically Inspired Engineering, Harvard University, Boston, MA, 02115, USA.
- Department of Mechanical Engineering, MIT, Cambridge, MA, 02139, USA.
- Harvard-MIT Program in Health Sciences and Technology, Cambridge, MA, 02139, USA.
|
48
|
Koo PK, Ploenzke M. Deep learning for inferring transcription factor binding sites. Curr Opin Syst Biol 2020; 19:16-23. [PMID: 32905524 PMCID: PMC7469942 DOI: 10.1016/j.coisb.2020.04.001] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Indexed: 01/28/2023]
Abstract
Deep learning is a powerful tool for predicting transcription factor binding sites from DNA sequence. Despite their high predictive accuracy, there are no guarantees that high-performing deep learning models will learn causal sequence-function relationships. Thus a move beyond performance comparisons on benchmark datasets is needed. Interpreting model predictions is a powerful approach to identify which features drive performance gains and ideally provide insight into the underlying biological mechanisms. Here we highlight timely advances in deep learning for genomics, with a focus on inferring transcription factor binding sites. We describe recent applications, model architectures, and advances in local and global model interpretability methods, then conclude with a discussion on future research directions.
Affiliation(s)
- Peter K Koo
- Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Matt Ploenzke
- Department of Biostatistics, Harvard University, Cambridge, MA, USA
|