51
Moyers BA, Partridge EC, Mackiewicz M, Betti MJ, Darji R, Meadows SK, Newberry KM, Brandsmeier LA, Wold BJ, Mendenhall EM, Myers RM. Characterization of human transcription factor function and patterns of gene regulation in HepG2 cells. Genome Res 2023; 33:1879-1892. [PMID: 37852782 PMCID: PMC10760452 DOI: 10.1101/gr.278205.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/03/2023] [Accepted: 10/13/2023] [Indexed: 10/20/2023]
Abstract
Transcription factors (TFs) are trans-acting proteins that bind cis-regulatory elements (CREs) in DNA to control gene expression. Here, we analyzed the genomic localization profiles of 529 sequence-specific TFs and 151 cofactors and chromatin regulators in the human cancer cell line HepG2, for a total of 680 broadly termed DNA-associated proteins (DAPs). We used this deep collection to model each TF's impact on gene expression, and identified a cohort of 26 candidate transcriptional repressors. We examine high occupancy target (HOT) sites in the context of three-dimensional genome organization and show biased motif placement in distal-promoter connections involving HOT sites. We also found a substantial number of closed chromatin regions with multiple DAPs bound, and explored their properties, finding that a MAFF/MAFK TF pair correlates with transcriptional repression. Altogether, these analyses provide novel insights into the regulatory logic of the human cell line HepG2 genome and show the usefulness of large genomic analyses for elucidation of individual TF functions.
Affiliation(s)
- Belle A Moyers
  - HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Mark Mackiewicz
  - HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Michael J Betti
  - Vanderbilt University Medical Center, Nashville, Tennessee 37232, USA
- Roshan Darji
  - HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Sarah K Meadows
  - HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Barbara J Wold
  - Merkin Institute for Translational Research, California Institute of Technology, Pasadena, California 91125, USA
- Eric M Mendenhall
  - HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
- Richard M Myers
  - HudsonAlpha Institute for Biotechnology, Huntsville, Alabama 35806, USA
52
Huang C, Shuai RW, Baokar P, Chung R, Rastogi R, Kathail P, Ioannidis NM. Personal transcriptome variation is poorly explained by current genomic deep learning models. Nat Genet 2023; 55:2056-2059. [PMID: 38036790 PMCID: PMC10703684 DOI: 10.1038/s41588-023-01574-w] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Received: 04/22/2023] [Accepted: 10/18/2023] [Indexed: 12/02/2023]
Abstract
Genomic deep learning models can predict genome-wide epigenetic features and gene expression levels directly from DNA sequence. While current models perform well at predicting gene expression levels across genes in different cell types from the reference genome, their ability to explain expression variation between individuals due to cis-regulatory genetic variants remains largely unexplored. Here, we evaluate four state-of-the-art models on paired personal genome and transcriptome data and find limited performance when explaining variation in expression across individuals. In addition, models often fail to predict the correct direction of effect of cis-regulatory genetic variation on expression.
Affiliation(s)
- Connie Huang
  - Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Richard W Shuai
  - Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Parth Baokar
  - Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Ryan Chung
  - Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
- Ruchir Rastogi
  - Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
- Pooja Kathail
  - Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
- Nilah M Ioannidis
  - Department of Electrical Engineering and Computer Sciences, University of California Berkeley, Berkeley, CA, USA
  - Center for Computational Biology, University of California Berkeley, Berkeley, CA, USA
  - Chan Zuckerberg Biohub, San Francisco, CA, USA
53
Wang H, Ji Z. T-cell receptor sequences correlate with and predict gene expression levels in T cells. bioRxiv 2023:2023.11.27.568912. [PMID: 38076860 PMCID: PMC10705237 DOI: 10.1101/2023.11.27.568912] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/18/2023]
Abstract
T cells exhibit high heterogeneity in both their gene expression profiles and antigen specificities. We analyzed fifteen single-cell immune profiling datasets to systematically investigate the association between T-cell receptor (TCR) sequences and the gene expression profiles of T cells. Our findings reveal that T cells sharing identical or similar TCR sequences tend to have highly similar gene expression profiles. Additionally, we developed a foundational model that leverages TCR information to predict gene expression levels in T cells.
Affiliation(s)
- Hao Wang
  - Department of Statistical Science, Duke University, Durham, NC, USA
- Zhicheng Ji
  - Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, USA
54
Chen Y, Paramo MI, Zhang Y, Yao L, Shah SR, Jin Y, Zhang J, Pan X, Yu H. Finding Needles in the Haystack: Strategies for Uncovering Noncoding Regulatory Variants. Annu Rev Genet 2023; 57:201-222. [PMID: 37562413 DOI: 10.1146/annurev-genet-030723-120717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/12/2023]
Abstract
Despite accumulating evidence implicating noncoding variants in human diseases, unraveling their functionality remains a significant challenge. Systematic annotations of the regulatory landscape and the growth of sequence variant data sets have fueled the development of tools and methods to identify causal noncoding variants and evaluate their regulatory effects. Here, we review the latest advances in the field and discuss potential future research avenues to gain a more in-depth understanding of noncoding regulatory variants.
Affiliation(s)
- You Chen
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Mauricio I Paramo
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Yingying Zhang
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Li Yao
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
  - Department of Computational Biology, Cornell University, Ithaca, New York, USA
- Sagar R Shah
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Yiyang Jin
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Junke Zhang
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
  - Department of Computational Biology, Cornell University, Ithaca, New York, USA
- Xiuqi Pan
  - Department of Molecular Biology and Genetics, Cornell University, Ithaca, New York, USA
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
- Haiyuan Yu
  - Weill Institute for Cell and Molecular Biology, Cornell University, Ithaca, New York, USA
  - Department of Computational Biology, Cornell University, Ithaca, New York, USA
55
Pianfetti E, Lovino M, Ficarra E, Martignetti L. MiREx: mRNA levels prediction from gene sequence and miRNA target knowledge. BMC Bioinformatics 2023; 24:443. [PMID: 37993778 PMCID: PMC10666312 DOI: 10.1186/s12859-023-05560-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/30/2023] [Accepted: 11/06/2023] [Indexed: 11/24/2023] Open
Abstract
Messenger RNA (mRNA) has an essential role in the protein production process. Predicting mRNA expression levels accurately is crucial for understanding gene regulation, and various models (statistical and neural network-based) have been developed for this purpose. A few models predict mRNA expression levels from the DNA sequence, exploiting the DNA sequence and gene features (e.g., number of exons/introns, gene length). Other models include information about long-range interaction molecules (i.e., enhancers/silencers) and transcriptional regulators as predictive features, such as transcription factors (TFs) and small RNAs (e.g., microRNAs, or miRNAs). Recently, a convolutional neural network (CNN) model, called Xpresso, has been proposed for mRNA expression level prediction leveraging the promoter sequence and mRNAs' half-life features (gene features). To push forward the mRNA level prediction, we present miREx, a CNN-based tool that includes information about miRNA targets and expression levels in the model. Indeed, each miRNA can target specific genes, and the model exploits this information to guide the learning process. In detail, not all miRNAs are included, only a selected subset with the highest impact on the model. MiREx has been evaluated on four cancer primary sites from the Genomic Data Commons (GDC) database: lung, kidney, breast, and corpus uteri. Results show that mRNA level prediction benefits from selected miRNA targets and expression information. Future model developments could include other transcriptional regulators or be trained with proteomics data to infer protein levels.
Affiliation(s)
- Elena Pianfetti
  - Department of Engineering, University of Modena and Reggio Emilia, Via Vivarelli 10/1, Modena, 41225, Italy
- Marta Lovino
  - Department of Engineering, University of Modena and Reggio Emilia, Via Vivarelli 10/1, Modena, 41225, Italy
- Elisa Ficarra
  - Department of Engineering, University of Modena and Reggio Emilia, Via Vivarelli 10/1, Modena, 41225, Italy
- Loredana Martignetti
  - Institut Curie, Rue d'Ulm 26, Paris, 75005, France
  - Inserm U900, Paris, France
  - CBIO-Centre for Computational Biology, Paris, France
  - PSL Research University, Paris, France
56
Bhogale S, Seward C, Stubbs L, Sinha S. SEAMoD: A fully interpretable neural network for cis-regulatory analysis of differentially expressed genes. bioRxiv 2023:2023.11.09.565900. [PMID: 38014229 PMCID: PMC10680628 DOI: 10.1101/2023.11.09.565900] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2023]
Abstract
A common way to investigate gene regulatory mechanisms is to identify differentially expressed genes using transcriptomics, find their candidate enhancers using epigenomics, and search for over-represented transcription factor (TF) motifs in these enhancers using bioinformatics tools. A related follow-up task is to model gene expression as a function of enhancer sequences and rank TF motifs by their contribution to such models, thus prioritizing among regulators. We present a new computational tool called SEAMoD that performs the above tasks of motif finding and sequence-to-expression modeling simultaneously. It trains a convolutional neural network model to relate enhancer sequences to differential expression in one or more biological conditions. The model uses TF motifs to interpret the sequences, learning these motifs and their relative importance to each biological condition from data. It also utilizes epigenomic information in the form of activity scores of putative enhancers and automatically searches for the most promising enhancer for each gene. Compared to existing neural network models of non-coding sequences, SEAMoD uses far fewer parameters, requires far less training data, and emphasizes biological interpretability. We used SEAMoD to understand regulatory mechanisms underlying the differentiation of neural stem cells (NSCs) derived from mouse forebrain. We profiled gene expression and histone modifications in NSCs and three differentiated cell types and used SEAMoD to model differential expression of nearly 12,000 genes with an accuracy of 81%, in the process identifying Olig2, E2f family TFs, Foxo3, and Tcf4 as key transcriptional regulators of the differentiation process.
57
Gjoni K, Pollard KS. SuPreMo: a computational tool for streamlining in silico perturbation using sequence-based predictive models. bioRxiv 2023:2023.11.03.565556. [PMID: 37961123 PMCID: PMC10635135 DOI: 10.1101/2023.11.03.565556] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Computationally editing genome sequences is a common bioinformatics task, but current approaches have limitations, such as incompatibility with structural variants, challenges in identifying responsible sequence perturbations, and the need for VCF file inputs and phased data. To address these bottlenecks, we present Sequence Mutator for Predictive Models (SuPreMo), a scalable and comprehensive tool for performing in silico mutagenesis. We then demonstrate how pairs of reference and perturbed sequences can be used with machine learning models to prioritize pathogenic variants or discover new functional sequences.
Affiliation(s)
- Ketrin Gjoni
  - Gladstone Institute of Data Science and Biotechnology, San Francisco, CA 94158, USA
  - Department of Epidemiology & Biostatistics, University of California, San Francisco, CA 94158, USA
- Katherine S Pollard
  - Gladstone Institute of Data Science and Biotechnology, San Francisco, CA 94158, USA
  - Department of Epidemiology & Biostatistics, University of California, San Francisco, CA 94158, USA
  - Chan Zuckerberg Biohub, San Francisco, CA 94158, USA
58
Klie A, Laub D, Talwar JV, Stites H, Jores T, Solvason JJ, Farley EK, Carter H. Predictive analyses of regulatory sequences with EUGENe. Nat Comput Sci 2023; 3:946-956. [PMID: 38177592 PMCID: PMC10768637 DOI: 10.1038/s43588-023-00544-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2023] [Accepted: 09/27/2023] [Indexed: 01/06/2024]
Abstract
Deep learning has become a popular tool to study cis-regulatory function. Yet efforts to design software for deep-learning analyses in regulatory genomics that are findable, accessible, interoperable and reusable (FAIR) have fallen short of fully meeting these criteria. Here we present elucidating the utility of genomic elements with neural nets (EUGENe), a FAIR toolkit for the analysis of genomic sequences with deep learning. EUGENe consists of a set of modules and subpackages for executing the key functionality of a genomics deep learning workflow: (1) extracting, transforming and loading sequence data from many common file formats; (2) instantiating, initializing and training diverse model architectures; and (3) evaluating and interpreting model behavior. We designed EUGENe as a simple, flexible and extensible interface for streamlining and customizing end-to-end deep-learning sequence analyses, and illustrate these principles through application of the toolkit to three predictive modeling tasks. We hope that EUGENe represents a springboard towards a collaborative ecosystem for deep-learning applications in genomics research.
Affiliation(s)
- Adam Klie
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
- David Laub
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
- James V Talwar
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
- Tobias Jores
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Joe J Solvason
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
  - Department of Molecular Biology, University of California San Diego, La Jolla, CA, USA
- Emma K Farley
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
  - Department of Molecular Biology, University of California San Diego, La Jolla, CA, USA
- Hannah Carter
  - Department of Medicine, University of California San Diego, La Jolla, CA, USA
  - Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA, USA
59
Nambiar A, Dubinkina V, Liu S, Maslov S. FUN-PROSE: A deep learning approach to predict condition-specific gene expression in fungi. PLoS Comput Biol 2023; 19:e1011563. [PMID: 37971967 PMCID: PMC10653424 DOI: 10.1371/journal.pcbi.1011563] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/19/2023] [Accepted: 09/30/2023] [Indexed: 11/19/2023] Open
Abstract
The mRNA levels of all genes in a genome are a critical piece of information defining the overall state of the cell in a given environmental condition. Being able to reconstruct such condition-specific expression in fungal genomes is particularly important to metabolically engineer these organisms to produce desired chemicals in industrially scalable conditions. Most previous deep learning approaches focused on predicting the average expression levels of a gene based on its promoter sequence, ignoring its variation across different conditions. Here we present FUN-PROSE, a deep learning model trained to predict differential expression of individual genes across various conditions using their promoter sequences and expression levels of all transcription factors. We train and test our model on three fungal species and obtain correlations between predicted and observed condition-specific gene expression as high as 0.85. We then interpret our model to extract promoter sequence motifs responsible for variable expression of individual genes. We also carried out input feature importance analysis to connect individual transcription factors to their gene targets. A sizeable fraction of both sequence motifs and TF-gene interactions learned by our model agree with previously known biological information, while the rest correspond to either novel biological facts or indirect correlations.
Affiliation(s)
- Ananthan Nambiar
  - Department of Bioengineering, University of Illinois Urbana-Champaign, Urbana, Illinois, United States of America
  - Carl R. Woese Institute for Genomic Biology, Urbana, Illinois, United States of America
- Veronika Dubinkina
  - Department of Bioengineering, University of Illinois Urbana-Champaign, Urbana, Illinois, United States of America
  - Carl R. Woese Institute for Genomic Biology, Urbana, Illinois, United States of America
  - The Gladstone Institute of Data Science and Biotechnology, San Francisco, California, United States of America
- Simon Liu
  - Carl R. Woese Institute for Genomic Biology, Urbana, Illinois, United States of America
  - Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, Illinois, United States of America
- Sergei Maslov
  - Department of Bioengineering, University of Illinois Urbana-Champaign, Urbana, Illinois, United States of America
  - Carl R. Woese Institute for Genomic Biology, Urbana, Illinois, United States of America
  - Department of Physics, University of Illinois Urbana-Champaign, Urbana, Illinois, United States of America
  - Computing, Environment and Life Sciences, Argonne National Laboratory, Lemont, Illinois, United States of America
60
Khodursky S, Zheng EB, Svetec N, Durkin SM, Benjamin S, Gadau A, Wu X, Zhao L. The evolution and mutational robustness of chromatin accessibility in Drosophila. Genome Biol 2023; 24:232. [PMID: 37845780 PMCID: PMC10578003 DOI: 10.1186/s13059-023-03079-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Accepted: 09/29/2023] [Indexed: 10/18/2023] Open
Abstract
BACKGROUND: The evolution of genomic regulatory regions plays a critical role in shaping the diversity of life. While this process is primarily sequence-dependent, the enormous complexity of biological systems complicates the understanding of the factors underlying regulation and its evolution. Here, we apply deep neural networks as a tool to investigate the sequence determinants underlying chromatin accessibility in different species and tissues of Drosophila. RESULTS: We train hybrid convolution-attention neural networks to accurately predict ATAC-seq peaks using only local DNA sequences as input. We show that our models generalize well across substantially evolutionarily diverged species of insects, implying that the sequence determinants of accessibility are highly conserved. Using our model to examine species-specific gains in accessibility, we find evidence suggesting that these regions may be ancestrally poised for evolution. Using in silico mutagenesis, we show that accessibility can be accurately predicted from short subsequences in each example. However, in silico knock-out of these sequences does not qualitatively impair classification, implying that accessibility is mutationally robust. Subsequently, we show that accessibility is predicted to be robust to large-scale random mutation even in the absence of selection. Conversely, simulations under strong selection demonstrate that accessibility can be extremely malleable despite its robustness. Finally, we identify motifs predictive of accessibility, recovering both novel and previously known motifs. CONCLUSIONS: These results demonstrate the conservation of the sequence determinants of accessibility and the general robustness of chromatin accessibility, as well as the power of deep neural networks to explore fundamental questions in regulatory genomics and evolution.
Affiliation(s)
- Samuel Khodursky
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
- Eric B Zheng
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
- Nicolas Svetec
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
- Sylvia M Durkin
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
  - Present Address: Department of Integrative Biology and Museum of Vertebrate Zoology, University of California, Berkeley, Berkeley, CA, USA
- Sigi Benjamin
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
- Alice Gadau
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
- Xia Wu
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
- Li Zhao
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
61
Huang L, Song M, Shen H, Hong H, Gong P, Deng HW, Zhang C. Deep Learning Methods for Omics Data Imputation. Biology (Basel) 2023; 12:1313. [PMID: 37887023 PMCID: PMC10604785 DOI: 10.3390/biology12101313] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 08/08/2023] [Revised: 09/28/2023] [Accepted: 10/02/2023] [Indexed: 10/28/2023]
Abstract
One common problem in omics data analysis is missing values, which can arise for various reasons, such as poor tissue quality and insufficient sample volumes. Instead of discarding missing values and related data, imputation approaches offer an alternative means of handling missing data. However, the imputation of missing omics data is a non-trivial task. Difficulties mainly come from high dimensionality, non-linear or non-monotonic relationships within features, technical variations introduced by sampling methods, sample heterogeneity, and the non-random missingness mechanism. Several advanced imputation methods, including deep learning-based methods, have been proposed to address these challenges. Because deep learning models can capture complex patterns and relationships in large, high-dimensional datasets, many researchers have adopted them to impute missing omics data. This review provides a comprehensive overview of the currently available deep learning-based methods for omics imputation from the perspective of deep generative model architectures such as autoencoder, variational autoencoder, generative adversarial networks, and Transformer, with an emphasis on multi-omics data imputation. This review also discusses the opportunities that deep learning brings and the challenges that it might face in this field.
Affiliation(s)
- Lei Huang
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
- Meng Song
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
- Hui Shen
  - Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Huixiao Hong
  - Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA
- Ping Gong
  - Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS 39180, USA
- Hong-Wen Deng
  - Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Chaoyang Zhang
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
62
Vogel JW, Corriveau-Lecavalier N, Franzmeier N, Pereira JB, Brown JA, Maass A, Botha H, Seeley WW, Bassett DS, Jones DT, Ewers M. Connectome-based modelling of neurodegenerative diseases: towards precision medicine and mechanistic insight. Nat Rev Neurosci 2023; 24:620-639. [PMID: 37620599 DOI: 10.1038/s41583-023-00731-8] [Citation(s) in RCA: 52] [Impact Index Per Article: 26.0] [Accepted: 07/26/2023] [Indexed: 08/26/2023]
Abstract
Neurodegenerative diseases are the most common cause of dementia. Although their underlying molecular pathologies have been identified, there is substantial heterogeneity in the patterns of progressive brain alterations across and within these diseases. Recent advances in neuroimaging methods have revealed that pathological proteins accumulate along specific macroscale brain networks, implicating the network architecture of the brain in the system-level pathophysiology of neurodegenerative diseases. However, the extent to which 'network-based neurodegeneration' applies across the wide range of neurodegenerative disorders remains unclear. Here, we discuss the state of the art of neuroimaging-based connectomics for the mapping and prediction of neurodegenerative processes. We review findings supporting brain networks as passive conduits through which pathological proteins spread. As an alternative view, we also discuss complementary work suggesting that network alterations actively modulate the spreading of pathological proteins between connected brain regions. We conclude this Perspective by proposing an integrative framework in which connectome-based models can be advanced along three dimensions of innovation: incorporating parameters that modulate propagation behaviour on the basis of measurable biological features; building patient-tailored models that use individual-level information; and allowing model parameters to interact dynamically over time. We discuss promises and pitfalls of these strategies for improving disease insights and moving towards precision medicine.
Collapse
Affiliation(s)
- Jacob W Vogel
- Department of Clinical Sciences, SciLifeLab, Lund University, Lund, Sweden.
| | - Nick Corriveau-Lecavalier
- Department of Neurology, Mayo Clinic, Rochester, MN, USA
- Department of Psychiatry and Psychology, Mayo Clinic, Rochester, MN, USA
| | - Nicolai Franzmeier
- Institute for Stroke and Dementia Research (ISD), University Hospital, LMU Munich, Munich, Germany
- Munich Cluster for Systems Neurology (SyNergy), Munich, Germany
- Department of Psychiatry and Neurochemistry, Institute of Neuroscience and Physiology, The Sahlgrenska Acadamy, University of Gothenburg, Mölndal and Gothenburg, Sweden
| | - Joana B Pereira
- Clinical Memory Research Unit, Department of Clinical Sciences, Lund University, Malmö, Sweden
- Neuro Division, Department of Clinical Neurosciences, Karolinska Institute, Stockholm, Sweden
| | - Jesse A Brown
- Memory and Aging Center, Department of Neurology, University of California, San Francisco, CA, USA
| | - Anne Maass
- German Center for Neurodegenerative Diseases (DZNE), Magdeburg, Germany
| | - Hugo Botha
- Department of Neurology, Mayo Clinic, Rochester, MN, USA
| | - William W Seeley
- Memory and Aging Center, Department of Neurology, University of California, San Francisco, CA, USA
- Department of Pathology, University of California, San Francisco, CA, USA
| | - Dani S Bassett
- Departments of Bioengineering, Electrical and Systems Engineering, Physics and Astronomy, Neurology and Psychiatry, University of Pennsylvania, Philadelphia, PA, USA
- Santa Fe Institute, Santa Fe, NM, USA
| | - David T Jones
- Department of Neurology, Mayo Clinic, Rochester, MN, USA
- Department of Radiology, Mayo Clinic, Rochester, MN, USA
| | - Michael Ewers
- Institute for Stroke and Dementia Research (ISD), University Hospital, LMU Munich, Munich, Germany.
| |
|
63
|
Sasse A, Ng B, Spiro AE, Tasaki S, Bennett DA, Gaiteri C, De Jager PL, Chikina M, Mostafavi S. Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings. bioRxiv 2023:2023.03.16.532969. [PMID: 36993652 PMCID: PMC10055057 DOI: 10.1101/2023.03.16.532969] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 03/31/2023]
Abstract
Deep learning methods have recently become the state-of-the-art in a variety of regulatory genomic tasks [1-6], including the prediction of gene expression from genomic DNA. As such, these methods promise to serve as important tools in interpreting the full spectrum of genetic variation observed in personal genomes. Previous evaluation strategies have assessed their predictions of gene expression across genomic regions; however, systematic benchmarking is lacking to assess their predictions across individuals, which would directly evaluate their utility as personal DNA interpreters. We used paired Whole Genome Sequencing and gene expression from 839 individuals in the ROSMAP study [7] to evaluate the ability of current methods to predict gene expression variation across individuals at varied loci. Our approach identifies a limitation of current methods to correctly predict the direction of variant effects. We show that this limitation stems from insufficiently learnt sequence motif grammar, and suggest new model training strategies to improve performance.
Affiliation(s)
- Alexander Sasse
- Paul G. Allen School of Computer Science and Engineering, University of Washington, WA, USA, 98195
| | - Bernard Ng
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, Illinois, USA, 60612
| | - Anna E Spiro
- Paul G. Allen School of Computer Science and Engineering, University of Washington, WA, USA, 98195
| | - Shinya Tasaki
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, Illinois, USA, 60612
| | - David A Bennett
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, Illinois, USA, 60612
| | - Christopher Gaiteri
- Rush Alzheimer's Disease Center, Rush University Medical Center, Chicago, Illinois, USA, 60612
- Department of Psychiatry, SUNY Upstate Medical University, Syracuse, NY, USA 13210
| | - Philip L De Jager
- Center for Translational & Computational Neuroimmunology, Department of Neurology, and the Taub Institute for the Study of Alzheimer's Disease and the Aging Brain, Columbia University Irving Medical Center, New York, NY, USA, 10032
| | - Maria Chikina
- Department of Computational and Systems Biology, University of Pittsburgh, Pittsburgh, PA, USA, 15260
| | - Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington, WA, USA, 98195
- Canadian Institute for Advanced Research, Toronto, ON, Canada, MG5 1ZB
| |
|
64
|
Sokolova K, Theesfeld CL, Wong AK, Zhang Z, Dolinski K, Troyanskaya OG. Atlas of primary cell-type-specific sequence models of gene expression and variant effects. Cell Reports Methods 2023; 3:100580. [PMID: 37703883 PMCID: PMC10545936 DOI: 10.1016/j.crmeth.2023.100580] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/15/2023] [Revised: 05/05/2023] [Accepted: 08/18/2023] [Indexed: 09/15/2023]
Abstract
Human biology is rooted in highly specialized cell types programmed by a common genome, 98% of which is outside of genes. Genetic variation in the enormous noncoding space is linked to the majority of disease risk. To address the problem of linking these variants to expression changes in primary human cells, we introduce ExPectoSC, an atlas of modular deep-learning-based models for predicting cell-type-specific gene expression directly from sequence. We provide models for 105 primary human cell types covering 7 organ systems, demonstrate their accuracy, and then apply them to prioritize relevant cell types for complex human diseases. The resulting atlas of sequence-based gene expression and variant effects is publicly available in a user-friendly interface and readily extensible to any primary cell types. We demonstrate the accuracy of our approach through systematic evaluations and apply the models to prioritize ClinVar clinical variants of uncertain significance, verifying our top predictions experimentally.
Affiliation(s)
- Ksenia Sokolova
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA; Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Chandra L Theesfeld
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA.
| | - Aaron K Wong
- Flatiron Institute, Simons Foundation, New York City, NY 10001, USA
| | - Zijun Zhang
- Flatiron Institute, Simons Foundation, New York City, NY 10001, USA; Division of Artificial Intelligence in Medicine, Cedars-Sinai Medical Center, 116 N. Robertson Boulevard, Los Angeles, CA 90048, USA
| | - Kara Dolinski
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Olga G Troyanskaya
- Department of Computer Science, Princeton University, Princeton, NJ 08544, USA; Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA; Flatiron Institute, Simons Foundation, New York City, NY 10001, USA.
| |
|
65
|
Hepkema J, Lee NK, Stewart BJ, Ruangroengkulrith S, Charoensawan V, Clatworthy MR, Hemberg M. Predicting the impact of sequence motifs on gene regulation using single-cell data. Genome Biol 2023; 24:189. [PMID: 37582793 PMCID: PMC10426127 DOI: 10.1186/s13059-023-03021-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2022] [Accepted: 07/21/2023] [Indexed: 08/17/2023] Open
Abstract
The binding of transcription factors at proximal promoters and distal enhancers is central to gene regulation. Identifying regulatory motifs and quantifying their impact on expression remains challenging. Using a convolutional neural network trained on single-cell data, we infer putative regulatory motifs and cell type-specific importance. Our model, scover, explains 29% of the variance in gene expression in multiple mouse tissues. Applying scover to distal enhancers identified using scATAC-seq from the developing human brain, we identify cell type-specific motif activities in distal enhancers. Scover can identify regulatory motifs and their importance from single-cell data where all parameters and outputs are easily interpretable.
Affiliation(s)
- Jacob Hepkema
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
| | - Nicholas Keone Lee
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- The Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK
| | - Benjamin J Stewart
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Molecular Immunity Unit, Department of Medicine, University of Cambridge, Cambridge, CB2 0QQ, UK
- Cambridge University Hospitals NHS Foundation Trust and NIHR Cambridge Biomedical Research Centre, Cambridge, CB2 0QQ, UK
| | - Siwat Ruangroengkulrith
- Department of Biochemistry, Faculty of Science, Mahidol University, Bangkok, 10400, Thailand
| | - Varodom Charoensawan
- Department of Biochemistry, Faculty of Science, Mahidol University, Bangkok, 10400, Thailand
- Integrative Computational BioScience (ICBS) Center, Mahidol University, Nakhon Pathom, 7310, Thailand
- Systems Biology of Diseases Research Unit, Faculty of Science, Mahidol University, Bangkok, 10400, Thailand
| | - Menna R Clatworthy
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK
- Molecular Immunity Unit, Department of Medicine, University of Cambridge, Cambridge, CB2 0QQ, UK
- Cambridge University Hospitals NHS Foundation Trust and NIHR Cambridge Biomedical Research Centre, Cambridge, CB2 0QQ, UK
| | - Martin Hemberg
- Wellcome Sanger Institute, Wellcome Genome Campus, Hinxton, CB10 1SA, UK.
- The Gurdon Institute, University of Cambridge, Tennis Court Road, Cambridge, CB2 1QN, UK.
- Gene Lay Institute of Immunology and Inflammation, Brigham and Women's Hospital, Massachusetts General Hospital, and Harvard Medical School, Boston, MA, 02115, USA.
| |
|
66
|
Romo L, Findlay SD, Burge CB. Regulatory features aid interpretation of 3'UTR Variants. bioRxiv 2023:2023.08.01.551549. [PMID: 37577470 PMCID: PMC10418266 DOI: 10.1101/2023.08.01.551549] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/15/2023]
Abstract
Our ability to determine the clinical impact of variants in 3' untranslated regions (UTRs) of genes remains poor. We provide a thorough analysis of 3'UTR variants from several datasets. Variants in putative regulatory elements including RNA-binding protein motifs, eCLIP peaks, and microRNA sites are up to 16 times more likely than other variants to have gene expression and phenotype associations. Heterozygous variants in regulatory motifs result in allele-specific protein binding in cell lines and allele-specific gene expression differences in population studies. In addition, variants in shared regions of alternatively polyadenylated isoforms and those proximal to polyA sites are more likely to affect gene expression and phenotype. Finally, pathogenic 3'UTR variants in ClinVar are 20 times more likely than benign variants to fall in a regulatory site. We incorporated these findings into RegVar, a software tool that interprets regulatory elements and annotations for any 3'UTR variant, and predicts whether the variant is likely to affect gene expression or phenotype. This tool will help prioritize variants for experimental studies and identify pathogenic variants in patients.
Affiliation(s)
- Lindsay Romo
- Harvard Medical Genetics Training Program, Boston Children’s Hospital, Boston, MA 02115
| | - Scott D. Findlay
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, 02142
| | - Christopher B. Burge
- Department of Biology, Massachusetts Institute of Technology, Cambridge, MA, 02142
| |
|
67
|
Khodursky S, Zheng EB, Svetec N, Durkin SM, Benjamin S, Gadau A, Wu X, Zhao L. The evolution and mutational robustness of chromatin accessibility in Drosophila. bioRxiv 2023:2023.06.26.546587. [PMID: 37425760 PMCID: PMC10327059 DOI: 10.1101/2023.06.26.546587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2023]
Abstract
The evolution of regulatory regions in the genome plays a critical role in shaping the diversity of life. While this process is primarily sequence-dependent, the enormous complexity of biological systems has made it difficult to understand the factors underlying regulation and its evolution. Here, we apply deep neural networks as a tool to investigate the sequence determinants underlying chromatin accessibility in different tissues of Drosophila. We train hybrid convolution-attention neural networks to accurately predict ATAC-seq peaks using only local DNA sequences as input. We show that a model trained in one species has nearly identical performance when tested in another species, implying that the sequence determinants of accessibility are highly conserved. Indeed, model performance remains excellent even in distantly-related species. By using our model to examine species-specific gains in chromatin accessibility, we find that their orthologous inaccessible regions in other species have surprisingly similar model outputs, suggesting that these regions may be ancestrally poised for evolution. We then use in silico saturation mutagenesis to reveal evidence of selective constraint acting specifically on inaccessible chromatin regions. We further show that chromatin accessibility can be accurately predicted from short subsequences in each example. However, in silico knock-out of these sequences does not qualitatively impair classification, implying that chromatin accessibility is mutationally robust. Subsequently, we demonstrate that chromatin accessibility is predicted to be robust to large-scale random mutation even in the absence of selection. We also perform in silico evolution experiments under the regime of strong selection and weak mutation (SSWM) and show that chromatin accessibility can be extremely malleable despite its mutational robustness. However, selection acting in different directions in a tissue-specific manner can substantially slow adaptation. 
Finally, we identify motifs predictive of chromatin accessibility and recover motifs corresponding to known chromatin accessibility activators and repressors. These results demonstrate the conservation of the sequence determinants of accessibility and the general robustness of chromatin accessibility, as well as the power of deep neural networks as tools to answer fundamental questions in regulatory genomics and evolution.
Affiliation(s)
- Samuel Khodursky
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
- These authors contributed equally
| | - Eric B Zheng
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
- These authors contributed equally
| | - Nicolas Svetec
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
| | - Sylvia M Durkin
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
- Current Address: Department of Integrative Biology and Museum of Vertebrate Zoology, University of California, Berkeley, Berkeley, CA, USA
| | - Sigi Benjamin
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
| | - Alice Gadau
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
| | - Xia Wu
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
| | - Li Zhao
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
| |
|
68
|
Novakovsky G, Fornes O, Saraswat M, Mostafavi S, Wasserman WW. ExplaiNN: interpretable and transparent neural networks for genomics. Genome Biol 2023; 24:154. [PMID: 37370113 DOI: 10.1186/s13059-023-02985-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 11/20/2022] [Accepted: 06/12/2023] [Indexed: 06/29/2023] Open
Abstract
Deep learning models such as convolutional neural networks (CNNs) excel in genomic tasks but lack interpretability. We introduce ExplaiNN, which combines the expressiveness of CNNs with the interpretability of linear models. ExplaiNN can predict TF binding, chromatin accessibility, and de novo motifs, achieving performance comparable to state-of-the-art methods. Its predictions are transparent, providing global (cell state level) as well as local (individual sequence level) biological insights into the data. ExplaiNN can serve as a plug-and-play platform for pretrained models and annotated position weight matrices. ExplaiNN aims to accelerate the adoption of deep learning in genomic sequence analysis by domain experts.
Affiliation(s)
- Gherman Novakovsky
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
| | - Oriol Fornes
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
| | - Manu Saraswat
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
- Division of Computational Genomics and Systems Genetics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- European Molecular Biology Laboratory (EMBL), Genome Biology Unit, Heidelberg, Germany
| | - Sara Mostafavi
- Paul G. Allen School of Computer Science and Engineering, University of Washington (UW), Seattle, USA
| | - Wyeth W Wasserman
- Department of Medical Genetics, Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada.
| |
|
69
|
Stefanini M, Lovino M, Cucchiara R, Ficarra E. Predicting gene and protein expression levels from DNA and protein sequences with Perceiver. Computer Methods and Programs in Biomedicine 2023; 234:107504. [PMID: 37004267 DOI: 10.1016/j.cmpb.2023.107504] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/25/2022] [Revised: 03/06/2023] [Accepted: 03/21/2023] [Indexed: 06/19/2023]
Abstract
BACKGROUND AND OBJECTIVE: The functions of an organism and its biological processes result from the expression of genes and proteins. Therefore, quantifying and predicting mRNA and protein levels is a crucial aspect of scientific research. Concerning the prediction of mRNA levels, the available approaches use the sequence upstream and downstream of the Transcription Start Site (TSS) as input to neural networks. State-of-the-art models (e.g., Xpresso and Basenji) predict mRNA levels using Convolutional (CNN) or Long Short Term Memory (LSTM) networks. However, CNN prediction depends on convolutional kernel size, and LSTMs struggle to capture long-range dependencies in the sequence. Concerning the prediction of protein levels, as far as we know, there is no model for predicting protein levels by exploiting the gene or protein sequences. METHODS: Here, we use a new model type (called Perceiver) for mRNA and protein level prediction, exploiting a Transformer-based architecture with an attention module to attend to long-range interactions in the sequences. In addition, the Perceiver model overcomes the quadratic complexity of the standard Transformer architectures. This work's contributions are: (1) a DNAPerceiver model to predict mRNA levels from the sequence upstream and downstream of the TSS; (2) a ProteinPerceiver model to predict protein levels from the protein sequence; (3) a Protein&DNAPerceiver model to predict protein levels from TSS and protein sequences. RESULTS: The models are evaluated on cell lines, mice, glioblastoma, and lung cancer tissues. The results show the effectiveness of the Perceiver-type models in predicting mRNA and protein levels. CONCLUSIONS: This paper presents a Perceiver architecture for mRNA and protein level prediction. In the future, inserting regulatory and epigenetic information into the model could improve mRNA and protein level predictions. The source code is freely available at https://github.com/MatteoStefanini/DNAPerceiver.
Affiliation(s)
- Matteo Stefanini
- DIEF, University of Modena and Reggio Emilia, Via Vivarelli 10/1, Modena, 41125, Italy
| | - Marta Lovino
- DIEF, University of Modena and Reggio Emilia, Via Vivarelli 10/1, Modena, 41125, Italy.
| | - Rita Cucchiara
- DIEF, University of Modena and Reggio Emilia, Via Vivarelli 10/1, Modena, 41125, Italy
| | - Elisa Ficarra
- DIEF, University of Modena and Reggio Emilia, Via Vivarelli 10/1, Modena, 41125, Italy
| |
|
70
|
Georgakopoulos-Soares I, Deng C, Agarwal V, Chan CSY, Zhao J, Inoue F, Ahituv N. Transcription factor binding site orientation and order are major drivers of gene regulatory activity. Nat Commun 2023; 14:2333. [PMID: 37087538 PMCID: PMC10122648 DOI: 10.1038/s41467-023-37960-5] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 10/04/2022] [Accepted: 04/06/2023] [Indexed: 04/24/2023] Open
Abstract
The gene regulatory code and grammar remain largely unknown, precluding our ability to link phenotype to genotype in regulatory sequences. Here, using a massively parallel reporter assay (MPRA) of 209,440 sequences, we examine all possible pair and triplet combinations, permutations and orientations of eighteen liver-associated transcription factor binding sites (TFBS). We find that TFBS orientation and order have a major effect on gene regulatory activity. Corroborating these results with genomic analyses, we find clear human promoter TFBS orientation biases and similar TFBS orientation and order transcriptional effects in an MPRA that tested 164,307 liver candidate regulatory elements. Additionally, by adding TFBS orientation to a model that predicts expression from sequence we improve performance by 7.7%. Collectively, our results show that TFBS orientation and order have a significant effect on gene regulatory activity and need to be considered when analyzing the functional effect of variants on the activity of these sequences.
Affiliation(s)
- Ilias Georgakopoulos-Soares
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA.
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA.
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA.
| | - Chengyu Deng
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
| | - Vikram Agarwal
- mRNA Center of Excellence, Sanofi Pasteur Inc., Waltham, MA, USA
| | - Candace S Y Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
| | - Jingjing Zhao
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA
| | - Fumitaka Inoue
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
| | - Nadav Ahituv
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA.
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA.
| |
|
71
|
Rodrigues DC, Mufteev M, Yuki KE, Narula A, Wei W, Piekna A, Liu J, Pasceri P, Rissland OS, Wilson MD, Ellis J. Buffering of transcription rate by mRNA half-life is a conserved feature of Rett syndrome models. Nat Commun 2023; 14:1896. [PMID: 37019888 PMCID: PMC10076348 DOI: 10.1038/s41467-023-37339-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/26/2022] [Accepted: 03/13/2023] [Indexed: 04/07/2023] Open
Abstract
Transcriptional changes in Rett syndrome (RTT) are assumed to directly correlate with steady-state mRNA levels, but limited evidence in mice suggests that changes in transcription can be compensated by post-transcriptional regulation. We measure transcription rate and mRNA half-life changes in RTT patient neurons using RATEseq, and re-interpret nuclear and whole-cell RNAseq from Mecp2 mice. Genes are dysregulated by changing transcription rate or half-life and are buffered when both change. We utilized classifier models to predict the direction of transcription rate changes and find that combined frequencies of three dinucleotides are better predictors than CA and CG. MicroRNA and RNA-binding Protein (RBP) motifs are enriched in 3'UTRs of genes with half-life changes. Nuclear RBP motifs are enriched on buffered genes with increased transcription rate. We identify post-transcriptional mechanisms in humans and mice that alter half-life or buffer transcription rate changes when a transcriptional modulator gene is mutated in a neurodevelopmental disorder.
Affiliation(s)
- Deivid C Rodrigues
- Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
| | - Marat Mufteev
- Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
| | - Kyoko E Yuki
- Genetics & Genome Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
| | - Ashrut Narula
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Molecular Medicine, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
| | - Wei Wei
- Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
| | - Alina Piekna
- Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
| | - Jiajie Liu
- Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
| | - Peter Pasceri
- Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
| | - Olivia S Rissland
- Molecular Medicine, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
- RNA Bioscience Initiative and Department of Biochemistry & Molecular Genetics, University of Colorado School of Medicine, Aurora, CO, 80045, USA
| | - Michael D Wilson
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada
- Genetics & Genome Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada
| | - James Ellis
- Developmental & Stem Cell Biology, Hospital for Sick Children, Toronto, ON, M5G 0A4, Canada.
- Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 1A8, Canada.
| |
|
72
|
Karollus A, Mauermeier T, Gagneur J. Current sequence-based models capture gene expression determinants in promoters but mostly ignore distal enhancers. Genome Biol 2023; 24:56. [PMID: 36973806 PMCID: PMC10045630 DOI: 10.1186/s13059-023-02899-9] [Citation(s) in RCA: 38] [Impact Index Per Article: 19.0] [Received: 09/14/2022] [Accepted: 03/16/2023] [Indexed: 03/29/2023] Open
Abstract
BACKGROUND The largest sequence-based models of transcription control to date are obtained by predicting genome-wide gene regulatory assays across the human genome. This setting is fundamentally correlative, as those models are exposed during training solely to the sequence variation between human genes that arose through evolution, questioning the extent to which those models capture genuine causal signals. RESULTS Here we confront predictions of state-of-the-art models of transcription regulation against data from two large-scale observational studies and five deep perturbation assays. The most advanced of these sequence-based models, Enformer, by and large, captures causal determinants of human promoters. However, models fail to capture the causal effects of enhancers on expression, notably in medium to long distances and particularly for highly expressed promoters. More generally, the predicted impact of distal elements on gene expression predictions is small and the ability to correctly integrate long-range information is significantly more limited than the receptive fields of the models suggest. This is likely caused by the escalating class imbalance between actual and candidate regulatory elements as distance increases. CONCLUSIONS Our results suggest that sequence-based models have advanced to the point that in silico study of promoter regions and promoter variants can provide meaningful insights and we provide practical guidance on how to use them. Moreover, we foresee that it will require significantly more and particularly new kinds of data to train models accurately accounting for distal elements.
Affiliation(s)
- Alexander Karollus
- School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.
| | - Thomas Mauermeier
- School of Computation, Information and Technology, Technical University of Munich, Garching, Germany
| | - Julien Gagneur
- School of Computation, Information and Technology, Technical University of Munich, Garching, Germany.
- Institute of Human Genetics, School of Medicine, Technical University of Munich, Munich, Germany.
- Institute of Computational Biology, Helmholtz Zentrum München, Neuherberg, Germany.
- Munich Data Science Institute, Technical University of Munich, Garching, Germany.
| |
|
73
|
Agarwal V, Inoue F, Schubach M, Martin BK, Dash PM, Zhang Z, Sohota A, Noble WS, Yardimci GG, Kircher M, Shendure J, Ahituv N. Massively parallel characterization of transcriptional regulatory elements in three diverse human cell types. bioRxiv 2023:2023.03.05.531189. [PMID: 36945371 PMCID: PMC10028905 DOI: 10.1101/2023.03.05.531189] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Indexed: 03/11/2023]
Abstract
The human genome contains millions of candidate cis-regulatory elements (CREs) with cell-type-specific activities that shape both health and myriad disease states. However, we lack a functional understanding of the sequence features that control the activity and cell-type-specific features of these CREs. Here, we used lentivirus-based massively parallel reporter assays (lentiMPRAs) to test the regulatory activity of over 680,000 sequences, representing a nearly comprehensive set of all annotated CREs among three cell types (HepG2, K562, and WTC11), finding 41.7% to be functional. By testing sequences in both orientations, we find promoters to have significant strand orientation effects. We also observe that their 200 nucleotide cores function as non-cell-type-specific 'on switches' providing similar expression levels to their associated gene. In contrast, enhancers have weaker orientation effects, but increased tissue-specific characteristics. Utilizing our lentiMPRA data, we develop sequence-based models to predict CRE function with high accuracy and delineate regulatory motifs. Testing an additional lentiMPRA library encompassing 60,000 CREs in all three cell types, we further identified factors that determine cell-type specificity. Collectively, our work provides an exhaustive catalog of functional CREs in three widely used cell lines, and showcases how large-scale functional measurements can be used to dissect regulatory grammar.
Affiliation(s)
- Vikram Agarwal
- Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- mRNA Center of Excellence, Sanofi Pasteur Inc., Waltham, MA 02451, USA
| | - Fumitaka Inoue
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA 94158, USA
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA 94158, USA
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
| | - Max Schubach
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, 10178, Berlin, Germany
| | - Beth K. Martin
- Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
| | - Pyaree Mohan Dash
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, 10178, Berlin, Germany
| | - Zicong Zhang
- Institute for the Advanced Study of Human Biology (WPI-ASHBi), Kyoto University, Kyoto, Japan
| | - Ajuni Sohota
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA 94158, USA
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA 94158, USA
| | - William Stafford Noble
- Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
| | - Galip Gürkan Yardimci
- Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Knight Cancer Institute, Oregon Health and Science University, Portland, OR, USA
- Cancer Early Detection Advanced Research Center, Oregon Health and Science University, Portland, OR, USA
| | - Martin Kircher
- Berlin Institute of Health at Charité - Universitätsmedizin Berlin, 10178, Berlin, Germany
- Institute of Human Genetics, University Medical Center Schleswig-Holstein, University of Lübeck, Lübeck, Germany
| | - Jay Shendure
- Department of Genome Sciences, University of Washington, Seattle, WA 98195, USA
- Howard Hughes Medical Institute, Seattle, WA 98195, USA
- Brotman Baty Institute for Precision Medicine, University of Washington, Seattle, WA 98195, USA
- Allen Center for Cell Lineage Tracing, University of Washington, Seattle, WA 98195, USA
| | - Nadav Ahituv
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA 94158, USA
- Institute for Human Genetics, University of California San Francisco, San Francisco, CA 94158, USA
|
74
|
Deep learning in regulatory genomics: from identification to design. Curr Opin Biotechnol 2023; 79:102887. [PMID: 36640453 DOI: 10.1016/j.copbio.2022.102887] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/28/2022] [Revised: 11/12/2022] [Accepted: 12/14/2022] [Indexed: 01/14/2023]
Abstract
Genomics and deep learning are a natural match since both are data-driven fields. Regulatory genomics refers to functional noncoding DNA regulating gene expression. In recent years, deep learning applications in regulatory genomics have achieved remarkable advances, so much so that they have changed the rules of the game for computational methods in this field. Here, we review two emerging trends: (i) the modeling of very long input sequences (up to 200 kb), which requires self-matched modularization of model architecture; and (ii) the balance between model predictability and model interpretability, because the latter is better able to meet biological demands. Finally, we discuss how to employ these two routes to design synthetic regulatory DNA, as a promising strategy for optimizing crop agronomic properties.
|
75
|
Li Z, Gao E, Zhou J, Han W, Xu X, Gao X. Applications of deep learning in understanding gene regulation. Cell Reports Methods 2023; 3:100384. [PMID: 36814848 PMCID: PMC9939384 DOI: 10.1016/j.crmeth.2022.100384] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Indexed: 05/22/2023]
Abstract
Gene regulation is a central topic in cell biology. Advances in omics technologies and the accumulation of omics data have provided better opportunities for gene regulation studies than ever before. For this reason deep learning, as a data-driven predictive modeling approach, has been successfully applied to this field during the past decade. In this article, we aim to give a brief yet comprehensive overview of representative deep-learning methods for gene regulation. Specifically, we discuss and compare the design principles and datasets used by each method, creating a reference for researchers who wish to replicate or improve existing methods. We also discuss the common problems of existing approaches and prospectively introduce the emerging deep-learning paradigms that will potentially alleviate them. We hope that this article will provide a rich and up-to-date resource and shed light on future research directions in this area.
Affiliation(s)
- Zhongxiao Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Elva Gao
- The KAUST School, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Juexiao Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Wenkai Han
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xiaopeng Xu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
| | - Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- KAUST Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
|
76
|
Uzonyi A, Dierks D, Nir R, Kwon OS, Toth U, Barbosa I, Burel C, Brandis A, Rossmanith W, Le Hir H, Slobodin B, Schwartz S. Exclusion of m6A from splice-site proximal regions by the exon junction complex dictates m6A topologies and mRNA stability. Mol Cell 2023; 83:237-251.e7. [PMID: 36599352 DOI: 10.1016/j.molcel.2022.12.026] [Citation(s) in RCA: 92] [Impact Index Per Article: 46.0] [Received: 05/26/2022] [Revised: 11/04/2022] [Accepted: 12/21/2022] [Indexed: 01/05/2023]
Abstract
N6-methyladenosine (m6A), a widespread destabilizing mark on mRNA, is non-uniformly distributed across the transcriptome, yet the basis for its selective deposition is unknown. Here, we propose that m6A deposition is not selective. Instead, it is exclusion based: m6A consensus motifs are methylated by default, unless they are within a window of ∼100 nt from a splice junction. A simple model which we extensively validate, relying exclusively on presence of m6A motifs and exon-intron architecture, allows in silico recapitulation of experimentally measured m6A profiles. We provide evidence that exclusion from splice junctions is mediated by the exon junction complex (EJC), potentially via physical occlusion, and that previously observed associations between exon-intron architecture and mRNA decay are mechanistically mediated via m6A. Our findings establish a mechanism coupling nuclear mRNA splicing and packaging with the covalent installation of m6A, in turn controlling cytoplasmic decay.
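The exclusion rule summarized in this abstract (methylate every consensus motif unless it lies within about 100 nt of a splice junction) can be illustrated with a minimal sketch. This is not the authors' implementation: the use of the DRACH pattern as the consensus motif, the function name `predict_m6a_sites`, the hard 100 nt cutoff, and the toy sequence are all assumptions made here for illustration.

```python
import re

# Assumption: DRACH (D = A/G/T, R = A/G, H = A/C/T) stands in for the
# "m6A consensus motif"; the candidate methylated A is the third base.
# The lookahead lets overlapping motifs be reported.
DRACH = re.compile(r"(?=([AGT][AG]AC[ACT]))")


def predict_m6a_sites(mrna, junctions, window=100):
    """Return 0-based positions of motif adenosines predicted to be methylated.

    mrna      -- mature transcript sequence (DNA or RNA alphabet)
    junctions -- 0-based exon-exon junction positions along the transcript
    window    -- motifs at most this many nt from any junction are excluded
    """
    seq = mrna.upper().replace("U", "T")
    sites = []
    for match in DRACH.finditer(seq):
        a_pos = match.start() + 2  # index of the central adenosine
        # Methylated by default; excluded only when close to a junction.
        if all(abs(a_pos - j) > window for j in junctions):
            sites.append(a_pos)
    return sites


# Toy transcript: one motif far from the junction, one right next to it.
toy = "GGACT" + "T" * 200 + "GGACT"
print(predict_m6a_sites(toy, junctions=[210]))  # prints [2]
```

With an empty junction list (a hypothetical single-exon transcript) both motifs are reported, which mirrors the "methylated by default" premise.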
Affiliation(s)
- Anna Uzonyi
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7630031, Israel
| | - David Dierks
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7630031, Israel
| | - Ronit Nir
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7630031, Israel
| | - Oh Sung Kwon
- Institut de Biologie de l'Ecole Normale Supérieure (IBENS), Ecole Normale Supérieure, CNRS, INSERM, Université PSL, 75005 Paris, France
| | - Ursula Toth
- Center for Anatomy & Cell Biology, Medical University of Vienna, 1090 Vienna, Austria
| | - Isabelle Barbosa
- Institut de Biologie de l'Ecole Normale Supérieure (IBENS), Ecole Normale Supérieure, CNRS, INSERM, Université PSL, 75005 Paris, France
| | - Cindy Burel
- Institut de Biologie de l'Ecole Normale Supérieure (IBENS), Ecole Normale Supérieure, CNRS, INSERM, Université PSL, 75005 Paris, France
| | - Alexander Brandis
- Life Sciences Core Facilities, Weizmann Institute of Science, Rehovot 7630031, Israel
| | - Walter Rossmanith
- Center for Anatomy & Cell Biology, Medical University of Vienna, 1090 Vienna, Austria
| | - Hervé Le Hir
- Institut de Biologie de l'Ecole Normale Supérieure (IBENS), Ecole Normale Supérieure, CNRS, INSERM, Université PSL, 75005 Paris, France
| | - Boris Slobodin
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7630031, Israel; Department of Biochemistry, Rappaport Faculty of Medicine, Technion - Israel Institute of Technology, Haifa 31096, Israel.
| | - Schraga Schwartz
- Department of Molecular Genetics, Weizmann Institute of Science, Rehovot 7630031, Israel.
|
77
|
Zrimec J, Zelezniak A, Gruden K. Toward learning the principles of plant gene regulation. Trends in Plant Science 2022; 27:1206-1208. [PMID: 36100536 DOI: 10.1016/j.tplants.2022.08.010] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2022] [Revised: 08/09/2022] [Accepted: 08/17/2022] [Indexed: 06/15/2023]
Abstract
Advanced machine learning (ML) algorithms produce highly accurate models of gene expression, uncovering novel regulatory features in nucleotide sequences involving multiple cis-regulatory regions across whole genes and structural properties. These broaden our understanding of gene regulation and point to new principles to test and adopt in the field of plant science.
Affiliation(s)
- Jan Zrimec
- Department of Biotechnology and Systems Biology, National Institute of Biology, Večna pot 111, 1000 Ljubljana, Slovenia.
| | - Aleksej Zelezniak
- Department of Biology and Biological Engineering, Chalmers University of Technology, Kemivägen 10, 412 96, Gothenburg, Sweden
| | - Kristina Gruden
- Department of Biotechnology and Systems Biology, National Institute of Biology, Večna pot 111, 1000 Ljubljana, Slovenia.
|
78
|
Agarwal V, Kelley DR. The genetic and biochemical determinants of mRNA degradation rates in mammals. Genome Biol 2022; 23:245. [PMID: 36419176 PMCID: PMC9684954 DOI: 10.1186/s13059-022-02811-x] [Citation(s) in RCA: 38] [Impact Index Per Article: 12.7] [Received: 03/18/2022] [Accepted: 11/02/2022] [Indexed: 11/26/2022]
Abstract
BACKGROUND: Degradation rate is a fundamental aspect of mRNA metabolism, and the factors governing it remain poorly characterized. Understanding the genetic and biochemical determinants of mRNA half-life would enable more precise identification of variants that perturb gene expression through post-transcriptional gene regulatory mechanisms. RESULTS: We establish a compendium of 39 human and 27 mouse transcriptome-wide mRNA decay rate datasets. A meta-analysis of these data identified a prevalence of technical noise and measurement bias, induced partially by the underlying experimental strategy. Correcting for these biases allowed us to derive more precise, consensus measurements of half-life which exhibit enhanced consistency between species. We trained substantially improved statistical models based upon genetic and biochemical features to better predict half-life and characterize the factors molding it. Our state-of-the-art model, Saluki, is a hybrid convolutional and recurrent deep neural network which relies only upon an mRNA sequence annotated with coding frame and splice sites to predict half-life (r=0.77). The key novel principle learned by Saluki is that the spatial positioning of splice sites, codons, and RNA-binding motifs within an mRNA is strongly associated with mRNA half-life. Saluki predicts the impact of RNA sequences and genetic mutations therein on mRNA stability, in agreement with functional measurements derived from massively parallel reporter assays. CONCLUSIONS: Our work produces a more robust ground truth for transcriptome-wide mRNA half-lives in mammalian cells. Using these revised measurements, we trained Saluki, a model that is over 50% more accurate in predicting half-life from sequence than existing models. Saluki succinctly captures many of the known determinants of mRNA half-life and can be rapidly deployed to predict the functional consequences of arbitrary mutations in the transcriptome.
Affiliation(s)
- Vikram Agarwal
- Calico Life Sciences LLC, South San Francisco, CA, 94080, USA.
- Present Address: mRNA Center of Excellence, Sanofi Pasteur Inc., Waltham, MA, 02451, USA.
| | - David R Kelley
- Calico Life Sciences LLC, South San Francisco, CA, 94080, USA.
|
79
|
Lan AY, Corces MR. Deep learning approaches for noncoding variant prioritization in neurodegenerative diseases. Front Aging Neurosci 2022; 14:1027224. [PMID: 36466610 PMCID: PMC9716280 DOI: 10.3389/fnagi.2022.1027224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Accepted: 10/24/2022] [Indexed: 11/19/2022]
Abstract
Determining how noncoding genetic variants contribute to neurodegenerative dementias is fundamental to understanding disease pathogenesis, improving patient prognostication, and developing new clinical treatments. Next generation sequencing technologies have produced vast amounts of genomic data on cell type-specific transcription factor binding, gene expression, and three-dimensional chromatin interactions, with the promise of providing key insights into the biological mechanisms underlying disease. However, this data is highly complex, making it challenging for researchers to interpret, assimilate, and dissect. To this end, deep learning has emerged as a powerful tool for genome analysis that can capture the intricate patterns and dependencies within these large datasets. In this review, we organize and discuss the many unique model architectures, development philosophies, and interpretation methods that have emerged in the last few years with a focus on using deep learning to predict the impact of genetic variants on disease pathogenesis. We highlight both broadly-applicable genomic deep learning methods that can be fine-tuned to disease-specific contexts as well as existing neurodegenerative disease research, with an emphasis on Alzheimer's-specific literature. We conclude with an overview of the future of the field at the intersection of neurodegeneration, genomics, and deep learning.
Affiliation(s)
- Alexander Y. Lan
- Gladstone Institute of Neurological Disease, San Francisco, CA, United States
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, United States
- Department of Neurology, University of California San Francisco, San Francisco, CA, United States
| | - M. Ryan Corces
- Gladstone Institute of Neurological Disease, San Francisco, CA, United States
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, United States
- Department of Neurology, University of California San Francisco, San Francisco, CA, United States
|
80
|
Li J, Wang J, Zhang P, Wang R, Mei Y, Sun Z, Fei L, Jiang M, Ma L, E W, Chen H, Wang X, Fu Y, Wu H, Liu D, Wang X, Li J, Guo Q, Liao Y, Yu C, Jia D, Wu J, He S, Liu H, Ma J, Lei K, Chen J, Han X, Guo G. Deep learning of cross-species single-cell landscapes identifies conserved regulatory programs underlying cell types. Nat Genet 2022; 54:1711-1720. [PMID: 36229673 DOI: 10.1038/s41588-022-01197-7] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 07/18/2021] [Accepted: 08/31/2022] [Indexed: 11/09/2022]
Abstract
Despite extensive efforts to generate and analyze reference genomes, genetic models to predict gene regulation and cell fate decisions are lacking for most species. Here, we generated whole-body single-cell transcriptomic landscapes of zebrafish, Drosophila and earthworm. We then integrated cell landscapes from eight representative metazoan species to study gene regulation across evolution. Using these uniformly constructed cross-species landscapes, we developed a deep-learning-based strategy, Nvwa, to predict gene expression and identify regulatory sequences at the single-cell level. We systematically compared cell-type-specific transcription factors to reveal conserved genetic regulation in vertebrates and invertebrates. Our work provides a valuable resource and offers a new strategy for studying regulatory grammar in diverse biological systems.
Affiliation(s)
- Jiaqi Li
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, China
| | - Jingjing Wang
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, China
| | - Peijing Zhang
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, China
| | - Renying Wang
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Yuqing Mei
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Zhongyi Sun
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Lijiang Fei
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Mengmeng Jiang
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, China
| | - Lifeng Ma
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Weigao E
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Haide Chen
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, China
| | - Xinru Wang
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Yuting Fu
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Hanyu Wu
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Daiyuan Liu
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Xueyi Wang
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Jingyu Li
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Qile Guo
- Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, China
| | - Yuan Liao
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Zhejiang Provincial Key Laboratory for Tissue Engineering and Regenerative Medicine, Dr. Li Dak Sum & Yip Yio Chin Center for Stem Cell and Regenerative Medicine, Hangzhou, China
| | - Chengxuan Yu
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Danmei Jia
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China
| | - Jian Wu
- Division of Hepatobiliary and Pancreatic Surgery, Department of Surgery, First Affiliated Hospital School of Medicine, Zhejiang University, Hangzhou, China
| | - Shibo He
- College of Control Science and Engineering, Zhejiang University, Hangzhou, China
| | - Huanju Liu
- Women's Hospital and Institute of Genetics, Zhejiang University School of Medicine, Hangzhou, China
| | - Jun Ma
- Women's Hospital and Institute of Genetics, Zhejiang University School of Medicine, Hangzhou, China
| | - Kai Lei
- Westlake Laboratory of Life Sciences and Biomedicine, Key Laboratory of Growth Regulation and Translational Research of Zhejiang Province, School of Life Sciences, Westlake University, Hangzhou, China
| | - Jiming Chen
- College of Control Science and Engineering, Zhejiang University, Hangzhou, China
| | - Xiaoping Han
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Zhejiang Provincial Key Laboratory for Tissue Engineering and Regenerative Medicine, Dr. Li Dak Sum & Yip Yio Chin Center for Stem Cell and Regenerative Medicine, Hangzhou, China
| | - Guoji Guo
- Center for Stem Cell and Regenerative Medicine and Bone Marrow Transplantation Center of the First Affiliated Hospital, Zhejiang University School of Medicine, Hangzhou, China; Liangzhu Laboratory, Zhejiang University Medical Center, Hangzhou, China; Zhejiang University-University of Edinburgh Institute, Zhejiang University School of Medicine, Zhejiang University, Hangzhou, China; Zhejiang Provincial Key Laboratory for Tissue Engineering and Regenerative Medicine, Dr. Li Dak Sum & Yip Yio Chin Center for Stem Cell and Regenerative Medicine, Hangzhou, China; Alibaba-Zhejiang University Joint Research Center of Future Digital Healthcare, Hangzhou, China
|
81
|
Orozco G, Schoenfelder S, Walker N, Eyre S, Fraser P. 3D genome organization links non-coding disease-associated variants to genes. Front Cell Dev Biol 2022; 10:995388. [PMID: 36340032 PMCID: PMC9631826 DOI: 10.3389/fcell.2022.995388] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 07/15/2022] [Accepted: 09/27/2022] [Indexed: 11/13/2022]
Abstract
Genome sequencing has revealed over 300 million genetic variations in human populations. Over 90% of variants are single nucleotide polymorphisms (SNPs); the remainder include short deletions or insertions and small numbers of structural variants. Hundreds of thousands of these variants have been associated with specific phenotypic traits and diseases through genome-wide association studies, which link significant differences in variant frequencies with specific phenotypes among large groups of individuals. Only 5% of disease-associated SNPs are located in gene coding sequences, with the potential to disrupt gene expression or alter the function of encoded proteins. The remaining 95% of disease-associated SNPs are located in non-coding DNA sequences, which make up 98% of the genome. The role of non-coding, disease-associated SNPs, many of which are located at considerable distances from any gene, was at first a mystery until the discovery that gene promoters regularly interact with distal regulatory elements to control gene expression. Disease-associated SNPs are enriched at the millions of gene regulatory elements that are dispersed throughout the non-coding sequences of the genome, suggesting they function as gene regulation variants. Assigning specific regulatory elements to the genes they control is not straightforward since they can be millions of base pairs apart. In this review we describe how understanding 3D genome organization can identify specific interactions between gene promoters and distal regulatory elements and how 3D genomics can link disease-associated SNPs to their target genes. Understanding which gene or genes contribute to a specific disease is the first step in designing rational therapeutic interventions.
Affiliation(s)
- Gisela Orozco
- Centre for Genetics and Genomics Versus Arthritis, Division of Musculoskeletal and Dermatological Sciences, School of Biological Sciences, Faculty of Biology, Medicine and Health, The University of Manchester, Manchester Academic Health Science Centre, Manchester, United Kingdom
- NIHR Manchester Biomedical Research Centre, Manchester University Foundation Trust, Manchester, United Kingdom
| | - Stefan Schoenfelder
- Enhanc3D Genomics Ltd., Cambridge, United Kingdom
- Epigenetics Programme, The Babraham Institute, Babraham Research Campus, CB22 3AT Cambridge, Cambridge, United Kingdom
| | | | - Stephan Eyre
- Centre for Genetics and Genomics Versus Arthritis, Division of Musculoskeletal and Dermatological Sciences, School of Biological Sciences, Faculty of Biology, Medicine and Health, The University of Manchester, Manchester Academic Health Science Centre, Manchester, United Kingdom
- NIHR Manchester Biomedical Research Centre, Manchester University Foundation Trust, Manchester, United Kingdom
| | - Peter Fraser
- Enhanc3D Genomics Ltd., Cambridge, United Kingdom
- Department of Biological Science, Florida State University, Tallahassee, FL, United States
|
82
|
Wrightsman T, Marand AP, Crisp PA, Springer NM, Buckler ES. Modeling chromatin state from sequence across angiosperms using recurrent convolutional neural networks. The Plant Genome 2022; 15:e20249. [PMID: 35924336 DOI: 10.1002/tpg2.20249] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/08/2021] [Accepted: 06/20/2022] [Indexed: 06/06/2024]
Abstract
Accessible chromatin regions (ACRs) are critical components of gene regulation but modeling them directly from sequence remains challenging, especially within plants, whose mechanisms of chromatin remodeling are less understood than in animals. We trained an existing deep-learning architecture, DanQ, on data from 12 angiosperm species to predict the leaf chromatin accessibility of sequence windows within and across species. We also trained DanQ on DNA methylation data from 10 angiosperms because unmethylated regions have been shown to overlap significantly with ACRs in some plants. The across-species models have comparable or even superior performance to a model trained within species, suggesting strong conservation of chromatin mechanisms across angiosperms. Testing a maize (Zea mays L.) held-out model on a multi-tissue chromatin accessibility panel revealed our models are best at predicting constitutively accessible chromatin regions, with diminishing performance as cell-type specificity increases. Using a combination of interpretation methods, we ranked JASPAR motifs by their importance to each model and saw that the TCP and AP2/ERF transcription factor (TF) families consistently ranked highly. We embedded the top three JASPAR motifs for each model at all possible positions on both strands in our sequence window and observed position- and strand-specific patterns in their importance to the model. With our publicly available across-species 'a2z' model it is now feasible to predict the chromatin accessibility and methylation landscape of any angiosperm genome.
Affiliation(s)
- Travis Wrightsman
- Section of Plant Breeding and Genetics, Cornell Univ., Ithaca, NY, 14853, USA
| | | | - Peter A Crisp
- School of Agriculture and Food Sciences, Univ. of Queensland, Brisbane, QLD, 4072, Australia
| | - Nathan M Springer
- Dep. of Plant and Microbial Biology, Univ. of Minnesota, Saint Paul, MN, 55108, USA
| | - Edward S Buckler
- Section of Plant Breeding and Genetics, Cornell Univ., Ithaca, NY, 14853, USA
- Institute for Genomic Diversity, Cornell Univ., Ithaca, NY, 14853, USA
- USDA-ARS, Ithaca, NY, 14853, USA
|
83
|
Controlling gene expression with deep generative design of regulatory DNA. Nat Commun 2022; 13:5099. [PMID: 36042233 PMCID: PMC9427793 DOI: 10.1038/s41467-022-32818-8] [Citation(s) in RCA: 36] [Impact Index Per Article: 12.0] [Received: 01/27/2022] [Accepted: 08/18/2022] [Indexed: 11/25/2022]
Abstract
Design of de novo synthetic regulatory DNA is a promising avenue to control gene expression in biotechnology and medicine. Using mutagenesis typically requires screening sizable random DNA libraries, which limits the designs to span merely a short section of the promoter and restricts their control of gene expression. Here, we prototype a deep learning strategy based on generative adversarial networks (GAN) by learning directly from genomic and transcriptomic data. Our ExpressionGAN can traverse the entire regulatory sequence-expression landscape in a gene-specific manner, generating regulatory DNA with prespecified target mRNA levels spanning the whole gene regulatory structure including coding and adjacent non-coding regions. Despite high sequence divergence from natural DNA, in vivo measurements show that 57% of the highly-expressed synthetic sequences surpass the expression levels of highly-expressed natural controls. This demonstrates the applicability and relevance of deep generative design to expand our knowledge and control of gene expression regulation in any desired organism, condition or tissue. Design of de novo synthetic regulatory DNA is a promising avenue to control gene expression in biotechnology and medicine. Here the authors present ExpressionGAN, a generative adversarial network that uses genomic and transcriptomic data to generate regulatory sequences.
|
84
|
Novel functional genomics approaches bridging neuroscience and psychiatry. Biological Psychiatry Global Open Science 2022. [PMID: 37519472 PMCID: PMC10382709 DOI: 10.1016/j.bpsgos.2022.07.005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/19/2022] Open
Abstract
The possibility of establishing a metric of individual genetic risk for a particular disease or trait has sparked the interest of the clinical and research communities, with many groups developing and validating genomic profiling methodologies for their potential application in clinical care. Current approaches for calculating genetic risk to specific psychiatric conditions consist of aggregating genome-wide association studies-derived estimates into polygenic risk scores, which broadly represent the number of inherited risk alleles for an individual. While the traditional approach for polygenic risk score calculation aggregates estimates of gene-disease associations, novel alternative approaches have started to consider functional molecular phenotypes that are closer to genetic variation and are less penalized by the multiple testing required in genome-wide association studies. Moving the focus from genotype-disease to genotype-gene regulation frameworks, these novel approaches incorporate prior knowledge regarding biological processes involved in disease and aggregate estimates for the association of genotypes and phenotypes using multi-omics data modalities. In this review, we discuss and list different functional genomics tools that can be used and integrated to inform researchers and clinicians for a better understanding and diagnosis of psychopathology. We suggest that these novel approaches can help generate biologically driven hypotheses for polygenic signals that can ultimately serve the clinical community as potential biomarkers of psychiatric disease susceptibility.
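The aggregation step described in this abstract, summing GWAS-derived estimates over inherited risk alleles, can be illustrated with a minimal sketch. It assumes the simplest additive form of a polygenic risk score (effect size times risk-allele dosage, summed over variants). The function name, variant identifiers, and effect sizes below are hypothetical and are not taken from the review.

```python
from typing import Mapping


def polygenic_risk_score(
    effect_sizes: Mapping[str, float],
    dosages: Mapping[str, int],
) -> float:
    """Return the additive score: sum of effect size x risk-allele dosage.

    effect_sizes maps a variant ID to its GWAS-derived effect estimate.
    dosages maps a variant ID to the number of risk alleles carried (0, 1, or 2).
    """
    score = 0.0
    for variant, beta in effect_sizes.items():
        # Assumption: a variant missing from the genotype contributes 0 copies.
        dosage = dosages.get(variant, 0)
        if dosage not in (0, 1, 2):
            raise ValueError(f"dosage for {variant} must be 0, 1, or 2")
        score += beta * dosage
    return score


# Hypothetical effect sizes and one individual's genotype.
effect_sizes = {"variant_a": 0.5, "variant_b": -0.25, "variant_c": 0.125}
dosages = {"variant_a": 2, "variant_b": 1, "variant_c": 0}

print(polygenic_risk_score(effect_sizes, dosages))  # 0.75
```

Real pipelines typically add steps this sketch omits, such as pruning correlated variants and thresholding or shrinking effect sizes.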
|
85
|
Kang Y, Jung WJ, Brent MR. Predicting which genes will respond to transcription factor perturbations. G3 (Bethesda, Md.) 2022; 12:jkac144. [PMID: 35666184 PMCID: PMC9339286 DOI: 10.1093/g3journal/jkac144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2022] [Accepted: 05/25/2022] [Indexed: 11/13/2022]
Abstract
The ability to predict which genes will respond to the perturbation of a transcription factor serves as a benchmark for our systems-level understanding of transcriptional regulatory networks. In previous work, machine learning models have been trained to predict static gene expression levels in a biological sample by using data from the same or similar samples, including data on their transcription factor binding locations, histone marks, or DNA sequence. We report on a different challenge: training machine learning models to predict which genes will respond to the perturbation of a transcription factor without using any data from the perturbed cells. We find that existing transcription factor location data (ChIP-seq) from human cells have very little detectable utility for predicting which genes will respond to perturbation of a transcription factor. Features of genes, including their preperturbation expression level and expression variation, are very useful for predicting responses to perturbation of any transcription factor. This shows that some genes are poised to respond to transcription factor perturbations and others are resistant, shedding light on why it has been so difficult to predict responses from binding locations. Certain histone marks, including H3K4me1 and H3K4me3, have some predictive power when located downstream of the transcription start site. However, the predictive power of histone marks is much less than that of gene expression level and expression variation. Sequence-based or epigenetic properties of genes strongly influence their tendency to respond to direct transcription factor perturbations, partially explaining the oft-noted difficulty of predicting responsiveness from transcription factor binding location data. These molecular features are largely reflected in and summarized by the gene's expression level and expression variation. Code is available at https://github.com/BrentLab/TFPertRespExplainer.
Affiliation(s)
- Yiming Kang
- Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA
- Department of Computer Science and Engineering, Washington University, St. Louis, MO 63108, USA
- Wooseok J Jung
- Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA
- Department of Computer Science and Engineering, Washington University, St. Louis, MO 63108, USA
- Michael R Brent
- Center for Genome Sciences and Systems Biology, Washington University School of Medicine, St. Louis, MO 63110, USA
- Department of Computer Science and Engineering, Washington University, St. Louis, MO 63108, USA
- Department of Genetics, Washington University School of Medicine, St. Louis, MO 63110, USA
|
86
|
Assessing comparative importance of DNA sequence and epigenetic modifications on gene expression using a deep convolutional neural network. Comput Struct Biotechnol J 2022; 20:3814-3823. [PMID: 35891778 PMCID: PMC9307602 DOI: 10.1016/j.csbj.2022.07.014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/14/2022] [Revised: 07/05/2022] [Accepted: 07/05/2022] [Indexed: 11/26/2022] Open
Abstract
Gene expression is regulated at both transcriptional and post-transcriptional levels. DNA sequence and epigenetic modifications are key factors which regulate gene transcription. Understanding their complex interactions and their respective contributions to gene expression regulation remains a challenge in biological studies. We have developed iSEGnet, a deep convolutional neural network framework that predicts mRNA abundance using information on DNA sequences as well as epigenetic modifications within genes and their cis-regulatory regions. We demonstrate that our framework outperforms other machine learning models in predicting mRNA abundance using transcriptional and epigenetic profiles from six distinct cell lines/types chosen from ENCODE. Analysis of the learned models also reveals that specific regions around promoters and transcription termination sites are most important for gene expression regulation. Using the method of Integrated Gradients, we identify narrow segments in these regions which are most likely to impact gene expression for a specific epigenetic modification. We further show that these identified segments are enriched in known active regulatory regions by comparing them with transcription factor binding sites obtained via ChIP-seq. Moreover, we demonstrate how iSEGnet can uncover potential transcription factors that have regulatory functions in cancer using two cancer multi-omics data sets.
|
87
|
Alharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 2022; 16:26. [PMID: 35879805 PMCID: PMC9317091 DOI: 10.1186/s40246-022-00396-x] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 11/24/2021] [Accepted: 07/12/2022] [Indexed: 12/02/2022] Open
Abstract
Genomics is advancing towards data-driven science. With the advent of high-throughput data-generating technologies in human genomics, we are overwhelmed by the volume of genomic data. Artificial intelligence, especially deep learning methods, has been instrumental in extracting knowledge and patterns from these data. In the current review, we address the development and application of deep learning methods and models in different subareas of human genomics. We assess which areas of genomics have been over- and under-charted by deep learning techniques. The deep learning algorithms underlying the genomic tools are discussed briefly in the later part of this review. Finally, we briefly discuss recent applications of deep learning tools in genomics. In conclusion, this review is timely for biotechnology and genomics scientists, guiding them on why, when and how to use deep learning methods to analyse human genomic data.
Affiliation(s)
- Wardah S Alharbi
- Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia
- Mamoon Rashid
- Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia
|
88
|
Cong Y, Endo T. Multi-Omics and Artificial Intelligence-Guided Drug Repositioning: Prospects, Challenges, and Lessons Learned from COVID-19. OMICS: A Journal of Integrative Biology 2022; 26:361-371. [PMID: 35759424 DOI: 10.1089/omi.2022.0068] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Indexed: 06/15/2023]
Abstract
Drug repurposing is of interest for therapeutics innovation in many human diseases including coronavirus disease 2019 (COVID-19). Methodological innovations in drug repurposing are currently being empowered by convergence of omics systems science and digital transformation of life sciences. This expert review article offers a systematic summary of the application of artificial intelligence (AI), particularly machine learning (ML), to drug repurposing and classifies and introduces the common clustering, dimensionality reduction, and other methods. We highlight, as a present-day high-profile example, the involvement of AI/ML-based drug discovery in the COVID-19 pandemic and discuss the collection and sharing of diverse data types, and the possible futures awaiting drug repurposing in an era of AI/ML and digital technologies. The article provides new insights on convergence of multi-omics and AI-based drug repurposing. We conclude with reflections on the various pathways to expedite innovation in drug development through drug repurposing for prompt responses to the current COVID-19 pandemic and future ecological crises in the 21st century.
Affiliation(s)
- Yi Cong
- Laboratory of Information Biology, Information Science and Technology, Hokkaido University, Sapporo, Japan
- Toshinori Endo
- Laboratory of Information Biology, Information Science and Technology, Hokkaido University, Sapporo, Japan
|
89
|
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
|
90
|
Lee D, Kim S. Knowledge-guided artificial intelligence technologies for decoding complex multiomics interactions in cells. Clin Exp Pediatr 2022; 65:239-249. [PMID: 34844399 PMCID: PMC9082244 DOI: 10.3345/cep.2021.01438] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2021] [Revised: 10/19/2021] [Accepted: 10/21/2021] [Indexed: 11/27/2022] Open
Abstract
Cells survive and proliferate through complex interactions among diverse molecules across multiomics layers. Conventional experimental approaches for identifying these interactions have built a firm foundation for molecular biology, but their scalability is gradually becoming inadequate compared to the rapid accumulation of multiomics data measured by high-throughput technologies. Therefore, the need for data-driven computational modeling of interactions within cells has been highlighted in recent years. The complexity of multiomics interactions is primarily due to their nonlinearity. That is, their accurate modeling requires intricate conditional dependencies, synergies, or antagonisms between considered genes or proteins, which retard experimental validations. Artificial intelligence (AI) technologies, including deep learning models, are optimal choices for handling complex nonlinear relationships between features that are scalable and produce large amounts of data. Thus, they have great potential for modeling multiomics interactions. Although there exist many AI-driven models for computational biology applications, relatively few explicitly incorporate the prior knowledge within model architectures or training procedures. Such guidance of models by domain knowledge will greatly reduce the amount of data needed to train models and constrain their vast expressive powers to focus on the biologically relevant space. Therefore, it can enhance a model's interpretability, reduce spurious interactions, and prove its validity and utility. Thus, to facilitate further development of knowledge-guided AI technologies for the modeling of multiomics interactions, here we review representative bioinformatics applications of deep learning models for multiomics interactions developed to date by categorizing them by guidance mode.
Affiliation(s)
- Dohoon Lee
- Bioinformatics Institute, Seoul National University, Seoul, Korea
- Sun Kim
- Interdisciplinary Program in Bioinformatics, Seoul National University, Seoul, Korea
- Department of Computer Science and Engineering, Seoul National University, Seoul, Korea
- Institute of Engineering Research, Seoul National University, Seoul, Korea
- AIGENDRUG Co., Ltd., Seoul, Korea
|
91
|
Accurate and highly interpretable prediction of gene expression from histone modifications. BMC Bioinformatics 2022; 23:151. [PMID: 35473556 PMCID: PMC9040271 DOI: 10.1186/s12859-022-04687-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/21/2021] [Accepted: 04/08/2022] [Indexed: 11/10/2022] Open
Abstract
Background: Histone Mark Modifications (HMs) are crucial actors in gene regulation, as they actively remodel chromatin to modulate transcriptional activity: aberrant combinatorial patterns of HMs have been connected with several diseases, including cancer. HMs are, however, reversible modifications: understanding their role in disease would allow the design of 'epigenetic drugs' for specific, non-invasive treatments. Standard statistical techniques were not entirely successful in extracting representative features from raw HM signals over gene locations. On the other hand, deep learning approaches allow for effective automatic feature extraction, but at the expense of model interpretation.
Results: Here, we propose ShallowChrome, a novel computational pipeline to model transcriptional regulation via HMs in both an accurate and interpretable way. We attain state-of-the-art results on the binary classification of gene transcriptional states over 56 cell-types from the REMC database, largely outperforming recent deep learning approaches. We interpret our models by extracting insightful gene-specific regulative patterns, and we analyse them for the specific case of the PAX5 gene over three differentiated blood cell lines. Finally, we compare the patterns we obtained with the characteristic emission patterns of ChromHMM, and show that ShallowChrome is able to coherently rank groups of chromatin states w.r.t. their transcriptional activity.
Conclusions: In this work we demonstrate that it is possible to model HM-modulated gene expression regulation in a highly accurate, yet interpretable way. Our feature extraction algorithm leverages data downstream of the identification of enriched regions to retrieve gene-wise, statistically significant and dynamically located features for each HM. These features are highly predictive of gene transcriptional state, and allow for accurate modeling by computationally efficient logistic regression models. These models allow a direct inspection and a rigorous interpretation, helping to formulate quantifiable hypotheses.
|
92
|
Abdalla M, Abdalla M. A general framework for predicting the transcriptomic consequences of non-coding variation and small molecules. PLoS Comput Biol 2022; 18:e1010028. [PMID: 35421087 PMCID: PMC9041867 DOI: 10.1371/journal.pcbi.1010028] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/03/2021] [Revised: 04/26/2022] [Accepted: 03/16/2022] [Indexed: 11/18/2022] Open
Abstract
Genome wide association studies (GWASs) for complex traits have implicated thousands of genetic loci. Most GWAS-nominated variants lie in noncoding regions, complicating the systematic translation of these findings into functional understanding. Here, we leverage convolutional neural networks to assist in this challenge. Our computational framework, peaBrain, models the transcriptional machinery of a tissue as a two-stage process: first, predicting the mean tissue specific abundance of all genes and second, incorporating the transcriptomic consequences of genotype variation to predict individual abundance on a subject-by-subject basis. We demonstrate that peaBrain accounts for the majority (>50%) of variance observed in mean transcript abundance across most tissues and outperforms regularized linear models in predicting the consequences of individual genotype variation. We highlight the validity of the peaBrain model by calculating non-coding impact scores that correlate with nucleotide evolutionary constraint that are also predictive of disease-associated variation and allele-specific transcription factor binding. We further show how these tissue-specific peaBrain scores can be leveraged to pinpoint functional tissues underlying complex traits, outperforming methods that depend on colocalization of eQTL and GWAS signals. We subsequently: (a) derive continuous dense embeddings of genes for downstream applications; (b) highlight the utility of the model in predicting transcriptomic impact of small molecules and shRNA (on par with in vitro experimental replication of external test sets); (c) explore how peaBrain can be used to model difficult-to-study processes (such as neural induction); and (d) identify putatively functional eQTLs that are missed by high-throughput experimental approaches.
Affiliation(s)
- Moustafa Abdalla
- Wellcome Trust Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
- Oxford Centre for Diabetes, Endocrinology and Metabolism, Radcliffe Department of Medicine, University of Oxford, Oxford, United Kingdom
- Computational Statistics and Machine Learning, Department of Statistics, University of Oxford, Oxford, United Kingdom
- Department of Surgery, Harvard Medical School, Boston, Massachusetts, United States of America
- Mohamed Abdalla
- Vector Institute for Artificial Intelligence, Toronto, Canada
- Department of Computer Science, University of Toronto, Toronto, Canada
|
93
|
Bigness J, Loinaz X, Patel S, Larschan E, Singh R. Integrating Long-Range Regulatory Interactions to Predict Gene Expression Using Graph Convolutional Networks. J Comput Biol 2022; 29:409-424. [PMID: 35325548 PMCID: PMC9125570 DOI: 10.1089/cmb.2021.0316] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 12/28/2022] Open
Abstract
Long-range regulatory interactions among genomic regions are critical for controlling gene expression, and their disruption has been associated with a host of diseases. However, when modeling the effects of regulatory factors, most deep learning models either neglect long-range interactions or fail to capture the inherent 3D structure of the underlying genomic organization. To address these limitations, we present a Graph Convolutional Model for Epigenetic Regulation of Gene Expression (GC-MERGE). Using a graph-based framework, the model incorporates important information about long-range interactions via a natural encoding of genomic spatial interactions into the graph representation. It integrates measurements of both the global genomic organization and the local regulatory factors, specifically histone modifications, to not only predict the expression of a given gene of interest but also quantify the importance of its regulatory factors. We apply GC-MERGE to data sets for three cell lines, GM12878 (lymphoblastoid), K562 (myelogenous leukemia), and HUVEC (human umbilical vein endothelial), and demonstrate its state-of-the-art predictive performance. Crucially, we show that our model is interpretable in terms of the observed biological regulatory factors, highlighting both the histone modifications and the interacting genomic regions contributing to a gene's predicted expression. We provide model explanations for multiple exemplar genes and validate them with evidence from the literature. Our model presents a novel setup for predicting gene expression by integrating multimodal data sets in a graph convolutional framework. More importantly, it enables interpretation of the biological mechanisms driving the model's predictions.
Affiliation(s)
- Jeremy Bigness
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, USA; Department of Computer Science, Brown University, Providence, Rhode Island, USA; Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, Rhode Island, USA
- Xavier Loinaz
- Department of Computer Science, Brown University, Providence, Rhode Island, USA
- Shalin Patel
- Division of Applied Mathematics, Brown University, Providence, Rhode Island, USA
- Erica Larschan
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, USA; Department of Molecular Biology, Cell Biology and Biochemistry, Brown University, Providence, Rhode Island, USA
- Ritambhara Singh
- Center for Computational Molecular Biology, Brown University, Providence, Rhode Island, USA; Department of Computer Science, Brown University, Providence, Rhode Island, USA
|
94
|
Deep neural network prediction of genome-wide transcriptome signatures - beyond the Black-box. NPJ Syst Biol Appl 2022; 8:9. [PMID: 35197482 PMCID: PMC8866467 DOI: 10.1038/s41540-022-00218-9] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/24/2021] [Accepted: 01/24/2022] [Indexed: 11/28/2022] Open
Abstract
Prediction algorithms for protein or gene structures, including transcription factor binding from sequence information, have been transformative in understanding gene regulation. Here we ask whether human transcriptomic profiles can be predicted solely from the expression of transcription factors (TFs). We find that the expression of 1600 TFs can explain >95% of the variance in 25,000 genes. Using the light-up technique to inspect the trained NN, we find an over-representation of known TF-gene regulations. Furthermore, the learned prediction network has a hierarchical organization. A smaller set of around 125 core TFs could explain close to 80% of the variance. Interestingly, reducing the number of TFs below 500 induces a rapid decline in prediction performance. Next, we evaluated the prediction model using transcriptional data from 22 human diseases. The TFs were sufficient to predict the dysregulation of the target genes (rho = 0.61, P < 10^-216). By inspecting the model, key causative TFs could be extracted for subsequent validation using disease-associated genetic variants. We demonstrate a methodology for constructing an interpretable neural network predictor, where analyses of the predictors identified key TFs that were inducing transcriptional changes during disease.
|
95
|
Abstract
DNA can determine where and when genes are expressed, but the full set of sequence determinants that control gene expression is unknown. Here, we measured the transcriptional activity of DNA sequences that represent an ~100 times larger sequence space than the human genome using massively parallel reporter assays (MPRAs). Machine learning models revealed that transcription factors (TFs) generally act in an additive manner with weak grammar and that most enhancers increase expression from a promoter by a mechanism that does not appear to involve specific TF–TF interactions. The enhancers themselves can be classified into three types: classical, closed chromatin and chromatin dependent. We also show that few TFs are strongly active in a cell, with most activities being similar between cell types. Individual TFs can have multiple gene regulatory activities, including chromatin opening and enhancing, promoting and determining transcription start site (TSS) activity, consistent with the view that the TF binding motif is the key atomic unit of gene expression. Analysis of massively parallel reporter assays measuring the transcriptional activity of DNA sequences indicates that most transcription factor (TF) activity is additive and does not rely on specific TF–TF interactions. Individual TFs can have different gene regulatory activities.
|
96
|
Krishnakumar R, Ruffing AM. OperonSEQer: A set of machine-learning algorithms with threshold voting for detection of operon pairs using short-read RNA-sequencing data. PLoS Comput Biol 2022; 18:e1009731. [PMID: 34986143 PMCID: PMC8765615 DOI: 10.1371/journal.pcbi.1009731] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 07/29/2021] [Revised: 01/18/2022] [Accepted: 12/07/2021] [Indexed: 11/19/2022] Open
Abstract
Operon prediction in prokaryotes is critical not only for understanding the regulation of endogenous gene expression, but also for exogenous targeting of genes using newly developed tools such as CRISPR-based gene modulation. A number of methods have used transcriptomics data to predict operons, based on the premise that contiguous genes in an operon will be expressed at similar levels. While promising results have been observed using these methods, most of them do not address uncertainty caused by technical variability between experiments, which is especially relevant when the amount of data available is small. In addition, many existing methods do not provide the flexibility to determine the stringency with which genes should be evaluated for being in an operon pair. We present OperonSEQer, a set of machine learning algorithms that uses the statistic and p-value from a non-parametric analysis of variance test (Kruskal-Wallis) to determine the likelihood that two adjacent genes are expressed from the same RNA molecule. We implement a voting system to allow users to choose the stringency of operon calls depending on whether their priority is high recall or high specificity. In addition, we provide the code so that users can retrain the algorithm and re-establish hyperparameters based on any data they choose, allowing for this method to be expanded as additional data is generated. We show that our approach detects operon pairs that are missed by current methods by comparing our predictions to publicly available long-read sequencing data. OperonSEQer therefore improves on existing methods in terms of accuracy, flexibility, and adaptability.
Bacteria and archaea, single-cell organisms collectively known as prokaryotes, live in all imaginable environments and comprise the majority of living organisms on this planet. Prokaryotes play a critical role in the homeostasis of multicellular organisms (such as animals and plants) and ecosystems. In addition, bacteria can be pathogenic and cause a variety of diseases in these same hosts and ecosystems. In short, understanding the biology and molecular functions of bacteria and archaea and devising mechanisms to engineer and optimize their properties are critical scientific endeavors with significant implications in healthcare, agriculture, manufacturing, and climate science among others. One major molecular difference between unicellular and multicellular organisms is the way they express genes: multicellular organisms make individual RNA molecules for each gene, while prokaryotes express operons (i.e., a group of genes coding functionally related proteins) in contiguous polycistronic RNA molecules. Understanding which genes exist within operons is critical for elucidating basic biology and for engineering organisms. In this work, we use a combination of statistical and machine learning-based methods to use next-generation sequencing data to predict operon structure across a range of prokaryotes. Our method provides an easily implemented, robust, accurate, and flexible way to determine operon structure in an organism-agnostic manner using readily available data.
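The Kruskal-Wallis statistic named in this abstract can be sketched in a few lines. This is only an illustration of the test itself, not OperonSEQer's code: the function names and the coverage values are hypothetical, the tie correction factor is omitted for brevity, and the p-value shortcut holds only for exactly three groups (chi-squared approximation with two degrees of freedom).

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the block of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def kruskal_wallis_h(groups: Sequence[Sequence[float]]) -> float:
    """Kruskal-Wallis H statistic (no tie correction applied)."""
    if any(len(group) == 0 for group in groups):
        raise ValueError("every group needs at least one observation")
    pooled = [value for group in groups for value in group]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total = 0.0
    start = 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    return 12.0 / (n * (n + 1)) * total - 3.0 * (n + 1)


def p_value_three_groups(h: float) -> float:
    """Chi-squared survival function with 2 degrees of freedom: exp(-h/2)."""
    return math.exp(-h / 2.0)


# Hypothetical read coverage over gene A, the intergenic gap, and gene B.
gene_a = [30.0, 32.0, 31.0, 29.0]
intergenic = [28.0, 33.0, 30.5, 31.5]
gene_b = [29.5, 32.5, 30.2, 31.2]

h = kruskal_wallis_h([gene_a, intergenic, gene_b])
print(h, p_value_three_groups(h))
```

A small statistic with a large p-value means the three coverage distributions are not distinguishable by rank, which is the kind of signal one might expect when adjacent genes share a transcript. How such values are turned into operon calls is left to the paper.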
Affiliation(s)
- Raga Krishnakumar
- Systems Biology Department, Sandia National Laboratories, Livermore, California, United States of America
- Anne M. Ruffing
- Molecular and Microbiology Department, Sandia National Laboratories, Albuquerque, New Mexico, United States of America
|
97
|
Gardiner LJ, Krishna R. Bluster or Lustre: Can AI Improve Crops and Plant Health? Plants (Basel, Switzerland) 2021; 10:plants10122707. [PMID: 34961177 PMCID: PMC8707749 DOI: 10.3390/plants10122707] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2021] [Revised: 11/24/2021] [Accepted: 12/06/2021] [Indexed: 06/14/2023]
Abstract
In a changing climate where future food security is a growing concern, researchers are exploring new methods and technologies in the effort to meet ambitious crop yield targets. The application of Artificial Intelligence (AI) including Machine Learning (ML) methods in this area has been proposed as a potential mechanism to support this. This review explores current research in the area to convey the state-of-the-art as to how AI/ML have been used to advance research, gain insights, and generally enable progress in this area. We address the question: can AI improve crops and plant health? We further discriminate the bluster from the lustre by identifying the key challenges that AI has been shown to address, balanced with the potential issues with its usage, and the key requisites for its success. Overall, we hope to raise awareness and, as a result, promote usage of AI-related approaches where they can have appropriate impact to improve practices in agricultural and plant sciences.
|
98
|
Dibaeinia P, Sinha S. Deciphering enhancer sequence using thermodynamics-based models and convolutional neural networks. Nucleic Acids Res 2021; 49:10309-10327. [PMID: 34508359 PMCID: PMC8501998 DOI: 10.1093/nar/gkab765] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/16/2021] [Revised: 08/18/2021] [Accepted: 08/25/2021] [Indexed: 11/18/2022] Open
Abstract
Deciphering the sequence-function relationship encoded in enhancers holds the key to interpreting non-coding variants and understanding mechanisms of transcriptomic variation. Several quantitative models exist for predicting enhancer function and underlying mechanisms; however, there has been no systematic comparison of these models characterizing their relative strengths and shortcomings. Here, we interrogated a rich data set of neuroectodermal enhancers in Drosophila, representing cis- and trans- sources of expression variation, with a suite of biophysical and machine learning models. We performed rigorous comparisons of thermodynamics-based models implementing different mechanisms of activation, repression and cooperativity. Moreover, we developed a convolutional neural network (CNN) model, called CoNSEPT, that learns enhancer 'grammar' in an unbiased manner. CoNSEPT is the first general-purpose CNN tool for predicting enhancer function in varying conditions, such as different cell types and experimental conditions, and we show that such complex models can suggest interpretable mechanisms. We found model-based evidence for mechanisms previously established for the studied system, including cooperative activation and short-range repression. The data also favored one hypothesized activation mechanism over another and suggested an intriguing role for a direct, distance-independent repression mechanism. Our modeling shows that while fundamentally different models can yield similar fits to data, they vary in their utility for mechanistic inference. CoNSEPT is freely available at: https://github.com/PayamDiba/CoNSEPT.
Affiliation(s)
- Payam Dibaeinia
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
- Saurabh Sinha
  - Department of Computer Science, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
  - Carl R. Woese Institute for Genomic Biology, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
  - Cancer Center at Illinois, University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
|
99
|
Avsec Ž, Agarwal V, Visentin D, Ledsam JR, Grabska-Barwinska A, Taylor KR, Assael Y, Jumper J, Kohli P, Kelley DR. Effective gene expression prediction from sequence by integrating long-range interactions. Nat Methods 2021; 18:1196-1203. [PMID: 34608324 PMCID: PMC8490152 DOI: 10.1038/s41592-021-01252-x] [Citation(s) in RCA: 455] [Impact Index Per Article: 113.8] [Received: 02/17/2021] [Accepted: 07/27/2021] [Indexed: 02/08/2023]
Abstract
How noncoding DNA determines gene expression in different cell types is a major unsolved problem, and critical downstream applications in human genetics depend on improved solutions. Here, we report substantially improved gene expression prediction accuracy from DNA sequences through the use of a deep learning architecture, called Enformer, that is able to integrate information from long-range interactions (up to 100 kb away) in the genome. This improvement yielded more accurate variant effect predictions on gene expression for both natural genetic variants and saturation mutagenesis measured by massively parallel reporter assays. Furthermore, Enformer learned to predict enhancer-promoter interactions directly from the DNA sequence competitively with methods that take direct experimental data as input. We expect that these advances will enable more effective fine-mapping of human disease associations and provide a framework to interpret cis-regulatory evolution.
|
100
|
Auerbach BJ, Hu J, Reilly MP, Li M. Applications of single-cell genomics and computational strategies to study common disease and population-level variation. Genome Res 2021; 31:1728-1741. [PMID: 34599006 PMCID: PMC8494214 DOI: 10.1101/gr.275430.121] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Indexed: 12/13/2022]
Abstract
The advent and rapid development of single-cell technologies have made it possible to study cellular heterogeneity at an unprecedented resolution and scale. Cellular heterogeneity underlies phenotypic differences among individuals, and studying it is an important step toward understanding the molecular mechanisms of disease. Single-cell technologies offer opportunities to characterize cellular heterogeneity from different angles, but linking cellular heterogeneity with disease phenotypes requires careful computational analysis. In this article, we review the current applications of single-cell methods in human disease studies and describe what we have learned so far from existing studies about human genetic variation. As single-cell technologies are becoming widely applicable in human disease studies, population-level studies have become a reality. We describe how to pursue and design these studies, particularly how to select study subjects, how to determine the number of cells to sequence per subject, and how to choose the sequencing depth per cell. We also discuss computational strategies for the analysis of single-cell data and describe how single-cell data can be integrated with bulk tissue data and data generated from genome-wide association studies. Finally, we point out open problems and future research directions.
Affiliation(s)
- Benjamin J Auerbach
  - Graduate Group in Genomics and Computational Biology, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania 19104, USA
- Jian Hu
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania 19104, USA
- Muredach P Reilly
  - Division of Cardiology, Department of Medicine, Columbia University Irving Medical Center, New York, New York 10032, USA
- Mingyao Li
  - Department of Biostatistics, Epidemiology and Informatics, University of Pennsylvania Perelman School of Medicine, Philadelphia, Pennsylvania 19104, USA
|