1
Wang Y, Jaime-Lara RB, Roy A, Sun Y, Liu X, Joseph PV. SeqEnhDL: sequence-based classification of cell type-specific enhancers using deep learning models. BMC Res Notes 2021; 14:104. [PMID: 33741075 PMCID: PMC7980595 DOI: 10.1186/s13104-021-05518-7] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 10/12/2020] [Accepted: 03/09/2021] [Indexed: 11/12/2022] Open
Abstract
Objective To address the challenge of computational identification of cell type-specific regulatory elements on a genome-wide scale. Results We propose SeqEnhDL, a deep learning framework for classifying cell type-specific enhancers based on sequence features. DNA sequences of “strong enhancer” chromatin states in nine cell types from the ENCODE project were retrieved to build and test enhancer classifiers. For any DNA sequence, positional k-mer (k = 5, 7, 9 and 11) fold changes relative to randomly selected non-coding sequences across each nucleotide position were used as features for deep learning models. Three deep learning models were implemented, including multi-layer perceptron (MLP), Convolutional Neural Network (CNN) and Recurrent Neural Network (RNN). All models in SeqEnhDL outperform state-of-the-art enhancer classifiers (including gkm-SVM and DanQ) in distinguishing cell type-specific enhancers from randomly selected non-coding sequences. Moreover, SeqEnhDL can directly discriminate enhancers from different cell types, which has not been achieved by other enhancer classifiers. Our analysis suggests that both enhancers and their tissue-specificity can be accurately identified based on their sequence features. SeqEnhDL is publicly available at https://github.com/wyp1125/SeqEnhDL. Supplementary Information The online version contains supplementary material available at 10.1186/s13104-021-05518-7.
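The positional k-mer fold-change features mentioned in this abstract can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the SeqEnhDL implementation: the function names, the pseudocount, and the log2 ratio are all hypothetical choices, and every sequence is assumed to have the same length.

```python
from collections import Counter
from math import log2


def positional_kmer_counts(seqs, k):
    """Count every k-mer at every start position across equal-length sequences."""
    length = len(seqs[0])
    counts = [Counter() for _ in range(length - k + 1)]
    for seq in seqs:
        for pos in range(length - k + 1):
            counts[pos][seq[pos:pos + k]] += 1
    return counts


def fold_change_features(seq, fg_counts, bg_counts, n_fg, n_bg, k, pseudo=1.0):
    """Return one log2 fold change per position for the k-mer found there.

    Assumption: fold change = positional frequency in the enhancer set divided
    by positional frequency in the background set, with a pseudocount to avoid
    division by zero. The published method may define this differently.
    """
    feats = []
    for pos in range(len(seq) - k + 1):
        kmer = seq[pos:pos + k]
        fg = (fg_counts[pos][kmer] + pseudo) / (n_fg + pseudo)
        bg = (bg_counts[pos][kmer] + pseudo) / (n_bg + pseudo)
        feats.append(log2(fg / bg))
    return feats


# Toy data with k = 3; the abstract reports k = 5, 7, 9 and 11.
enhancers = ["ACGTA", "ACGTT"]
background = ["TTTTT", "ACGTA"]
fg_counts = positional_kmer_counts(enhancers, 3)
bg_counts = positional_kmer_counts(background, 3)
features = fold_change_features(
    "ACGTA", fg_counts, bg_counts, len(enhancers), len(background), 3
)
```

A vector like `features` would then be one input row for a classifier; positive values mark positions where the observed k-mer is more common in the enhancer set than in the background.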
Affiliation(s)
- Yupeng Wang
  - BDX Research and Consulting LLC, Herndon, VA, 20171, USA
  - Division of Intramural Research, National Institute of Nursing Research, National Institutes of Health, Bethesda, MD, 20892, USA
- Rosario B Jaime-Lara
  - Division of Intramural Clinical and Biological Research (DICBR), National Institute on Alcohol Abuse and Alcoholism, National Institutes of Health, Bethesda, MD, 20892, USA
  - Division of Intramural Research, National Institute of Nursing Research, National Institutes of Health, Bethesda, MD, 20892, USA
- Abhrarup Roy
  - Division of Intramural Research, National Institute of Nursing Research, National Institutes of Health, Bethesda, MD, 20892, USA
- Ying Sun
  - BDX Research and Consulting LLC, Herndon, VA, 20171, USA
- Xinyue Liu
  - BDX Research and Consulting LLC, Herndon, VA, 20171, USA
- Paule V Joseph
  - Division of Intramural Clinical and Biological Research (DICBR), National Institute on Alcohol Abuse and Alcoholism, National Institutes of Health, Bethesda, MD, 20892, USA
  - Division of Intramural Research, National Institute of Nursing Research, National Institutes of Health, Bethesda, MD, 20892, USA
2
Tobias IC, Abatti LE, Moorthy SD, Mullany S, Taylor T, Khader N, Filice MA, Mitchell JA. Transcriptional enhancers: from prediction to functional assessment on a genome-wide scale. Genome 2020; 64:426-448. [PMID: 32961076 DOI: 10.1139/gen-2020-0104] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/20/2022]
Abstract
Enhancers are cis-regulatory sequences located distally to target genes. These sequences consolidate developmental and environmental cues to coordinate gene expression in a tissue-specific manner. Enhancer function and tissue specificity depend on the expressed set of transcription factors, which recognize binding sites and recruit cofactors that regulate local chromatin organization and gene transcription. Unlike other genomic elements, enhancers are challenging to identify because they function independently of orientation, are often distant from their promoters, have poorly defined boundaries, and display no reading frame. In addition, there are no defined genetic or epigenetic features that are unambiguously associated with enhancer activity. Over recent years there have been developments in both empirical assays and computational methods for enhancer prediction. We review genome-wide tools, CRISPR advancements, and high-throughput screening approaches that have improved our ability to both observe and manipulate enhancers in vitro at the level of primary genetic sequences, chromatin states, and spatial interactions. We also highlight contemporary animal models and their importance to enhancer validation. Together, these experimental systems and techniques complement one another and broaden our understanding of enhancer function in development, evolution, and disease.
Affiliation(s)
- Ian C Tobias
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Luis E Abatti
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Sakthi D Moorthy
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Shanelle Mullany
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Tiegh Taylor
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Nawrah Khader
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Mario A Filice
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
- Jennifer A Mitchell
  - Department of Cell and Systems Biology, University of Toronto, Toronto, ON, M5S 3G5, Canada
3
Colbran LL, Chen L, Capra JA. Sequence Characteristics Distinguish Transcribed Enhancers from Promoters and Predict Their Breadth of Activity. Genetics 2019; 211:1205-1217. [PMID: 30696717 PMCID: PMC6456323 DOI: 10.1534/genetics.118.301895] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 12/27/2018] [Accepted: 01/27/2019] [Indexed: 01/08/2023] Open
Abstract
Enhancers and promoters both regulate gene expression by recruiting transcription factors (TFs); however, the degree to which enhancer vs. promoter activity is due to differences in their sequences or to genomic context is the subject of ongoing debate. We examined this question by analyzing the sequences of thousands of transcribed enhancers and promoters from hundreds of cellular contexts previously identified by cap analysis of gene expression. Support vector machine classifiers trained on counts of all possible 6-bp-long sequences (6-mers) were able to accurately distinguish promoters from enhancers and distinguish their breadth of activity across tissues. Classifiers trained to predict enhancer activity also performed well when applied to promoter prediction tasks, but promoter-trained classifiers performed poorly on enhancers. This suggests that the learned sequence patterns predictive of enhancer activity generalize to promoters, but not vice versa. Our classifiers also indicate that there are functionally relevant differences in enhancer and promoter GC content beyond the influence of CpG islands. Furthermore, sequences characteristic of broad promoter or broad enhancer activity matched different TFs, with predicted ETS- and RFX-binding sites indicative of promoters, and AP-1 sites indicative of enhancers. Finally, we evaluated the ability of our models to distinguish enhancers and promoters defined by histone modifications. Separating these classes was substantially more difficult, and this difference may contribute to ongoing debates about the similarity of enhancers and promoters. In summary, our results suggest that high-confidence transcribed enhancers and promoters can largely be distinguished based on biologically relevant sequence properties.
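The 6-mer count features this abstract refers to can be illustrated with a short sketch. The function name and the choice to skip windows containing non-ACGT characters are assumptions made for this example; the authors' own feature extraction is not described in the abstract and may differ.

```python
from itertools import product


def kmer_count_vector(seq, k=6):
    """Return counts of all 4**k possible k-mers in `seq`, in lexicographic order.

    Windows containing characters other than A, C, G, T (for example N) are
    skipped; this handling is an assumption, not taken from the paper.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    vec = [0] * len(index)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        j = index.get(seq[i:i + k])
        if j is not None:
            vec[j] += 1
    return vec


# With k = 6 the vector has 4**6 = 4096 entries, one per possible 6-mer.
vector = kmer_count_vector("ACGTACGTNACGTACG")
```

Each sequence then becomes a fixed-length numeric vector regardless of its own length, which is the form a support vector machine or similar classifier expects.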
Affiliation(s)
- Laura L Colbran
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, Tennessee 37235
- Ling Chen
  - Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee 37235
- John A Capra
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, Tennessee 37235
  - Department of Biological Sciences, Vanderbilt University, Nashville, Tennessee 37235
  - Center for Structural Biology, Departments of Biomedical Informatics and Computer Science, Vanderbilt University, Nashville, Tennessee 37235
4
Chen L, Fish AE, Capra JA. Prediction of gene regulatory enhancers across species reveals evolutionarily conserved sequence properties. PLoS Comput Biol 2018; 14:e1006484. [PMID: 30286077 PMCID: PMC6191148 DOI: 10.1371/journal.pcbi.1006484] [Citation(s) in RCA: 48] [Impact Index Per Article: 6.9] [Received: 03/03/2018] [Revised: 10/16/2018] [Accepted: 09/02/2018] [Indexed: 12/30/2022] Open
Abstract
Genomic regions with gene regulatory enhancer activity turn over rapidly across mammals. In contrast, gene expression patterns and transcription factor binding preferences are largely conserved between mammalian species. Based on this conservation, we hypothesized that enhancers active in different mammals would exhibit conserved sequence patterns in spite of their different genomic locations. To investigate this hypothesis, we evaluated the extent to which sequence patterns that are predictive of enhancers in one species are predictive of enhancers in other mammalian species by training and testing two types of machine learning models. We trained support vector machine (SVM) and convolutional neural network (CNN) classifiers to distinguish enhancers defined by histone marks from the genomic background based on DNA sequence patterns in human, macaque, mouse, dog, cow, and opossum. The classifiers accurately identified many adult liver, developing limb, and developing brain enhancers, and the CNNs outperformed the SVMs. Furthermore, classifiers trained in one species and tested in another performed nearly as well as classifiers trained and tested on the same species. We observed similar cross-species conservation when applying the models to human and mouse enhancers validated in transgenic assays. This indicates that many short sequence patterns predictive of enhancers are largely conserved. The sequence patterns most predictive of enhancers in each species matched the binding motifs for a common set of TFs enriched for expression in relevant tissues, supporting the biological relevance of the learned features. Thus, despite the rapid change of active enhancer locations between mammals, cross-species enhancer prediction is often possible. Our results suggest that short sequence patterns encoding enhancer activity have been maintained across more than 180 million years of mammalian evolution.
Affiliation(s)
- Ling Chen
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America
- Alexandra E. Fish
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN, United States of America
- John A. Capra
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN, United States of America
  - Departments of Biomedical Informatics and Computer Science, Center for Structural Biology, Vanderbilt University, Nashville, TN, United States of America
5
Abstract
BACKGROUND Studies have shown that enhancers are significant regulatory elements that play crucial roles in the regulation of gene expression. Because enhancers act independently of their orientation and distance relative to their target genes, accurately predicting distal enhancers is challenging. In recent years, with the development of high-throughput ChIP-seq technologies, several computational techniques have emerged to predict enhancers using epigenetic or genomic features. Nevertheless, the inconsistency of computational models across different cell lines and their unsatisfactory prediction performance call for further research in this area. RESULTS Here, we propose a new Deep Belief Network (DBN) based computational method for enhancer prediction, called EnhancerDBN. This method combines diverse features: DNA sequence compositional features, DNA methylation, and histone modifications. Our computational results indicate that 1) EnhancerDBN outperforms 13 existing methods in prediction, and 2) GC content and DNA methylation can serve as relevant features for enhancer prediction. CONCLUSION Deep learning is effective in boosting the performance of enhancer prediction.
Affiliation(s)
- Hongda Bu
  - Department of Computer Science and Technology, Tongji University, 4800 Cao’an Road, Shanghai, 201804 China
- Yanglan Gan
  - School of Computer, Donghua University, 2999 Renming North Road, Shanghai, 201620 China
- Yang Wang
  - School of Software, Jiangxi Normal University, 99 Ziyang Avenue, Jiangxi, 330022 China
- Shuigeng Zhou
  - Shanghai Key Lab of Intelligent Information Processing, and School of Computer Science, Fudan University, 220 Handan Road, Shanghai, 200433 China
  - The Bioinformatics Lab at Changzhou NO. 7 People’s Hospital, Changzhou, Jiangsu, 213011 China
- Jihong Guan
  - Department of Computer Science and Technology, Tongji University, 4800 Cao’an Road, Shanghai, 201804 China
6
Monti R, Barozzi I, Osterwalder M, Lee E, Kato M, Garvin TH, Plajzer-Frick I, Pickle CS, Akiyama JA, Afzal V, Beerenwinkel N, Dickel DE, Visel A, Pennacchio LA. Limb-Enhancer Genie: An accessible resource of accurate enhancer predictions in the developing limb. PLoS Comput Biol 2017; 13:e1005720. [PMID: 28827824 PMCID: PMC5578682 DOI: 10.1371/journal.pcbi.1005720] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 02/27/2017] [Revised: 08/31/2017] [Accepted: 08/03/2017] [Indexed: 11/18/2022] Open
Abstract
Epigenomic mapping of enhancer-associated chromatin modifications facilitates the genome-wide discovery of tissue-specific enhancers in vivo. However, reliance on single chromatin marks leads to high rates of false-positive predictions. More sophisticated, integrative methods have been described, but commonly suffer from limited accessibility to the resulting predictions and reduced biological interpretability. Here we present the Limb-Enhancer Genie (LEG), a collection of highly accurate, genome-wide predictions of enhancers in the developing limb, available through a user-friendly online interface. We predict limb enhancers using a combination of >50 published limb-specific datasets and clusters of evolutionarily conserved transcription factor binding sites, taking advantage of the patterns observed at previously in vivo validated elements. By combining different statistical models, our approach outperforms current state-of-the-art methods and provides interpretable measures of feature importance. Our results indicate that including a previously unappreciated score that quantifies tissue-specific nuclease accessibility significantly improves prediction performance. We demonstrate the utility of our approach through in vivo validation of newly predicted elements. Moreover, we describe general features that can guide the type of datasets to include when predicting tissue-specific enhancers genome-wide, while providing an accessible resource to the general biological community and facilitating the functional interpretation of genetic studies of limb malformations.
Affiliation(s)
- Remo Monti
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
  - Joint Genome Institute, U.S. Department of Energy, Walnut Creek, California, United States of America
- Iros Barozzi
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Marco Osterwalder
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Elizabeth Lee
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Momoe Kato
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Tyler H. Garvin
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Ingrid Plajzer-Frick
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Catherine S. Pickle
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Jennifer A. Akiyama
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Veena Afzal
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Niko Beerenwinkel
  - Department of Biosystems Science and Engineering, ETH Zurich, Basel, Switzerland
- Diane E. Dickel
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
- Axel Visel
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
  - Joint Genome Institute, U.S. Department of Energy, Walnut Creek, California, United States of America
  - School of Natural Sciences, University of California, Merced, California, United States of America
- Len A. Pennacchio
  - Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
  - Joint Genome Institute, U.S. Department of Energy, Walnut Creek, California, United States of America
7
Colbran LL, Chen L, Capra JA. Short DNA sequence patterns accurately identify broadly active human enhancers. BMC Genomics 2017; 18:536. [PMID: 28716036 PMCID: PMC5512948 DOI: 10.1186/s12864-017-3934-9] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Received: 12/05/2016] [Accepted: 07/09/2017] [Indexed: 12/25/2022] Open
Abstract
Background Enhancers are DNA regulatory elements that influence gene expression. There is substantial diversity in enhancers’ activity patterns: some enhancers drive expression in a single cellular context, while others are active across many. Sequence characteristics, such as transcription factor (TF) binding motifs, influence the activity patterns of regulatory sequences; however, the regulatory logic through which specific sequences drive enhancer activity patterns is poorly understood. Recent analysis of Drosophila enhancers suggested that short dinucleotide repeat motifs (DRMs) are general enhancer sequence features that drive broad regulatory activity. However, it is not known whether the regulatory role of DRMs is conserved across species. Results We performed a comprehensive analysis of the relationship between short DNA sequence patterns, including DRMs, and human enhancer activity in 38,538 enhancers across 411 different contexts. In a machine-learning framework, the occurrence patterns of short sequence motifs accurately predicted broadly active human enhancers. However, DRMs alone were weakly predictive of broad enhancer activity in humans and showed different enrichment patterns than in Drosophila. In general, GC-rich sequence motifs were significantly associated with broad enhancer activity, and consistent with this enrichment, broadly active human TFs recognize GC-rich motifs. Conclusions Our results reveal the importance of specific sequence motifs in broadly active human enhancers, demonstrate the lack of evolutionary conservation of the role of DRMs, and provide a computational framework for investigating the logic of enhancer sequences. Electronic supplementary material The online version of this article (doi:10.1186/s12864-017-3934-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Laura L Colbran
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN, 37235, USA
- Ling Chen
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, 37235, USA
- John A Capra
  - Vanderbilt Genetics Institute, Vanderbilt University, Nashville, TN, 37235, USA
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, 37235, USA
  - Center for Structural Biology, Departments of Biomedical Informatics and Computer Science, Vanderbilt University, Nashville, TN, 37235, USA
8
Grice J, Noyvert B, Doglio L, Elgar G. A Simple Predictive Enhancer Syntax for Hindbrain Patterning Is Conserved in Vertebrate Genomes. PLoS One 2015; 10:e0130413. [PMID: 26131856 PMCID: PMC4489388 DOI: 10.1371/journal.pone.0130413] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 04/02/2015] [Accepted: 05/19/2015] [Indexed: 12/17/2022] Open
Abstract
Background Determining the function of regulatory elements is fundamental for our understanding of development, disease and evolution. However, the sequence features that mediate these functions are often unclear and the prediction of tissue-specific expression patterns from sequence alone is non-trivial. Previous functional studies have demonstrated a link between PBX-HOX and MEIS/PREP binding interactions and hindbrain enhancer activity, but the defining grammar of these sites, if any exists, has remained elusive. Results Here, we identify a shared sequence signature (syntax) within a heterogeneous set of conserved vertebrate hindbrain enhancers composed of spatially co-occurring PBX-HOX and MEIS/PREP transcription factor binding motifs. We use this syntax to accurately predict hindbrain enhancers in 89% of cases (67/75 predicted elements) from a set of conserved non-coding elements (CNEs). Furthermore, mutagenesis of the sites abolishes activity or generates ectopic expression, demonstrating their requirement for segmentally restricted enhancer activity in the hindbrain. We refine and use our syntax to predict over 3,000 hindbrain enhancers across the human genome. These sequences tend to be located near developmental transcription factors and are enriched in known hindbrain activating elements, demonstrating the predictive power of this simple model. Conclusion Our findings support the theory that hundreds of CNEs, and perhaps thousands of regions across the human genome, function to coordinate gene expression in the developing hindbrain. We speculate that deeply conserved sequences of this kind contributed to the co-option of new genes into the hindbrain gene regulatory network during early vertebrate evolution by linking patterns of hox expression to downstream genes involved in segmentation and patterning, and evolutionarily newer instances may have continued to contribute to lineage-specific elaboration of the hindbrain.
Affiliation(s)
- Joseph Grice
  - The Francis Crick Institute Mill Hill Laboratory, The Ridgeway, Mill Hill, London, NW7 1AA, United Kingdom
- Boris Noyvert
  - The Francis Crick Institute Mill Hill Laboratory, The Ridgeway, Mill Hill, London, NW7 1AA, United Kingdom
- Laura Doglio
  - The Francis Crick Institute Mill Hill Laboratory, The Ridgeway, Mill Hill, London, NW7 1AA, United Kingdom
- Greg Elgar
  - The Francis Crick Institute Mill Hill Laboratory, The Ridgeway, Mill Hill, London, NW7 1AA, United Kingdom
9
Dubchak I, Balasubramanian S, Wang S, Meyden C, Sulakhe D, Poliakov A, Börnigen D, Xie B, Taylor A, Ma J, Paciorkowski AR, Mirzaa GM, Dave P, Agam G, Xu J, Al-Gazali L, Mason CE, Ross ME, Maltsev N, Gilliam TC. An integrative computational approach for prioritization of genomic variants. PLoS One 2014; 9:e114903. [PMID: 25506935 PMCID: PMC4266634 DOI: 10.1371/journal.pone.0114903] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Received: 07/01/2014] [Accepted: 11/15/2014] [Indexed: 12/27/2022] Open
Abstract
An essential step in the discovery of molecular mechanisms contributing to disease phenotypes, and in efficient experimental planning, is the development of weighted hypotheses that estimate the functional effects of sequence variants discovered by high-throughput genomics. With the increasing specialization of bioinformatics resources, creating analytical workflows that seamlessly integrate data and bioinformatics tools developed by multiple groups becomes inevitable. Here we present a case study using a distributed analytical environment that integrates four complementary specialized resources, namely the Lynx platform, VISTA RViewer, the Developmental Brain Disorders Database (DBDB), and the RaptorX server, to identify high-confidence candidate genes contributing to the pathogenesis of spina bifida. The analysis resulted in the prediction and validation of deleterious mutations in the SLC19A placental transporter in mothers of the affected children; these mutations cause narrowing of the outlet channel and therefore lead to a reduced folate permeation rate. The described approach also enabled correct identification of several genes previously shown to contribute to the pathogenesis of spina bifida, and suggested additional genes for experimental validation. The study demonstrates that the seamless integration of bioinformatics resources enables fast and efficient prioritization and characterization of genomic factors and molecular networks contributing to the phenotypes of interest.
Affiliation(s)
- Inna Dubchak
  - Genomics Division, Lawrence Berkeley National Laboratory, Berkeley, California, United States of America
  - Department of Energy Joint Genome Institute, Walnut Creek, California, United States of America
- Sandhya Balasubramanian
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Sheng Wang
  - Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America
- Cem Meyden
  - Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, United States of America
  - The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York, United States of America
  - Feil Family Brain and Mind Research Institute, Weill Cornell Medical College, New York, New York, United States of America
- Dinanath Sulakhe
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
  - Computation Institute, University of Chicago/Argonne National Laboratory, Chicago, Illinois, United States of America
- Alexander Poliakov
  - Department of Energy Joint Genome Institute, Walnut Creek, California, United States of America
- Daniela Börnigen
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
  - Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America
- Bingqing Xie
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
  - Department of Computer Science, Illinois Institute of Technology, Chicago, Illinois, United States of America
- Andrew Taylor
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Jianzhu Ma
  - Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America
- Alex R. Paciorkowski
  - Departments of Neurology, Pediatrics, and Biomedical Genetics and Center for Neural Development and Disease, University of Rochester Medical Center, Rochester, New York, United States of America
- Ghayda M. Mirzaa
  - Seattle Children's Research Institute and Department of Pediatrics, University of Washington, Seattle, Washington, United States of America
- Paul Dave
  - Computation Institute, University of Chicago/Argonne National Laboratory, Chicago, Illinois, United States of America
- Gady Agam
  - Department of Computer Science, Illinois Institute of Technology, Chicago, Illinois, United States of America
- Jinbo Xu
  - Toyota Technological Institute at Chicago, Chicago, Illinois, United States of America
- Lihadh Al-Gazali
  - Department of Pediatrics, Faculty of Medicine and Health Sciences, United Arab Emirates University, Al-Ain, UAE
- Christopher E. Mason
  - Department of Physiology and Biophysics, Weill Cornell Medical College, New York, New York, United States of America
  - The HRH Prince Alwaleed Bin Talal Bin Abdulaziz Alsaud Institute for Computational Biomedicine, Weill Cornell Medical College, New York, New York, United States of America
  - Feil Family Brain and Mind Research Institute, Weill Cornell Medical College, New York, New York, United States of America
- M. Elizabeth Ross
  - Laboratory of Neurogenetics and Development, Weill Cornell Medical College, New York, New York, United States of America
- Natalia Maltsev
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
  - Computation Institute, University of Chicago/Argonne National Laboratory, Chicago, Illinois, United States of America
- T. Conrad Gilliam
  - Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
  - Computation Institute, University of Chicago/Argonne National Laboratory, Chicago, Illinois, United States of America
10
Integrating diverse datasets improves developmental enhancer prediction. PLoS Comput Biol 2014; 10:e1003677. [PMID: 24967590 PMCID: PMC4072507 DOI: 10.1371/journal.pcbi.1003677] [Citation(s) in RCA: 113] [Impact Index Per Article: 10.3] [Received: 08/15/2013] [Accepted: 05/06/2014] [Indexed: 01/02/2023] Open
Abstract
Gene-regulatory enhancers have been identified using various approaches, including evolutionary conservation, regulatory protein binding, chromatin modifications, and DNA sequence motifs. To integrate these different approaches, we developed EnhancerFinder, a two-step method for distinguishing developmental enhancers from the genomic background and then predicting their tissue specificity. EnhancerFinder uses a multiple kernel learning approach to integrate DNA sequence motifs, evolutionary patterns, and diverse functional genomics datasets from a variety of cell types. In contrast with prediction approaches that define enhancers based on histone marks or p300 sites from a single cell line, we trained EnhancerFinder on hundreds of experimentally verified human developmental enhancers from the VISTA Enhancer Browser. We comprehensively evaluated EnhancerFinder using cross validation and found that our integrative method improves the identification of enhancers over approaches that consider a single type of data, such as sequence motifs, evolutionary conservation, or the binding of enhancer-associated proteins. We find that VISTA enhancers active in embryonic heart are easier to identify than enhancers active in several other embryonic tissues, likely due to their uniquely high GC content. We applied EnhancerFinder to the entire human genome and predicted 84,301 developmental enhancers and their tissue specificity. These predictions provide specific functional annotations for large amounts of human non-coding DNA, and are significantly enriched near genes with annotated roles in their predicted tissues and lead SNPs from genome-wide association studies. We demonstrate the utility of EnhancerFinder predictions through in vivo validation of novel embryonic gene regulatory enhancers from three developmental transcription factor loci. 
Our genome-wide developmental enhancer predictions are freely available as a UCSC Genome Browser track, which we hope will enable researchers to further investigate questions in developmental biology.
The human genome contains an immense amount of non-protein-coding DNA with unknown function. Some of this DNA regulates when, where, and at what levels genes are active during development. Enhancers, one type of regulatory element, are short stretches of DNA that can act as “switches” to turn a gene on or off at specific times in specific cells or tissues. Understanding where in the genome enhancers are located can provide insight into the genetic basis of development and disease. Enhancers are hard to identify, but clues about their locations are found in different types of data including DNA sequence, evolutionary history, and where proteins bind to DNA. Here, we introduce a new tool, called EnhancerFinder, which combines these data to predict the location and activity of enhancers during embryonic development. We trained EnhancerFinder on a large set of functionally validated human enhancers, and it proved to be very accurate. We used EnhancerFinder to predict tens of thousands of enhancers in the human genome and validated several of the predictions near three important developmental genes in mouse or zebrafish. EnhancerFinder's predictions will be useful in understanding functional regions hidden in the vast amounts of human non-coding DNA.
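As a rough illustration of the kernel-integration idea described in this abstract, the sketch below builds one kernel (similarity) matrix per data type and merges them by a weighted sum. The feature names, toy values, and fixed weights are assumptions made for illustration only; a multiple kernel learning method learns the kernel weights during training, which this sketch does not attempt, and it is not EnhancerFinder's implementation.

```python
from typing import Dict, List

Matrix = List[List[float]]


def linear_kernel(features: List[List[float]]) -> Matrix:
    """Gram matrix of dot products between every pair of feature vectors."""
    return [[sum(a * b for a, b in zip(x, y)) for y in features] for x in features]


def combine_kernels(kernels: Dict[str, Matrix], weights: Dict[str, float]) -> Matrix:
    """Weighted sum of kernel matrices.

    Non-negative weights keep the sum a valid (positive semidefinite) kernel.
    """
    if any(w < 0 for w in weights.values()):
        raise ValueError("kernel weights must be non-negative")
    n = len(next(iter(kernels.values())))
    combined = [[0.0] * n for _ in range(n)]
    for name, kernel in kernels.items():
        w = weights[name]
        for i in range(n):
            for j in range(n):
                combined[i][j] += w * kernel[i][j]
    return combined


# Hypothetical toy features for three candidate regions (not real data).
motif_counts = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
conservation = [[0.5], [0.1], [0.9]]

kernels = {
    "motifs": linear_kernel(motif_counts),
    "conservation": linear_kernel(conservation),
}
weights = {"motifs": 0.5, "conservation": 0.5}  # fixed here; learned in real MKL
combined = combine_kernels(kernels, weights)
```

The combined matrix could then be passed to any kernel-based classifier that accepts a precomputed kernel; which classifier to use is left open here.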
|
11
|
Dissection of thousands of cell type-specific enhancers identifies dinucleotide repeat motifs as general enhancer features. Genome Res 2014; 24:1147-56. [PMID: 24714811 PMCID: PMC4079970 DOI: 10.1101/gr.169243.113] [Citation(s) in RCA: 99] [Impact Index Per Article: 9.0] [Indexed: 11/30/2022]
Abstract
Gene expression is determined by genomic elements called enhancers, which contain short motifs bound by different transcription factors (TFs). However, how enhancer sequences and TF motifs relate to enhancer activity is unknown, and general sequence requirements for enhancers or comprehensive sets of important enhancer sequence elements have remained elusive. Here, we computationally dissect thousands of functional enhancer sequences from three different Drosophila cell lines. We find that the enhancers display distinct cis-regulatory sequence signatures, which are predictive of the enhancers’ cell type-specific or broad activities. These signatures contain transcription factor motifs and a novel class of enhancer sequence elements, dinucleotide repeat motifs (DRMs). DRMs are highly enriched in enhancers, particularly in enhancers that are broadly active across different cell types. We experimentally validate the importance of the identified TF motifs and DRMs for enhancer function and show that they can be sufficient to create an active enhancer de novo from a nonfunctional sequence. The function of DRMs as a novel class of general enhancer features that are also enriched in human regulatory regions might explain their implication in several diseases and provides important insights into gene regulation.
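To make the notion of a dinucleotide repeat concrete, the following sketch scans a DNA string for runs of a repeated 2-mer such as GAGAGA. The function name, the `min_units` threshold, and the choice to skip homopolymer runs are assumptions for illustration; the abstract does not give the study's actual motif definition or scoring.

```python
import re
from typing import List, Tuple


def find_dinucleotide_repeats(seq: str, min_units: int = 3) -> List[Tuple[int, str, int]]:
    """Return (start, unit, n_units) for each run of a repeated dinucleotide."""
    seq = seq.upper()
    # ([ACGT]{2}) captures a 2-mer; \1{k,} then requires at least k more copies.
    pattern = re.compile(r"([ACGT]{2})\1{%d,}" % (min_units - 1))
    hits = []
    for match in pattern.finditer(seq):
        unit = match.group(1)
        if unit[0] == unit[1]:
            continue  # skip homopolymer runs such as AAAAAA
        hits.append((match.start(), unit, len(match.group(0)) // 2))
    return hits


# Toy sequence assembled from known pieces so the repeat positions are obvious.
example = "TT" + "GA" * 4 + "TT" + "CA" * 5 + "GG"
print(find_dinucleotide_repeats(example))
```

Counting such runs in enhancer sequences versus background sequences would be one simple way to look for enrichment, under whatever repeat definition is chosen.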
|
12
|
Transcriptional enhancers: from properties to genome-wide predictions. Nat Rev Genet 2014; 15:272-86. [PMID: 24614317 DOI: 10.1038/nrg3682] [Citation(s) in RCA: 960] [Impact Index Per Article: 87.3] [Indexed: 02/08/2023]
Abstract
Cellular development, morphology and function are governed by precise patterns of gene expression. These are established by the coordinated action of genomic regulatory elements known as enhancers or cis-regulatory modules. More than 30 years after the initial discovery of enhancers, many of their properties have been elucidated; however, despite major efforts, we only have an incomplete picture of enhancers in animal genomes. In this Review, we discuss how properties of enhancer sequences and chromatin are used to predict enhancers in genome-wide studies. We also cover recently developed high-throughput methods that allow the direct testing and identification of enhancers on the basis of their activity. Finally, we discuss recent technological advances and current challenges in the field of regulatory genomics.
|
13
|
Schulte D, Frank D. TALE transcription factors during early development of the vertebrate brain and eye. Dev Dyn 2013; 243:99-116. [DOI: 10.1002/dvdy.24030] [Citation(s) in RCA: 25] [Impact Index Per Article: 2.1] [Received: 05/01/2013] [Revised: 07/11/2013] [Accepted: 07/13/2013] [Indexed: 12/25/2022]
Affiliation(s)
- Dorothea Schulte
  - Institute of Neurology (Edinger Institute); University Hospital Frankfurt, J.W. Goethe University; Frankfurt Germany
- Dale Frank
  - Department of Biochemistry; The Rappaport Family Institute for Research in the Medical Sciences, Faculty of Medicine, Technion-Israel Institute of Technology; Haifa Israel
|
14
|
Taher L, Smith RP, Kim MJ, Ahituv N, Ovcharenko I. Sequence signatures extracted from proximal promoters can be used to predict distal enhancers. Genome Biol 2013; 14:R117. [PMID: 24156763 PMCID: PMC3983659 DOI: 10.1186/gb-2013-14-10-r117] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.4] [Received: 06/05/2013] [Accepted: 10/24/2013] [Indexed: 01/22/2023]
Abstract
BACKGROUND Gene expression is controlled by proximal promoters and distal regulatory elements such as enhancers. While the activity of some promoters can be invariant across tissues, enhancers tend to be highly tissue-specific. RESULTS We compiled sets of tissue-specific promoters based on gene expression profiles of 79 human tissues and cell types. Putative transcription factor binding sites within each set of sequences were used to train a support vector machine classifier capable of distinguishing tissue-specific promoters from control sequences. We obtained reliable classifiers for 92% of the tissues, with an area under the receiver operating characteristic curve between 60% (for subthalamic nucleus promoters) and 98% (for heart promoters). We next used these classifiers to identify tissue-specific enhancers, scanning distal non-coding sequences in the loci of the 200 most highly and lowly expressed genes. Thirty percent of reliable classifiers produced consistent enhancer predictions, with significantly higher densities in the loci of the most highly expressed compared to lowly expressed genes. Liver enhancer predictions were assessed in vivo using the hydrodynamic tail vein injection assay. Fifty-eight percent of the predictions yielded significant enhancer activity in the mouse liver, whereas a control set of five sequences was completely negative. CONCLUSIONS We conclude that promoters of tissue-specific genes often contain unambiguous tissue-specific signatures that can be learned and used for the de novo prediction of enhancers.
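The classifier performance quoted above is an area under the receiver operating characteristic curve (AUC). As a minimal sketch of what that number measures, the function below computes AUC as the fraction of positive/negative pairs in which the positive example receives the higher score, with ties counted as half. The labels and scores are invented toy values, not data from the study.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a random positive outscores a random negative.

    Ties between a positive and a negative score count as 0.5.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = tissue-specific promoter, 0 = control sequence (made-up scores).
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
print(roc_auc(labels, scores))  # 8 of 9 pairs are ranked correctly
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation, which puts the reported 60% to 98% range in context. The pairwise loop is quadratic in the number of examples; a rank-based computation would be preferable for large datasets.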
Affiliation(s)
- Leila Taher
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
  - Institute for Biostatistics and Informatics in Medicine and Ageing Research, University of Rostock, Rostock, 18057, Germany
- Robin P Smith
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, 94158, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA, 94158, USA
- Mee J Kim
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, 94158, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA, 94158, USA
- Nadav Ahituv
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, 94158, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA, 94158, USA
- Ivan Ovcharenko
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
|