51
Minnoye L, Taskiran II, Mauduit D, Fazio M, Van Aerschot L, Hulselmans G, Christiaens V, Makhzami S, Seltenhammer M, Karras P, Primot A, Cadieu E, van Rooijen E, Marine JC, Egidy G, Ghanem GE, Zon L, Wouters J, Aerts S. Cross-species analysis of enhancer logic using deep learning. Genome Res 2020; 30:1815-1834. [PMID: 32732264 PMCID: PMC7706731 DOI: 10.1101/gr.260844.120] [Received: 01/30/2020] [Accepted: 06/15/2020] [Indexed: 12/23/2022]
Abstract
Deciphering the genomic regulatory code of enhancers is a key challenge in biology because this code underlies cellular identity. A better understanding of how enhancers work will improve the interpretation of noncoding genome variation and empower the generation of cell type-specific drivers for gene therapy. Here, we explore the combination of deep learning and cross-species chromatin accessibility profiling to build explainable enhancer models. We apply this strategy to decipher the enhancer code in melanoma, a relevant case study owing to the presence of distinct melanoma cell states. We trained and validated a deep learning model, called DeepMEL, using chromatin accessibility data of 26 melanoma samples across six different species. We show the accuracy of DeepMEL predictions on the CAGI5 challenge, where it significantly outperforms existing models on the melanoma enhancer of IRF4. Next, we exploit DeepMEL to analyze enhancer architectures and identify accurate transcription factor binding sites for the core regulatory complexes in the two different melanoma states, with distinct roles for each transcription factor, in terms of nucleosome displacement or enhancer activation. Finally, DeepMEL identifies orthologous enhancers across distantly related species, where sequence alignment fails, and the model highlights specific nucleotide substitutions that underlie enhancer turnover. DeepMEL can be used from the Kipoi database to predict and optimize candidate enhancers and to prioritize enhancer mutations. In addition, our computational strategy can be applied to other cancer or normal cell types.
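The idea of prioritizing enhancer mutations with a sequence model can be illustrated with a minimal in silico saturation mutagenesis sketch. This is not DeepMEL or its Kipoi interface: `toy_score` is a hypothetical stand-in that simply counts occurrences of an assumed motif word, and the example sequence is invented. Any trained model that maps a sequence to a score could in principle be substituted for it.

```python
from typing import Callable, Dict, Tuple

BASES = "ACGT"


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained enhancer model.

    Counts exact occurrences of an assumed motif word; a real model
    would return a predicted accessibility or activity score instead.
    """
    motif = "CACGTG"  # assumed motif, chosen only for illustration
    return float(sum(seq[i:i + len(motif)] == motif
                     for i in range(len(seq) - len(motif) + 1)))


def saturation_mutagenesis(
    seq: str, score: Callable[[str], float]
) -> Dict[Tuple[int, str], float]:
    """Score every single-nucleotide substitution relative to the reference."""
    ref_score = score(seq)
    deltas: Dict[Tuple[int, str], float] = {}
    for pos, ref_base in enumerate(seq):
        for alt in BASES:
            if alt == ref_base:
                continue
            mutant = seq[:pos] + alt + seq[pos + 1:]
            deltas[(pos, alt)] = score(mutant) - ref_score
    return deltas


enhancer = "TTGACACGTGAATT"  # invented sequence containing the motif once
deltas = saturation_mutagenesis(enhancer, toy_score)

# Positions where at least one substitution lowers the score
sensitive_positions = sorted({pos for (pos, _), d in deltas.items() if d < 0})
print(sensitive_positions)
```

Substitutions that lower the score cluster on the positions covered by the motif, which is the kind of signal used to nominate candidate functional nucleotides.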
Affiliation(s)
- Liesbeth Minnoye
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Ibrahim Ihsan Taskiran
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- David Mauduit
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Maurizio Fazio
  - Howard Hughes Medical Institute, Stem Cell Program and the Division of Pediatric Hematology/Oncology, Boston Children's Hospital and Dana-Farber Cancer Institute, Harvard Medical School, Boston, Massachusetts 02115, USA
  - Department of Stem Cell and Regenerative Biology, Harvard Stem Cell Institute, Cambridge, Massachusetts 02138, USA
- Linde Van Aerschot
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
  - Laboratory for Disease Mechanisms in Cancer, KU Leuven, 3000 Leuven, Belgium
- Gert Hulselmans
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Valerie Christiaens
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Samira Makhzami
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Monika Seltenhammer
  - Center for Forensic Medicine, Medical University of Vienna, 1090 Vienna, Austria
  - Division of Livestock Sciences (NUWI) - BOKU University of Natural Resources and Life Sciences, 1180 Vienna, Austria
- Panagiotis Karras
  - VIB-KU Leuven Center for Cancer Biology, 3000 Leuven, Belgium
  - KU Leuven, Department of Oncology KU Leuven, 3000 Leuven, Belgium
- Aline Primot
  - CNRS-University of Rennes 1, UMR6290, Institute of Genetics and Development of Rennes, Faculty of Medicine, 35000 Rennes, France
- Edouard Cadieu
  - CNRS-University of Rennes 1, UMR6290, Institute of Genetics and Development of Rennes, Faculty of Medicine, 35000 Rennes, France
- Ellen van Rooijen
  - Howard Hughes Medical Institute, Stem Cell Program and the Division of Pediatric Hematology/Oncology, Boston Children's Hospital and Dana-Farber Cancer Institute, Harvard Medical School, Boston, Massachusetts 02115, USA
  - Department of Stem Cell and Regenerative Biology, Harvard Stem Cell Institute, Cambridge, Massachusetts 02138, USA
- Jean-Christophe Marine
  - VIB-KU Leuven Center for Cancer Biology, 3000 Leuven, Belgium
  - KU Leuven, Department of Oncology KU Leuven, 3000 Leuven, Belgium
- Giorgia Egidy
  - Université Paris-Saclay, INRA, AgroParisTech, GABI, 78350 Jouy-en-Josas, France
- Ghanem-Elias Ghanem
  - Institut Jules Bordet, Université Libre de Bruxelles, 1000 Brussels, Belgium
- Leonard Zon
  - Howard Hughes Medical Institute, Stem Cell Program and the Division of Pediatric Hematology/Oncology, Boston Children's Hospital and Dana-Farber Cancer Institute, Harvard Medical School, Boston, Massachusetts 02115, USA
  - Department of Stem Cell and Regenerative Biology, Harvard Stem Cell Institute, Cambridge, Massachusetts 02138, USA
- Jasper Wouters
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
- Stein Aerts
  - VIB-KU Leuven Center for Brain and Disease Research, 3000 Leuven, Belgium
  - KU Leuven, Department of Human Genetics KU Leuven, 3000 Leuven, Belgium
52
Chen L, Capra JA. Learning and interpreting the gene regulatory grammar in a deep learning framework. PLoS Comput Biol 2020; 16:e1008334. [PMID: 33137083 PMCID: PMC7660921 DOI: 10.1371/journal.pcbi.1008334] [Received: 12/16/2019] [Revised: 11/12/2020] [Accepted: 09/12/2020] [Indexed: 12/12/2022] [Open Access]
Abstract
Deep neural networks (DNNs) have achieved state-of-the-art performance in identifying gene regulatory sequences, but they have provided limited insight into the biology of regulatory elements due to the difficulty of interpreting the complex features they learn. Several models of how combinatorial binding of transcription factors, i.e. the regulatory grammar, drives enhancer activity have been proposed, ranging from the flexible TF billboard model to the stringent enhanceosome model. However, there is limited knowledge of the prevalence of these (or other) sequence architectures across enhancers. Here we perform several hypothesis-driven analyses to explore the ability of DNNs to learn the regulatory grammar of enhancers. We created synthetic datasets based on existing hypotheses about combinatorial transcription factor binding site (TFBS) patterns, including homotypic clusters, heterotypic clusters, and enhanceosomes, from real TF binding motifs from diverse TF families. We then trained deep residual neural networks (ResNets) to model the sequences under a range of scenarios that reflect real-world multi-label regulatory sequence prediction tasks. We developed a gradient-based unsupervised clustering method to extract the patterns learned by the ResNet models. We demonstrated that simulated regulatory grammars are best learned in the penultimate layer of the ResNets, and the proposed method can accurately retrieve the regulatory grammar even when there is heterogeneity in the enhancer categories and a large fraction of TFBS outside of the regulatory grammar. However, we also identify common scenarios where ResNets fail to learn simulated regulatory grammars. Finally, we applied the proposed method to mouse developmental enhancers and were able to identify the components of a known heterotypic TF cluster. 
Our results provide a framework for interpreting the regulatory rules learned by ResNets, and they demonstrate that the ability and efficiency of ResNets in learning the regulatory grammar depends on the nature of the prediction task.
Affiliation(s)
- Ling Chen
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America
- John A. Capra
  - Department of Biological Sciences, Vanderbilt University, Nashville, TN, United States of America
  - Vanderbilt Genetics Institute and Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States of America
  - Department of Computer Science, Vanderbilt University, Nashville, TN, United States of America
53
Kelley DR. Cross-species regulatory sequence activity prediction. PLoS Comput Biol 2020; 16:e1008050. [PMID: 32687525 PMCID: PMC7392335 DOI: 10.1371/journal.pcbi.1008050] [Received: 10/03/2019] [Revised: 07/30/2020] [Accepted: 06/12/2020] [Indexed: 12/22/2022] [Open Access]
Abstract
Machine learning algorithms trained to predict the regulatory activity of nucleic acid sequences have revealed principles of gene regulation and guided genetic variation analysis. While the human genome has been extensively annotated and studied, model organisms have been less explored. Model organism genomes offer both additional training sequences and unique annotations describing tissue and cell states unavailable in humans. Here, we develop a strategy to train deep convolutional neural networks simultaneously on multiple genomes and apply it to learn sequence predictors for large compendia of human and mouse data. Training on both genomes improves gene expression prediction accuracy on held out and variant sequences. We further demonstrate a novel and powerful approach to apply mouse regulatory models to analyze human genetic variants associated with molecular phenotypes and disease. Together these techniques unleash thousands of non-human epigenetic and transcriptional profiles toward more effective investigation of how gene regulation affects human disease.
Affiliation(s)
- David R. Kelley
  - Calico Life Sciences, South San Francisco, California, United States of America
54
Danilevicz MF, Tay Fernandez CG, Marsh JI, Bayer PE, Edwards D. Plant pangenomics: approaches, applications and advancements. Curr Opin Plant Biol 2020; 54:18-25. [PMID: 31982844 DOI: 10.1016/j.pbi.2019.12.005] [Received: 08/19/2019] [Revised: 12/15/2019] [Accepted: 12/18/2019] [Indexed: 05/05/2023]
Abstract
With the assembly of increasing numbers of plant genomes, it is becoming accepted that a single reference assembly does not reflect the gene diversity of a species. The production of pangenomes, which reflect the structural variation and polymorphisms in genomes, enables in depth comparisons of variation within species or higher taxonomic groups. In this review, we discuss the current and emerging approaches for pangenome assembly, analysis and visualisation. In addition, we consider the potential of pangenomes for applied crop improvement, evolutionary and biodiversity studies. To fully exploit the value of pangenomes it is important to integrate broad information such as phenotypic, environmental, and expression data to gain insights into the role of variable regions within genomes.
Affiliation(s)
- Monica Furaste Danilevicz
  - School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, WA, Australia
- Jacob Ian Marsh
  - School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, WA, Australia
- Philipp Emanuel Bayer
  - School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, WA, Australia
- David Edwards
  - School of Biological Sciences and Institute of Agriculture, University of Western Australia, Perth, WA, Australia
55
Liu H, Duncan K, Helverson A, Kumari P, Mumm C, Xiao Y, Carlson JC, Darbellay F, Visel A, Leslie E, Breheny P, Erives AJ, Cornell RA. Analysis of zebrafish periderm enhancers facilitates identification of a regulatory variant near human KRT8/18. eLife 2020; 9:e51325. [PMID: 32031521 PMCID: PMC7039683 DOI: 10.7554/elife.51325] [Received: 08/25/2019] [Accepted: 02/06/2020] [Indexed: 12/18/2022] [Open Access]
Abstract
Genome-wide association studies for non-syndromic orofacial clefting (OFC) have identified single nucleotide polymorphisms (SNPs) at loci where the presumed risk-relevant gene is expressed in oral periderm. The functional subsets of such SNPs are difficult to predict because the sequence underpinnings of periderm enhancers are unknown. We applied ATAC-seq to models of human palate periderm, including zebrafish periderm, mouse embryonic palate epithelia, and a human oral epithelium cell line, and to complementary mesenchymal cell types. We identified sets of enhancers specific to the epithelial cells and trained gapped-kmer support-vector-machine classifiers on these sets. We used the classifiers to predict the effects of 14 OFC-associated SNPs at 12q13 near KRT18. All the classifiers picked the same SNP as having the strongest effect, but the significance was highest with the classifier trained on zebrafish periderm. Reporter and deletion analyses support this SNP as lying within a periderm enhancer regulating KRT18/KRT8 expression.
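The gapped k-mer representation behind such classifiers can be sketched as follows. This is only an illustration of the feature space, under the assumption that a word of length `l` keeps `k` informative positions and masks the rest; it is not the classifier training itself, and the two alleles shown are invented rather than taken from the 12q13 locus.

```python
from collections import Counter
from itertools import combinations


def gapped_kmers(seq: str, l: int = 4, k: int = 3) -> Counter:
    """Count gapped k-mers in seq.

    Each length-l window contributes one feature per choice of k
    informative positions; the remaining l - k positions are masked
    with '.'.
    """
    counts: Counter = Counter()
    for start in range(len(seq) - l + 1):
        word = seq[start:start + l]
        for keep in combinations(range(l), k):
            keep_set = set(keep)
            feature = "".join(c if i in keep_set else "."
                              for i, c in enumerate(word))
            counts[feature] += 1
    return counts


ref = "ACGTAC"  # invented reference allele context
alt = "ACGGAC"  # invented alternate allele (T>G at index 3)

ref_feats = gapped_kmers(ref)
alt_feats = gapped_kmers(alt)

# Features whose counts differ between the two alleles
changed = {f for f in ref_feats.keys() | alt_feats.keys()
           if ref_feats[f] != alt_feats[f]}
print(sorted(changed))
```

Features that mask the variant position are shared by both alleles, so only the features that read the variant base contribute to a difference in classifier score. With a linear model, that difference would be the weighted sum of the changed feature counts.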
Affiliation(s)
- Huan Liu
  - State Key Laboratory Breeding Base of Basic Science of Stomatology (Hubei-MOST) and Key Laboratory for Oral Biomedicine of Ministry of Education (KLOBM), School and Hospital of Stomatology, Wuhan University, Wuhan, China
  - Department of Anatomy and Cell Biology, University of Iowa, Iowa City, United States
  - Department of Periodontology, School of Stomatology, Wuhan University, Wuhan, China
- Kaylia Duncan
  - Interdisciplinary Program in Molecular Medicine, University of Iowa, Iowa City, United States
- Annika Helverson
  - Department of Anatomy and Cell Biology, University of Iowa, Iowa City, United States
- Priyanka Kumari
  - Department of Anatomy and Cell Biology, University of Iowa, Iowa City, United States
- Camille Mumm
  - Department of Anatomy and Cell Biology, University of Iowa, Iowa City, United States
- Yao Xiao
  - State Key Laboratory Breeding Base of Basic Science of Stomatology (Hubei-MOST) and Key Laboratory for Oral Biomedicine of Ministry of Education (KLOBM), School and Hospital of Stomatology, Wuhan University, Wuhan, China
- Fabrice Darbellay
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley Laboratories, Berkeley, United States
- Axel Visel
  - Environmental Genomics and Systems Biology Division, Lawrence Berkeley Laboratories, Berkeley, United States
  - U.S. Department of Energy Joint Genome Institute, Lawrence Berkeley Laboratories, Berkeley, United States
  - University of California, Merced, Merced, United States
- Elizabeth Leslie
  - Department of Human Genetics, Emory University School of Medicine, Atlanta, Georgia
- Patrick Breheny
  - Department of Biostatistics, University of Iowa, Iowa City, United States
- Albert J Erives
  - Department of Biology, University of Iowa, Iowa City, United States
- Robert A Cornell
  - Department of Anatomy and Cell Biology, University of Iowa, Iowa City, United States
  - Interdisciplinary Program in Molecular Medicine, University of Iowa, Iowa City, United States
56
Tomoyasu Y, Halfon MS. How to study enhancers in non-traditional insect models. J Exp Biol 2020; 223(Suppl 1):jeb212241. [PMID: 32034049 DOI: 10.1242/jeb.212241] [Indexed: 12/12/2022]
Abstract
Transcriptional enhancers are central to the function and evolution of genes and gene regulation. At the organismal level, enhancers play a crucial role in coordinating tissue- and context-dependent gene expression. At the population level, changes in enhancers are thought to be a major driving force that facilitates evolution of diverse traits. An amazing array of diverse traits seen in insect morphology, physiology and behavior has been the subject of research for centuries. Although enhancer studies in insects outside of Drosophila have been limited, recent advances in functional genomic approaches have begun to make such studies possible in an increasing selection of insect species. Here, instead of comprehensively reviewing currently available technologies for enhancer studies in established model organisms such as Drosophila, we focus on a subset of computational and experimental approaches that are likely applicable to non-Drosophila insects, and discuss the pros and cons of each approach. We discuss the importance of validating enhancer function and evaluate several possible validation methods, such as reporter assays and genome editing. Key points and potential pitfalls when establishing a reporter assay system in non-traditional insect models are also discussed. We close with a discussion of how to advance enhancer studies in insects, both by improving computational approaches and by expanding the genetic toolbox in various insects. Through these discussions, this Review provides a conceptual framework for studying the function and evolution of enhancers in non-traditional insect models.
Affiliation(s)
- Marc S Halfon
  - Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
57
Representation learning of genomic sequence motifs with convolutional neural networks. PLoS Comput Biol 2019; 15:e1007560. [PMID: 31856220 PMCID: PMC6941814 DOI: 10.1371/journal.pcbi.1007560] [Received: 04/11/2019] [Revised: 01/03/2020] [Accepted: 11/22/2019] [Indexed: 12/18/2022] [Open Access]
Abstract
Although convolutional neural networks (CNNs) have been applied to a variety of computational genomics problems, there remains a large gap in our understanding of how they build representations of regulatory genomic sequences. Here we perform systematic experiments on synthetic sequences to reveal how CNN architecture, specifically convolutional filter size and max-pooling, influences the extent that sequence motif representations are learned by first layer filters. We find that CNNs designed to foster hierarchical representation learning of sequence motifs (assembling partial features into whole features in deeper layers) tend to learn distributed representations, i.e. partial motifs. On the other hand, CNNs that are designed to limit the ability to hierarchically build sequence motif representations in deeper layers tend to learn more interpretable localist representations, i.e. whole motifs. We then validate that this representation learning principle established from synthetic sequences generalizes to in vivo sequences.
Although deep convolutional neural networks (CNNs) have demonstrated promise across many regulatory genomics prediction tasks, their inner workings largely remain a mystery. Here we empirically demonstrate how CNN architecture influences the extent that representations of sequence motifs are captured by first layer filters. We find that max-pooling and convolutional filter size modulate information flow, controlling the extent that deeper layers can build features hierarchically. CNNs designed to foster hierarchical representation learning tend to capture partial representations of motifs in first layer filters. On the other hand, CNNs that are designed to limit the ability of deeper layers to hierarchically build upon low-level features tend to learn whole representations of motifs in first layer filters. Together, this study enables the design of CNNs that intentionally learn interpretable representations in easier to access first layer filters (with a small tradeoff in performance), versus building harder to interpret distributed representations, both of which have their strengths and limitations.
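How a first-layer filter and max-pooling interact can be made concrete with a small pure-Python sketch. It is a toy illustration and not the architecture studied in the paper: the filter weights, the motif word "TGA", the sequence, and the pool sizes are all assumptions chosen so the arithmetic is easy to follow.

```python
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows (A, C, G, T)."""
    return [[1.0 if b == base else 0.0 for base in BASES] for b in seq]


def conv1d_relu(x, filt):
    """Valid cross-correlation of an (L x 4) input with a (W x 4) filter,
    followed by a ReLU, as in a first convolutional layer."""
    width = len(filt)
    out = []
    for start in range(len(x) - width + 1):
        s = sum(x[start + i][j] * filt[i][j]
                for i in range(width) for j in range(4))
        out.append(max(0.0, s))
    return out


def max_pool(values, size):
    """Non-overlapping max-pooling with window and stride equal to size."""
    return [max(values[i:i + size]) for i in range(0, len(values), size)]


# Hypothetical filter scoring the word "TGA": +1 per match, -1 per mismatch
filt = [[1.0 if base == m else -1.0 for base in BASES] for m in "TGA"]

seq = "ACTGACGG"  # invented sequence containing "TGA" at index 2
acts = conv1d_relu(one_hot(seq), filt)

pooled_small = max_pool(acts, 3)  # keeps coarse positional information
pooled_large = max_pool(acts, 6)  # collapses all positions into one value
print(acts, pooled_small, pooled_large)
```

With the small pool the downstream layer still sees roughly where the filter fired, so it can combine partial features from neighboring positions. With the large pool only the presence of the match survives, which leaves deeper layers little room to assemble motifs from parts.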
58
Newman SA. Inherency of Form and Function in Animal Development and Evolution. Front Physiol 2019; 10:702. [PMID: 31275153 PMCID: PMC6593199 DOI: 10.3389/fphys.2019.00702] [Received: 12/17/2018] [Accepted: 05/20/2019] [Indexed: 12/11/2022] [Open Access]
Abstract
I discuss recent work on the origins of morphology and cell-type diversification in Metazoa – collectively the animals – and propose a scenario for how these two properties became integrated, with the help of a third set of processes, cellular pattern formation, into the developmental programs seen in present-day metazoans. Inherent propensities to generate familiar forms and cell types, in essence a parts kit for the animals, are exhibited by present-day organisms and were likely more prominent in primitive ones. The structural motifs of animal bodies and organs, e.g., multilayered, hollow, elongated and segmented tissues, internal and external appendages, branched tubes, and modular endoskeletons, can be accounted for by the properties of mesoscale masses of metazoan cells. These material properties, in turn, resulted from the recruitment of “generic” physical forces and mechanisms – adhesion, contraction, polarity, chemical oscillation, diffusion – by toolkit molecules that were partly conserved from unicellular holozoan antecedents and partly novel, distributed in the different metazoan phyla in a fashion correlated with morphological complexity. The specialized functions of the terminally differentiated cell types in animals, e.g., contraction, excitability, barrier function, detoxification, excretion, were already present in ancestral unicellular organisms. These functions were implemented in metazoan differentiation in some cases using the same transcription factors as in single-celled ancestors, although controlled by regulatory mechanisms that were hybrids between earlier-evolved processes and regulatory innovations, such as enhancers. Cellular pattern formation, mediated by released morphogens interacting with biochemically responsive and excitable tissues, drew on inherent self-organizing processes in proto-metazoans to transform clusters of holozoan cells into animal embryos and organs.
Affiliation(s)
- Stuart A Newman
  - Department of Cell Biology and Anatomy, New York Medical College, Valhalla, NY, United States