1. Barbadilla-Martínez L, Klaassen N, van Steensel B, de Ridder J. Predicting gene expression from DNA sequence using deep learning models. Nat Rev Genet 2025:10.1038/s41576-025-00841-2. [PMID: 40360798 DOI: 10.1038/s41576-025-00841-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 04/01/2025] [Indexed: 05/15/2025]
Abstract
Transcription of genes is regulated by DNA elements such as promoters and enhancers, the activity of which are in turn controlled by many transcription factors. Owing to the highly complex combinatorial logic involved, it has been difficult to construct computational models that predict gene activity from DNA sequence. Recent advances in deep learning techniques applied to data from epigenome mapping and high-throughput reporter assays have made substantial progress towards addressing this complexity. Such models can capture the regulatory grammar with remarkable accuracy and show great promise in predicting the effects of non-coding variants, uncovering detailed molecular mechanisms of gene regulation and designing synthetic regulatory elements for biotechnology. Here, we discuss the principles of these approaches, the types of training data sets that are available and the strengths and limitations of different approaches.
Affiliation(s)
- Lucía Barbadilla-Martínez
  - Oncode Institute, Utrecht, The Netherlands
  - Center for Molecular Medicine, UMC Utrecht, Utrecht, The Netherlands
- Noud Klaassen
  - Oncode Institute, Utrecht, The Netherlands
  - Division of Molecular Genetics, Netherlands Cancer Institute, Amsterdam, The Netherlands
- Bas van Steensel
  - Oncode Institute, Utrecht, The Netherlands
  - Division of Molecular Genetics, Netherlands Cancer Institute, Amsterdam, The Netherlands
- Jeroen de Ridder
  - Oncode Institute, Utrecht, The Netherlands
  - Center for Molecular Medicine, UMC Utrecht, Utrecht, The Netherlands

2. Li Z, Zeng S, Du Q, Li X, Chen Q, Zhang S, Zhou X, Li H, Jiang A, Wang X, Shang P, Li M, Long K. The repression of the lipolytic inhibitor G0s2 enhancers affects lipid metabolism. Gene 2025; 938:149162. [PMID: 39667714 DOI: 10.1016/j.gene.2024.149162] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/12/2024] [Revised: 11/25/2024] [Accepted: 12/09/2024] [Indexed: 12/14/2024]
Abstract
The G0/G1 switch gene 2 (G0s2) is a selective inhibitor of adipose triglyceride lipase (ATGL) which is the rate-limiting enzyme for triglycerides (TGs) hydrolysis in adipocytes, and regulates the mobilization of TGs in adipocytes and hepatocytes. The expression and functional disorders of G0S2 are associated with various metabolic diseases and related pathological states, such as obesity and metabolic syndrome and non-alcoholic fatty liver disease (NAFLD). However, the extent to which the transcriptional regulatory mechanisms mediated by the interaction between the G0s2 gene promoter and enhancer regions are involved remains unknown. Here, through the analysis of epigenomic data (H3K27ac, H3K4me1, and DHS-seq) and luciferase reporter assays, we identified three active enhancers of G0s2 in 3T3-L1 adipocytes. Subsequently, using the dCas9-KRAB system for epigenetic inhibition of G0S2-En2, -En4, and -En5 revealed the functional role of these enhancers in regulating G0s2 expression and lipid droplet biosynthesis. Additionally, transcriptome analyses revealed that inhibition of G0S2-En5 downregulated pathways associated with lipid metabolism and lipid biosynthesis. Furthermore, overexpression of transcription factors (TFs) and motif mutation experiments identified that PPARG and RXRA regulate the activity of G0S2-En5. Taken together, we identified functional enhancers regulating G0s2 expression and elucidated the important role of the G0S2-En5 in lipid droplet biogenesis.
Affiliation(s)
- Ziqi Li
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu 611130, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130, China
- Sha Zeng
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu 611130, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130, China
- Qinjiao Du
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu 611130, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130, China
- Xiaokai Li
  - Chongqing Academy of Animal Sciences, Chongqing 402460, China
  - National Center of Technology Innovation for Pigs, Chongqing 402460, China
- Qiuyue Chen
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu 611130, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130, China
- Songling Zhang
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu 611130, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130, China
- Xun Zhou
  - College of Veterinary Medicine, Sichuan Agricultural University, Chengdu 611130, China
- Haohuan Li
  - College of Veterinary Medicine, Sichuan Agricultural University, Chengdu 611130, China
- Anan Jiang
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu 611130, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130, China
- Xun Wang
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu 611130, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130, China
- Peng Shang
  - Animal Science College, Tibet Agriculture and Animal Husbandry University, Linzhi, 860000, China
- Mingzhou Li
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu 611130, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130, China
- Keren Long
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu 611130, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu 611130, China
  - Chongqing Academy of Animal Sciences, Chongqing 402460, China
  - National Center of Technology Innovation for Pigs, Chongqing 402460, China

3. Ferrando-Bernal M, Brand CM, Capra JA. Inferring human phenotypes using ancient DNA: from molecules to populations. Curr Opin Genet Dev 2025; 90:102283. [PMID: 39612613 DOI: 10.1016/j.gde.2024.102283] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2024] [Revised: 10/04/2024] [Accepted: 11/04/2024] [Indexed: 12/01/2024]
Abstract
The increasing availability of ancient DNA (aDNA) from human groups across space and time has yielded deep insights into the movements of our species. However, given the challenges of mapping from genotype to phenotype, aDNA has revealed less about the phenotypes of ancient individuals. In this review, we highlight recent advances in inferring ancient phenotypes - from the molecular to population scale - with a focus on applications enabled by new machine learning approaches. The genetic architecture of complex traits across human groups suggests that the prediction of individual-level complex traits, like behavior or disease risk, is often challenging across the relevant evolutionary distances. Thus, we propose an approach that integrates predictions of molecular phenotypes, whose mechanisms are more conserved, with nongenetic data.
Affiliation(s)
- Manuel Ferrando-Bernal
  - Bakar Computational Health Science Institute, University of California San Francisco, San Francisco, CA, USA
  - Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, USA
- Colin M Brand
  - Bakar Computational Health Science Institute, University of California San Francisco, San Francisco, CA, USA
  - Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, USA
- John A Capra
  - Bakar Computational Health Science Institute, University of California San Francisco, San Francisco, CA, USA
  - Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA, USA

4. Zeng S, Li Z, Li X, Du Q, Zhang Y, Zhong Z, Wang H, Zhang S, Li P, Li H, Chen L, Jiang A, Shang P, Li M, Long K. Inhibition of triglyceride metabolism-associated enhancers alters lipid deposition during adipocyte differentiation. FASEB J 2025; 39:e70347. [PMID: 39873971 PMCID: PMC11774232 DOI: 10.1096/fj.202401137r] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Revised: 12/28/2024] [Accepted: 01/09/2025] [Indexed: 01/30/2025]
Abstract
Triglyceride (TG) metabolism is a complex and highly coordinated biological process regulated by a series of genes, and its dysregulation can lead to the occurrence of disorders in lipid metabolism. However, the transcriptional regulatory mechanisms of crucial genes in TG metabolism mediated by enhancer-promoter interactions remain elusive. Here, we identified candidate enhancers regulating the Agpat2, Dgat1, Dgat2, Pnpla2, and Lipe genes in 3T3-L1 adipocytes by integrating epigenomic data (H3K27ac, H3K4me1, and DHS-seq) with chromatin three-dimensional interaction data. Luciferase reporter assays revealed that 11 enhancers exhibited fluorescence activity. The repression of enhancers using the dCas9-KRAB system revealed the functional roles of enhancers of Dgat2 and Pnpla2 in regulating their expression and TG metabolism. Furthermore, transcriptome analyses revealed that inhibition of Dgat2-En4 downregulated pathways associated with lipid metabolism, lipid biosynthesis, and adipocyte differentiation. Additionally, overexpression and motif mutation experiments of transcription factor found that two TFs, PPARG and RXRA, regulate the activity of Agpat2-En1, Dgat2-En4, and Pnpla2-En5. Our study identified functional enhancers regulating TG metabolism and elucidated potential regulatory mechanisms of TG deposition from enhancer-promoter interactions, providing insights into understanding lipid deposition.
Affiliation(s)
- Sha Zeng
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Ziqi Li
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Xiaokai Li
  - Chongqing Academy of Animal Sciences, Chongqing, China
  - National Center of Technology Innovation for Pigs, Chongqing, China
- Qinjiao Du
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Yu Zhang
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Zhining Zhong
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Haoming Wang
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Songling Zhang
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Penghao Li
  - Jinxin Research Institute for Reproductive Medicine and Genetics, Sichuan Jinxin Xi'nan Women's and Children's Hospital, Chengdu, China
- Haohuan Li
  - College of Veterinary Medicine, Sichuan Agricultural University, Chengdu, China
- Li Chen
  - Chongqing Academy of Animal Sciences, Chongqing, China
  - National Center of Technology Innovation for Pigs, Chongqing, China
- Anan Jiang
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Peng Shang
  - Animal Science College, Tibet Agriculture and Animal Husbandry University, Linzhi, China
- Mingzhou Li
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Keren Long
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
  - Chongqing Academy of Animal Sciences, Chongqing, China
  - National Center of Technology Innovation for Pigs, Chongqing, China

5. Zuo B, Chen R, Tang X, Shao Y, Liu X, Nneji LM, Sun Y. Genomic Insights Into Genetic Basis of Evolutionary Conservatism and Innovation in Frogs. Integr Zool 2024. [PMID: 39663509 DOI: 10.1111/1749-4877.12931] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2024] [Revised: 10/12/2024] [Accepted: 11/11/2024] [Indexed: 12/13/2024]
Abstract
Examining closely related species evolving in similar environments offers valuable insights into the mechanisms driving phylogenetic conservatism and evolutionary lability. This can elucidate the intricate relationship between inheritance and environmental factors. Nonetheless, the precise genomic dynamics and molecular underpinnings of this process remain enigmatic. This study explores the evolutionary conservatism and adaptation exhibited by two closely related high-altitude frog species: Nanorana parkeri and N. pleskei. We assembled a high-quality genome for Tibetan N. pleskei and compared it to the genomes of N. parkeri and their lowland relatives. Our findings reveal that these two Tibetan frog species diverged approximately 16.6 million years ago, pointing to a possible ancestral colonization of high-elevation habitats. Following this colonization, significant adaptive evolution occurred in both coding and non-coding regions of the ancestral lineage. This evolution led to notable phenotypic alterations, as evidenced by the reduced body size. Also, due to purifying selection, most ancestral adaptive features persisted in descendant species, indicating a strong element of evolutionary conservatism. However, descendant species evolved novel adaptations to exacerbated environmental challenges in the Tibet Plateau, mainly related to hypoxia response. Furthermore, our analysis underscores the critical role of regulatory variations in descendant adaptive evolution. Notably, hub genes in networks, such as EGLN3, accumulated more variations in regulatory regions as they were transmitted from ancestors to descendants. In sum, our study sheds light on the profound and lasting impact of genetic heritage on species' adaptive evolution.
Affiliation(s)
- Bin Zuo
  - Ministry of Education Key Laboratory for Transboundary Ecosecurity of Southwest China, Yunnan Key Laboratory of Plant Reproductive Adaptation and Evolutionary Ecology and Centre for Invasion Biology, Institute of Biodiversity, School of Ecology and Environmental Science, Yunnan University, Kunming, Yunnan, China
- Rongmei Chen
  - Ministry of Education Key Laboratory for Transboundary Ecosecurity of Southwest China, Yunnan Key Laboratory of Plant Reproductive Adaptation and Evolutionary Ecology and Centre for Invasion Biology, Institute of Biodiversity, School of Ecology and Environmental Science, Yunnan University, Kunming, Yunnan, China
- Xiaolong Tang
  - Department of Animal and Biomedical Sciences, School of Life Science, Lanzhou University, Lanzhou, China
- Yong Shao
  - State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
- Xiaolong Liu
  - School of Life Sciences, Southwest University, Chongqing, China
- Lotanna M Nneji
  - Department of Biology, Howard University, Washington, DC, USA
- Yanbo Sun
  - Ministry of Education Key Laboratory for Transboundary Ecosecurity of Southwest China, Yunnan Key Laboratory of Plant Reproductive Adaptation and Evolutionary Ecology and Centre for Invasion Biology, Institute of Biodiversity, School of Ecology and Environmental Science, Yunnan University, Kunming, Yunnan, China
  - Laboratory for Conservation and Utilization of Bio-resources, Yunnan University, Kunming, China
  - Southwest United Graduate School, Kunming, China

6. Ren YY, Liu Z. Characterization of Single-Cell Cis-regulatory Elements Informs Implications for Cell Differentiation. Genome Biol Evol 2024; 16:evae241. [PMID: 39506564 PMCID: PMC11580522 DOI: 10.1093/gbe/evae241] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/31/2024] [Revised: 10/17/2024] [Accepted: 11/04/2024] [Indexed: 11/08/2024] Open
Abstract
Cis-regulatory elements govern the specific patterns and dynamics of gene expression in cells during development, which are the fundamental mechanisms behind cell differentiation. However, the genomic characteristics of single-cell cis-regulatory elements closely linked to cell differentiation during development remain unclear. To explore this, we systematically analyzed ∼250,000 putative single-cell cis-regulatory elements obtained from snATAC-seq analysis of the developing mouse cerebellum. We found that over 80% of these single-cell cis-regulatory elements show pleiotropic effects, being active in 2 or more cell types. The pleiotropic degrees of proximal and distal single-cell cis-regulatory elements are positively correlated with the density and diversity of transcription factor binding motifs and GC content. There is a negative correlation between the pleiotropic degrees of single-cell cis-regulatory elements and their distances to the nearest transcription start sites, and proximal single-cell cis-regulatory elements display higher relevance strengths than distal ones. Furthermore, both proximal and distal single-cell cis-regulatory elements related to cell differentiation exhibit enhanced sequence-level evolutionary conservation, increased density and diversity of transcription factor binding motifs, elevated GC content, and greater distances from their nearest genes. Together, our findings reveal the general genomic characteristics of putative single-cell cis-regulatory elements and provide insights into the genomic and evolutionary mechanisms by which single-cell cis-regulatory elements regulate cell differentiation during development.
Affiliation(s)
- Ying-Ying Ren
  - State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
  - Key Laboratory of Genetic Evolution & Animal Models, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
- Zhen Liu
  - State Key Laboratory of Genetic Resources and Evolution, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
  - Key Laboratory of Genetic Evolution & Animal Models, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China
  - Yunnan Key Laboratory of Biodiversity Information, Kunming Institute of Zoology, Chinese Academy of Sciences, Kunming, China

7. Zhang Q, Wang S, Li Z, Pan Y, Huang D. Cross-Species Prediction of Transcription Factor Binding by Adversarial Training of a Novel Nucleotide-Level Deep Neural Network. Adv Sci (Weinh) 2024; 11:e2405685. [PMID: 39076052 PMCID: PMC11423150 DOI: 10.1002/advs.202405685] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/23/2024] [Indexed: 07/31/2024]
Abstract
Cross-species prediction of TF binding remains a major challenge due to the rapid evolutionary turnover of individual TF binding sites, resulting in cross-species predictive performance being consistently worse than within-species performance. In this study, a novel Nucleotide-Level Deep Neural Network (NLDNN) is first proposed to predict TF binding within or across species. NLDNN regards the task of TF binding prediction as a nucleotide-level regression task, which takes DNA sequences as input and directly predicts experimental coverage values. Beyond predictive performance, it also assesses model performance by locating potential TF binding regions, discriminating TF-specific single-nucleotide polymorphisms (SNPs), and identifying causal disease-associated SNPs. The experimental results show that NLDNN outperforms the competing methods in these tasks. Then, a dual-path framework is designed for adversarial training of NLDNN to further improve the cross-species prediction performance by pulling the domain space of human and mouse species closer. Through comparison and analysis, it finds that adversarial training not only can improve the cross-species prediction performance between humans and mice but also enhance the ability to locate TF binding regions and discriminate TF-specific SNPs. By visualizing the predictions, it is figured out that the framework corrects some mispredictions by amplifying the coverage values of incorrectly predicted peaks.
Affiliation(s)
- Qinhu Zhang
  - Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo 315201, China
  - Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei 230021, China
  - Big Data and Intelligent Computing Research Center, Guangxi Academy of Science, Nanning 530007, China
- Siguo Wang
  - Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo 315201, China
- Zhipeng Li
  - Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo 315201, China
- Yijie Pan
  - Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo 315201, China
- De‐Shuang Huang
  - Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo 315201, China
  - Institute for Regenerative Medicine, Shanghai East Hospital, Tongji University, Shanghai 200092, China

8. Oh JW, Beer MA. Gapped-kmer sequence modeling robustly identifies regulatory vocabularies and distal enhancers conserved between evolutionarily distant mammals. Nat Commun 2024; 15:6464. [PMID: 39085231 PMCID: PMC11291912 DOI: 10.1038/s41467-024-50708-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Accepted: 07/17/2024] [Indexed: 08/02/2024] Open
Abstract
Gene regulatory elements drive complex biological phenomena and their mutations are associated with common human diseases. The impacts of human regulatory variants are often tested using model organisms such as mice. However, mapping human enhancers to conserved elements in mice remains a challenge, due to both rapid enhancer evolution and limitations of current computational methods. We analyze distal enhancers across 45 matched human/mouse cell/tissue pairs from a comprehensive dataset of DNase-seq experiments, and show that while cell-specific regulatory vocabulary is conserved, enhancers evolve more rapidly than promoters and CTCF binding sites. Enhancer conservation rates vary across cell types, in part explainable by tissue specific transposable element activity. We present an improved genome alignment algorithm using gapped-kmer features, called gkm-align, and make genome wide predictions for 1,401,803 orthologous regulatory elements. We show that gkm-align discovers 23,660 novel human/mouse conserved enhancers missed by previous algorithms, with strong evidence of conserved functional activity.
Affiliation(s)
- Jin Woo Oh
  - Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA
- Michael A Beer
  - Department of Biomedical Engineering and McKusick-Nathans Department of Genetic Medicine, Johns Hopkins University, Baltimore, MD, USA

9. Buckley RM, Ostrander EA. Large-scale genomic analysis of the domestic dog informs biological discovery. Genome Res 2024; 34:811-821. [PMID: 38955465 PMCID: PMC11293549 DOI: 10.1101/gr.278569.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2024]
Abstract
Recent advances in genomics, coupled with a unique population structure and remarkable levels of variation, have propelled the domestic dog to new levels as a system for understanding fundamental principles in mammalian biology. Central to this advance are more than 350 recognized breeds, each a closed population that has undergone selection for unique features. Genetic variation in the domestic dog is particularly well characterized compared with other domestic mammals, with almost 3000 high-coverage genomes publicly available. Importantly, as the number of sequenced genomes increases, new avenues for analysis are becoming available. Herein, we discuss recent discoveries in canine genomics regarding behavior, morphology, and disease susceptibility. We explore the limitations of current data sets for variant interpretation, tradeoffs between sequencing strategies, and the burgeoning role of long-read genomes for capturing structural variants. In addition, we consider how large-scale collections of whole-genome sequence data drive rare variant discovery and assess the geographic distribution of canine diversity, which identifies Asia as a major source of missing variation. Finally, we review recent comparative genomic analyses that will facilitate annotation of the noncoding genome in dogs.
Affiliation(s)
- Reuben M Buckley
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA
- Elaine A Ostrander
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, Maryland 20892, USA

10. Li X, Zeng S, Chen L, Zhang Y, Li X, Zhang B, Su D, Du Q, Zhang J, Wang H, Zhong Z, Zhang J, Li P, Jiang A, Long K, Li M, Ge L. An intronic enhancer of Cebpa regulates adipocyte differentiation and adipose tissue development via long-range loop formation. Cell Prolif 2024; 57:e13552. [PMID: 37905345 PMCID: PMC10905358 DOI: 10.1111/cpr.13552] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 04/24/2023] [Revised: 08/29/2023] [Accepted: 09/11/2023] [Indexed: 11/02/2023] Open
Abstract
Cebpa is a master transcription factor gene for adipogenesis. However, the mechanisms of enhancer-promoter chromatin interactions controlling Cebpa transcriptional regulation during adipogenic differentiation remain largely unknown. To reveal how the three-dimensional structure of Cebpa changes during adipogenesis, we generated high-resolution chromatin interactions of Cebpa in 3T3-L1 preadipocytes and 3T3-L1 adipocytes using circularized chromosome conformation capture sequencing (4C-seq). We revealed dramatic changes in chromatin interactions and chromatin status at interaction sites during adipogenic differentiation. Based on this, we identified five active enhancers of Cebpa in 3T3-L1 adipocytes through epigenomic data and luciferase reporter assays. Next, epigenetic repression of Cebpa-L1-AD-En2 or -En3 by the dCas9-KRAB system significantly down-regulated Cebpa expression and inhibited adipocyte differentiation. Furthermore, experimental depletion of cohesin decreased the interaction intensity between Cebpa-L1-AD-En2 and the Cebpa promoter and down-regulated Cebpa expression, indicating that long-range chromatin loop formation was mediated by cohesin. Two transcription factors, RXRA and PPARG, synergistically regulate the activity of Cebpa-L1-AD-En2. To test whether Cebpa-L1-AD-En2 plays a role in adipose tissue development, we injected dCas9-KRAB-En2 lentivirus into the inguinal white adipose tissue (iWAT) of mice to suppress the activity of Cebpa-L1-AD-En2. Repression of Cebpa-L1-AD-En2 significantly decreased Cebpa expression and adipocyte size, altered iWAT transcriptome, and affected iWAT development. We identified functional enhancers regulating Cebpa expression and clarified the crucial roles of Cebpa-L1-AD-En2 and Cebpa promoter interaction in adipocyte differentiation and adipose tissue development.
Affiliation(s)
- Xiaokai Li
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Sha Zeng
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Li Chen
  - Chongqing Academy of Animal Sciences, Chongqing, China
  - National Center of Technology Innovation for Pigs, Chongqing, China
  - Key Laboratory of Pig Industry Science, Ministry of Agriculture, Chongqing, China
- Yu Zhang
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Xuemin Li
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Biwei Zhang
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Duo Su
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Qinjiao Du
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Jiaman Zhang
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Haoming Wang
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Zhining Zhong
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Jinwei Zhang
  - Chongqing Academy of Animal Sciences, Chongqing, China
  - National Center of Technology Innovation for Pigs, Chongqing, China
  - Key Laboratory of Pig Industry Science, Ministry of Agriculture, Chongqing, China
- Penghao Li
  - Jinxin Research Institute for Reproductive Medicine and Genetics, Sichuan Jinxin Xi'nan Women's and Children's Hospital, Chengdu, China
- Anan Jiang
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
  - Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Keren Long
  - State Key Laboratory of Swine and Poultry Breeding Industry, Sichuan Agricultural University, Chengdu, China
- Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and TechnologySichuan Agricultural UniversityChengduChina
- Chongqing Academy of Animal SciencesChongqingChina
| | - Mingzhou Li
- State Key Laboratory of Swine and Poultry Breeding IndustrySichuan Agricultural UniversityChengduChina
- Livestock and Poultry Multi‐omics Key Laboratory of Ministry of Agriculture and Rural Affairs, College of Animal Science and TechnologySichuan Agricultural UniversityChengduChina
| | - Liangpeng Ge
- Chongqing Academy of Animal SciencesChongqingChina
- National Center of Technology Innovation for PigsChongqingChina
- Key Laboratory of Pig Industry ScienceMinistry of AgricultureChongqingChina
| |
|
11
|
Ferguson CA, Firulli BA, Zoia M, Osterwalder M, Firulli AB. Identification and characterization of Hand2 upstream genomic enhancers active in developing stomach and limbs. Dev Dyn 2024; 253:215-232. [PMID: 37551791 PMCID: PMC11365009 DOI: 10.1002/dvdy.646] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/02/2023] [Revised: 07/20/2023] [Accepted: 07/25/2023] [Indexed: 08/09/2023] Open
Abstract
BACKGROUND The bHLH transcription factor HAND2 plays important roles in the development of the embryonic heart, face, limbs, and sympathetic and enteric nervous systems. Defining how and when HAND2 regulates these developmental systems requires understanding the transcriptional regulation of Hand2. RESULTS Remarkably, Hand2 is flanked by an extensive upstream gene desert containing a potentially diverse enhancer landscape. Here, we screened the regulatory interval 200 kb proximal to Hand2 for putative enhancers using evolutionary conservation and histone marks in Hand2-expressing tissues. H3K27ac signatures across embryonic tissues pointed to only two putative enhancer regions showing deep sequence conservation. Assessment of the transcriptional enhancer potential of these elements using transgenic reporter lines uncovered distinct in vivo enhancer activities in embryonic stomach and limb mesenchyme, respectively. Activity of the identified stomach enhancer was restricted to the developing antrum and showed expression within the smooth muscle and enteric neurons. Surprisingly, the activity pattern of the limb enhancer did not overlap Hand2 mRNA but consistently yielded a defined subectodermal anterior expression pattern within multiple transgenic lines. CONCLUSIONS Together, these results start to uncover the diverse regulatory potential inherent to the Hand2 upstream regulatory interval.
Affiliation(s)
- Chloe A. Ferguson
- Herman B Wells Center for Pediatric Research Department of Pediatrics, Anatomy, Biochemistry, and Medical and Molecular Genetics, Indiana University School of Medicine, 1044 W. Walnut St., Indianapolis, IN 46202-5225, USA
| | - Beth A. Firulli
- Herman B Wells Center for Pediatric Research Department of Pediatrics, Anatomy, Biochemistry, and Medical and Molecular Genetics, Indiana University School of Medicine, 1044 W. Walnut St., Indianapolis, IN 46202-5225, USA
| | - Matteo Zoia
- Department for BioMedical Research (DBMR), University of Bern, Bern, Switzerland
| | - Marco Osterwalder
- Department for BioMedical Research (DBMR), University of Bern, Bern, Switzerland
- Department of Cardiology, Bern University Hospital, Bern, Switzerland
| | - Anthony B. Firulli
- Herman B Wells Center for Pediatric Research Department of Pediatrics, Anatomy, Biochemistry, and Medical and Molecular Genetics, Indiana University School of Medicine, 1044 W. Walnut St., Indianapolis, IN 46202-5225, USA
| |
|
12
|
Garza AB, Garcia R, Solis LM, Halfon MS, Girgis HZ. EnhancerTracker: Comparing cell-type-specific enhancer activity of DNA sequence triplets via an ensemble of deep convolutional neural networks. bioRxiv 2023:2023.12.23.573198. [PMID: 38187673 PMCID: PMC10769370 DOI: 10.1101/2023.12.23.573198] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/09/2024]
Abstract
Motivation Transcriptional enhancers, unlike promoters, are unrestrained by distance or strand orientation with respect to their target genes, making their computational identification a challenge. Further, there are insufficient numbers of confirmed enhancers for many cell types, preventing robust training of machine-learning-based models for enhancer prediction for such cell types. Results We present EnhancerTracker, a novel tool that leverages an ensemble of deep separable convolutional neural networks to identify cell-type-specific enhancers while requiring only two confirmed enhancers. EnhancerTracker is trained, validated, and tested on 52,789 putative enhancers obtained from the FANTOM5 Project and control sequences derived from the human genome. Unlike available tools, which accept one sequence at a time, the input to our tool is three sequences; the first two are enhancers active in the same cell type. EnhancerTracker outputs 1 if the third sequence is an enhancer active in the same cell type(s) where the first two enhancers are active. It outputs 0 otherwise. On a held-out set (15%), EnhancerTracker achieved an accuracy of 64%, a specificity of 93%, a recall of 35%, a precision of 84%, and an F1 score of 49%. Availability and implementation https://github.com/BioinformaticsToolsmith/EnhancerTracker. Contact hani.girgis@tamuk.edu.
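The metrics reported in this abstract can be cross-checked against one another. The sketch below recomputes the F1 score from the stated precision and recall, and the accuracy from the stated recall and specificity. The second check assumes the held-out set contains equal numbers of positive and negative triplets; that balance is an assumption made here for illustration and is not stated in the abstract. The function names are hypothetical.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy(recall: float, specificity: float) -> float:
    """Mean of recall and specificity.

    This equals plain accuracy only when positives and negatives are
    equally frequent (an assumption here, not a fact from the abstract).
    """
    return (recall + specificity) / 2


# Values as reported in the abstract.
precision, recall, specificity = 0.84, 0.35, 0.93

f1 = f1_score(precision, recall)
bal_acc = balanced_accuracy(recall, specificity)

print(f"F1 = {f1:.2f}")
print(f"accuracy under the balanced-set assumption = {bal_acc:.2f}")
```

Both values round to the reported figures (49% and 64%), which suggests the reported numbers are mutually consistent under that assumption. The low recall next to the high specificity means the tool misses many true enhancers but rarely flags a control sequence.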
|
13
|
Mao J, Cao Y, Zhang Y, Huang B, Zhao Y. A novel method for identifying key genes in macroevolution based on deep learning with attention mechanism. Sci Rep 2023; 13:19727. [PMID: 37957311 PMCID: PMC10643560 DOI: 10.1038/s41598-023-47113-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Accepted: 11/09/2023] [Indexed: 11/15/2023] Open
Abstract
Macroevolution can be regarded as the result of evolutionary changes of synergistically acting genes. Unfortunately, the importance of these genes in macroevolution is difficult to assess and hence the identification of macroevolutionary key genes is a major challenge in evolutionary biology. In this study, we designed various word embedding libraries of natural language processing (NLP) considering the multiple mechanisms of evolutionary genomics. A novel method (IKGM) based on three types of attention mechanisms (domain attention, kmer attention and fused attention) was proposed to calculate the weights of different genes in macroevolution. Taking 34 species of diurnal butterflies and nocturnal moths in Lepidoptera as an example, we identified a few key genes with high weights, which were annotated with functions related to circadian rhythms, sensory organs and behavioral habits. This study not only provides a novel method to identify the key genes of macroevolution at the genomic level, but also helps us to understand the microevolution mechanisms of diurnal butterflies and nocturnal moths in Lepidoptera.
Affiliation(s)
- Jiawei Mao
- College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
| | - Yong Cao
- College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
| | - Yan Zhang
- College of Mathematics and Physics, Southwest Forestry University, Kunming, 650224, China
| | - Biaosheng Huang
- College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China
| | - Youjie Zhao
- College of Big Data and Intelligent Engineering, Southwest Forestry University, Kunming, 650224, China.
| |
|
14
|
Müller-Dott S, Tsirvouli E, Vazquez M, Ramirez Flores R, Badia-i-Mompel P, Fallegger R, Türei D, Lægreid A, Saez-Rodriguez J. Expanding the coverage of regulons from high-confidence prior knowledge for accurate estimation of transcription factor activities. Nucleic Acids Res 2023; 51:10934-10949. [PMID: 37843125 PMCID: PMC10639077 DOI: 10.1093/nar/gkad841] [Citation(s) in RCA: 61] [Impact Index Per Article: 30.5] [Received: 03/31/2023] [Revised: 08/08/2023] [Accepted: 09/22/2023] [Indexed: 10/17/2023] Open
Abstract
Gene regulation plays a critical role in the cellular processes that underlie human health and disease. The regulatory relationship between transcription factors (TFs), key regulators of gene expression, and their target genes, the so-called TF regulons, can be coupled with computational algorithms to estimate the activity of TFs. However, to interpret these findings accurately, regulons of high reliability and coverage are needed. In this study, we present and evaluate a collection of regulons created using the CollecTRI meta-resource containing signed TF-gene interactions for 1186 TFs. In this context, we introduce a workflow to integrate information from multiple resources and assign the sign of regulation to TF-gene interactions that could be applied to other comprehensive knowledge bases. We find that the signed CollecTRI-derived regulons outperform other public collections of regulatory interactions in accurately inferring changes in TF activities in perturbation experiments. Furthermore, we showcase the value of the regulons by examining TF activity profiles in three different cancer types and exploring TF activities at the level of single cells. Overall, the CollecTRI-derived TF regulons enable the accurate and comprehensive estimation of TF activities and thereby help to interpret transcriptomics data.
Affiliation(s)
- Sophia Müller-Dott
- Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
| | - Eirini Tsirvouli
- Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
- Department of Biology, Norwegian University of Science and Technology, Trondheim, Norway
| | | | - Ricardo O Ramirez Flores
- Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
| | - Pau Badia-i-Mompel
- Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
| | - Robin Fallegger
- Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
| | - Dénes Türei
- Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
| | - Astrid Lægreid
- Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
| | - Julio Saez-Rodriguez
- Heidelberg University, Faculty of Medicine, and Heidelberg University Hospital, Institute for Computational Biomedicine, Bioquant, Heidelberg, Germany
| |
|
15
|
Khodursky S, Zheng EB, Svetec N, Durkin SM, Benjamin S, Gadau A, Wu X, Zhao L. The evolution and mutational robustness of chromatin accessibility in Drosophila. Genome Biol 2023; 24:232. [PMID: 37845780 PMCID: PMC10578003 DOI: 10.1186/s13059-023-03079-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2022] [Accepted: 09/29/2023] [Indexed: 10/18/2023] Open
Abstract
BACKGROUND The evolution of genomic regulatory regions plays a critical role in shaping the diversity of life. While this process is primarily sequence-dependent, the enormous complexity of biological systems complicates the understanding of the factors underlying regulation and its evolution. Here, we apply deep neural networks as a tool to investigate the sequence determinants underlying chromatin accessibility in different species and tissues of Drosophila. RESULTS We train hybrid convolution-attention neural networks to accurately predict ATAC-seq peaks using only local DNA sequences as input. We show that our models generalize well across substantially evolutionarily diverged species of insects, implying that the sequence determinants of accessibility are highly conserved. Using our model to examine species-specific gains in accessibility, we find evidence suggesting that these regions may be ancestrally poised for evolution. Using in silico mutagenesis, we show that accessibility can be accurately predicted from short subsequences in each example. However, in silico knock-out of these sequences does not qualitatively impair classification, implying that accessibility is mutationally robust. Subsequently, we show that accessibility is predicted to be robust to large-scale random mutation even in the absence of selection. Conversely, simulations under strong selection demonstrate that accessibility can be extremely malleable despite its robustness. Finally, we identify motifs predictive of accessibility, recovering both novel and previously known motifs. CONCLUSIONS These results demonstrate the conservation of the sequence determinants of accessibility and the general robustness of chromatin accessibility, as well as the power of deep neural networks to explore fundamental questions in regulatory genomics and evolution.
Affiliation(s)
- Samuel Khodursky
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
| | - Eric B Zheng
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
| | - Nicolas Svetec
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
| | - Sylvia M Durkin
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
- Present Address: Department of Integrative Biology and Museum of Vertebrate Zoology, University of California, Berkeley, Berkeley, CA, USA
| | - Sigi Benjamin
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
| | - Alice Gadau
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
| | - Xia Wu
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA
| | - Li Zhao
- Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY, 10065, USA.
| |
|
16
|
Kleinschmidt H, Xu C, Bai L. Using Synthetic DNA Libraries to Investigate Chromatin and Gene Regulation. Chromosoma 2023; 132:167-189. [PMID: 37184694 PMCID: PMC10542970 DOI: 10.1007/s00412-023-00796-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2023] [Revised: 04/25/2023] [Accepted: 04/26/2023] [Indexed: 05/16/2023]
Abstract
Despite the recent explosion in genome-wide studies in chromatin and gene regulation, we are still far from extracting a set of genetic rules that can predict the function of the regulatory genome. One major reason for this deficiency is that gene regulation is a multi-layered process that involves an enormous variable space, which cannot be fully explored using native genomes. This problem can be partially solved by introducing synthetic DNA libraries into cells, a method that can test the regulatory roles of thousands to millions of sequences with limited variables. Here, we review recent applications of this method to study transcription factor (TF) binding, nucleosome positioning, and transcriptional activity. We discuss the design principles, experimental procedures, and major findings from these studies and compare the pros and cons of different approaches.
Affiliation(s)
- Holly Kleinschmidt
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, 16802, USA
- Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA, 16802, USA
| | - Cheng Xu
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, 16802, USA
- Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA, 16802, USA
| | - Lu Bai
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA, 16802, USA.
- Center for Eukaryotic Gene Regulation, The Pennsylvania State University, University Park, PA, 16802, USA.
- Department of Physics, The Pennsylvania State University, University Park, PA, 16802, USA.
| |
|
17
|
Nowling RJ, Njoya K, Peters JG, Riehle MM. Prediction accuracy of regulatory elements from sequence varies by functional sequencing technique. Front Cell Infect Microbiol 2023; 13:1182567. [PMID: 37600946 PMCID: PMC10433755 DOI: 10.3389/fcimb.2023.1182567] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 07/10/2023] [Indexed: 08/22/2023] Open
Abstract
Introduction Various sequencing-based approaches are used to identify and characterize the activities of cis-regulatory elements in a genome-wide fashion. Some of these techniques rely on indirect markers such as histone modifications (ChIP-seq with histone antibodies) or chromatin accessibility (ATAC-seq, DNase-seq, FAIRE-seq), while other techniques use direct measures such as episomal assays measuring the enhancer properties of DNA sequences (STARR-seq) and direct measurement of the binding of transcription factors (ChIP-seq with transcription factor-specific antibodies). The activities of cis-regulatory elements such as enhancers, promoters, and repressors are determined by their sequence and secondary processes such as chromatin accessibility, DNA methylation, and bound histone markers. Methods Here, machine learning models are employed to evaluate the accuracy with which cis-regulatory elements identified by various commonly used sequencing techniques can be predicted by their underlying sequence alone to distinguish between cis-regulatory activity that is reflective of sequence content versus secondary processes. Results and discussion Models trained and evaluated on D. melanogaster sequences identified through DNase-seq and STARR-seq are significantly more accurate than models trained on sequences identified by H3K4me1, H3K4me3, and H3K27ac ChIP-seq, FAIRE-seq, and ATAC-seq. These results suggest that the activity detected by DNase-seq and STARR-seq can be largely explained by underlying DNA sequence, independent of secondary processes. Experimentally, a subset of DNase-seq and H3K4me1 ChIP-seq sequences were tested for enhancer activity using luciferase assays and compared with previous tests performed on STARR-seq sequences. The experimental data indicated that STARR-seq sequences are substantially enriched for enhancer-specific activity, while the DNase-seq and H3K4me1 ChIP-seq sequences are not.
Taken together, these results indicate that the DNase-seq approach identifies a broad class of regulatory elements of which enhancers are a subset and the associated data are appropriate for training models for detecting regulatory activity from sequence alone, STARR-seq data are best for training enhancer-specific sequence models, and H3K4me1 ChIP-seq data are not well suited for training and evaluating sequence-based models for cis-regulatory element prediction.
Affiliation(s)
- Ronald J. Nowling
- Electrical Engineering and Computer Science, Milwaukee School of Engineering, Milwaukee, WI, United States
| | - Kimani Njoya
- Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States
| | - John G. Peters
- Electrical Engineering and Computer Science, Milwaukee School of Engineering, Milwaukee, WI, United States
| | - Michelle M. Riehle
- Department of Microbiology and Immunology, Medical College of Wisconsin, Milwaukee, WI, United States
| |
|
18
|
Ma W, Fu Y, Bao Y, Wang Z, Lei B, Zheng W, Wang C, Liu Y. DeepSATA: A Deep Learning-Based Sequence Analyzer Incorporating the Transcription Factor Binding Affinity to Dissect the Effects of Non-Coding Genetic Variants. Int J Mol Sci 2023; 24:12023. [PMID: 37569400 PMCID: PMC10418434 DOI: 10.3390/ijms241512023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Revised: 07/13/2023] [Accepted: 07/24/2023] [Indexed: 08/13/2023] Open
Abstract
Utilizing large-scale epigenomics data, deep learning tools can predict the regulatory activity of genomic sequences, annotate non-coding genetic variants, and uncover mechanisms behind complex traits. However, these tools primarily rely on human or mouse data for training, limiting their performance when applied to other species. Furthermore, the limited exploration of many species, particularly in the case of livestock, has led to a scarcity of comprehensive and high-quality epigenetic data, posing challenges in developing reliable deep learning models for decoding their non-coding genomes. The cross-species prediction of the regulatory genome can be achieved by leveraging publicly available data from extensively studied organisms and making use of the conserved DNA binding preferences of transcription factors within the same tissue. In this study, we introduced DeepSATA, a novel deep learning-based sequence analyzer that incorporates the transcription factor binding affinity for the cross-species prediction of chromatin accessibility. By applying DeepSATA to analyze the genomes of pigs, chickens, cattle, humans, and mice, we demonstrated its ability to improve the prediction accuracy of chromatin accessibility and achieve reliable cross-species predictions in animals. Additionally, we showcased its effectiveness in analyzing pig genetic variants associated with economic traits and in increasing the accuracy of genomic predictions. Overall, our study presents a valuable tool to explore the epigenomic landscape of various species and pinpoint regulatory deoxyribonucleic acid (DNA) variants associated with complex traits.
Affiliation(s)
- Wenlong Ma
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
| | - Yang Fu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
| | - Yongzhou Bao
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- School of Life Sciences, Henan University, Kaifeng 475004, China
| | - Zhen Wang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- School of Life Sciences, Henan University, Kaifeng 475004, China
| | - Bowen Lei
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Lab of Swine Genetics and Breeding of Ministry of Agriculture and Rural Affairs, Huazhong Agricultural University, Wuhan 430070, China
| | - Weigang Zheng
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Lab of Swine Genetics and Breeding of Ministry of Agriculture and Rural Affairs, Huazhong Agricultural University, Wuhan 430070, China
| | - Chao Wang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Lab of Swine Genetics and Breeding of Ministry of Agriculture and Rural Affairs, Huazhong Agricultural University, Wuhan 430070, China
| | - Yuwen Liu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Foshan 528226, China
| |
|
19
|
Khodursky S, Zheng EB, Svetec N, Durkin SM, Benjamin S, Gadau A, Wu X, Zhao L. The evolution and mutational robustness of chromatin accessibility in Drosophila. bioRxiv 2023:2023.06.26.546587. [PMID: 37425760 PMCID: PMC10327059 DOI: 10.1101/2023.06.26.546587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/11/2023]
Abstract
The evolution of regulatory regions in the genome plays a critical role in shaping the diversity of life. While this process is primarily sequence-dependent, the enormous complexity of biological systems has made it difficult to understand the factors underlying regulation and its evolution. Here, we apply deep neural networks as a tool to investigate the sequence determinants underlying chromatin accessibility in different tissues of Drosophila. We train hybrid convolution-attention neural networks to accurately predict ATAC-seq peaks using only local DNA sequences as input. We show that a model trained in one species has nearly identical performance when tested in another species, implying that the sequence determinants of accessibility are highly conserved. Indeed, model performance remains excellent even in distantly-related species. By using our model to examine species-specific gains in chromatin accessibility, we find that their orthologous inaccessible regions in other species have surprisingly similar model outputs, suggesting that these regions may be ancestrally poised for evolution. We then use in silico saturation mutagenesis to reveal evidence of selective constraint acting specifically on inaccessible chromatin regions. We further show that chromatin accessibility can be accurately predicted from short subsequences in each example. However, in silico knock-out of these sequences does not qualitatively impair classification, implying that chromatin accessibility is mutationally robust. Subsequently, we demonstrate that chromatin accessibility is predicted to be robust to large-scale random mutation even in the absence of selection. We also perform in silico evolution experiments under the regime of strong selection and weak mutation (SSWM) and show that chromatin accessibility can be extremely malleable despite its mutational robustness. However, selection acting in different directions in a tissue-specific manner can substantially slow adaptation. 
Finally, we identify motifs predictive of chromatin accessibility and recover motifs corresponding to known chromatin accessibility activators and repressors. These results demonstrate the conservation of the sequence determinants of accessibility and the general robustness of chromatin accessibility, as well as the power of deep neural networks as tools to answer fundamental questions in regulatory genomics and evolution.
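As an illustration of the in silico saturation mutagenesis mentioned in this abstract, the sketch below substitutes every position of a sequence with each alternative base and records the change in a model score. It is a minimal sketch, not the authors' implementation: `toy_score` is a hypothetical stand-in for a trained network (it only counts occurrences of an arbitrary `GATA` motif), and the function names and example sequence are assumptions made for illustration.

```python
from typing import Callable, Dict, List

BASES = "ACGT"


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained model: counts 'GATA' occurrences."""
    motif = "GATA"
    last_start = len(seq) - len(motif) + 1
    return float(sum(seq[i:i + len(motif)] == motif for i in range(last_start)))


def saturation_mutagenesis(
    seq: str, score: Callable[[str], float]
) -> List[Dict[str, float]]:
    """Return, for each position, the score change caused by each substitution."""
    reference = score(seq)
    effects: List[Dict[str, float]] = []
    for i, ref_base in enumerate(seq):
        row: Dict[str, float] = {}
        for alt in BASES:
            if alt == ref_base:
                continue  # skip the unmutated base
            mutant = seq[:i] + alt + seq[i + 1:]
            row[alt] = score(mutant) - reference
        effects.append(row)
    return effects


# Hypothetical example sequence containing one copy of the toy motif.
effects = saturation_mutagenesis("TTGATACC", toy_score)
```

With a real model, `score` would wrap the network's forward pass on a one-hot encoded sequence; positions where most substitutions produce large score changes are the ones the model treats as important.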
Affiliation(s)
- Samuel Khodursky
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
  - These authors contributed equally
- Eric B Zheng
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
  - These authors contributed equally
- Nicolas Svetec
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
- Sylvia M Durkin
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
  - Current Address: Department of Integrative Biology and Museum of Vertebrate Zoology, University of California, Berkeley, Berkeley, CA, USA
- Sigi Benjamin
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
- Alice Gadau
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
- Xia Wu
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
- Li Zhao
  - Laboratory of Evolutionary Genetics and Genomics, The Rockefeller University, New York, NY 10065, USA
|
20
|
Smith GD, Ching WH, Cornejo-Páramo P, Wong ES. Decoding enhancer complexity with machine learning and high-throughput discovery. Genome Biol 2023; 24:116. [PMID: 37173718 PMCID: PMC10176946 DOI: 10.1186/s13059-023-02955-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 07/18/2022] [Accepted: 04/28/2023] [Indexed: 05/15/2023] Open
Abstract
Enhancers are genomic DNA elements controlling spatiotemporal gene expression. Their flexible organization and functional redundancies make deciphering their sequence-function relationships challenging. This article provides an overview of the current understanding of enhancer organization and evolution, with an emphasis on factors that influence these relationships. Technological advancements, particularly in machine learning and synthetic biology, are discussed in light of how they provide new ways to understand this complexity. Exciting opportunities lie ahead as we continue to unravel the intricacies of enhancer function.
Affiliation(s)
- Gabrielle D Smith
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
  - School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
- Wan Hern Ching
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
- Paola Cornejo-Páramo
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia
  - School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia
- Emily S Wong
  - Victor Chang Cardiac Research Institute, 405 Liverpool Street, Darlinghurst, NSW, Australia.
  - School of Biotechnology and Biomolecular Sciences, UNSW Sydney, Kensington, NSW, Australia.
|
21
|
Kaplow IM, Lawler AJ, Schäffer DE, Srinivasan C, Sestili HH, Wirthlin ME, Phan BN, Prasad K, Brown AR, Zhang X, Foley K, Genereux DP, Zoonomia Consortium, Karlsson EK, Lindblad-Toh K, Meyer WK, Pfenning AR. Relating enhancer genetic variation across mammals to complex phenotypes using machine learning. Science 2023; 380:eabm7993. [PMID: 37104615 PMCID: PMC10322212 DOI: 10.1126/science.abm7993] [Citation(s) in RCA: 22] [Impact Index Per Article: 11.0] [Received: 10/12/2021] [Accepted: 02/23/2023] [Indexed: 04/29/2023]
Abstract
Protein-coding differences between species often fail to explain phenotypic diversity, suggesting the involvement of genomic elements that regulate gene expression such as enhancers. Identifying associations between enhancers and phenotypes is challenging because enhancer activity can be tissue-dependent and functionally conserved despite low sequence conservation. We developed the Tissue-Aware Conservation Inference Toolkit (TACIT) to associate candidate enhancers with species' phenotypes using predictions from machine learning models trained on specific tissues. Applying TACIT to associate motor cortex and parvalbumin-positive interneuron enhancers with neurological phenotypes revealed dozens of enhancer-phenotype associations, including brain size-associated enhancers that interact with genes implicated in microcephaly or macrocephaly. TACIT provides a foundation for identifying enhancers associated with the evolution of any convergently evolved phenotype in any large group of species with aligned genomes.
Affiliation(s)
- Irene M. Kaplow
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Alyssa J. Lawler
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Daniel E. Schäffer
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Chaitanya Srinivasan
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Heather H. Sestili
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Morgan E. Wirthlin
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- BaDoi N. Phan
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
  - Medical Scientist Training Program, University of Pittsburgh School of Medicine, Pittsburgh, PA, USA
- Kavya Prasad
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Ashley R. Brown
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Xiaomeng Zhang
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Kathleen Foley
  - Department of Biological Sciences, Lehigh University, Bethlehem, PA, USA
- Diane P. Genereux
  - Broad Institute, Cambridge, MA, USA
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Elinor K. Karlsson
  - Broad Institute, Cambridge, MA, USA
  - Program in Bioinformatics and Integrative Biology, University of Massachusetts Chan Medical School, Worcester, MA, USA
- Kerstin Lindblad-Toh
  - Broad Institute, Cambridge, MA, USA
  - Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, Uppsala, Sweden
- Wynn K. Meyer
  - Department of Biological Sciences, Lehigh University, Bethlehem, PA, USA
- Andreas R. Pfenning
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Biology, Carnegie Mellon University, Pittsburgh, PA, USA
|
22
|
Kravchuk EV, Ashniev GA, Gladkova MG, Orlov AV, Vasileva AV, Boldyreva AV, Burenin AG, Skirda AM, Nikitin PI, Orlova NN. Experimental Validation and Prediction of Super-Enhancers: Advances and Challenges. Cells 2023; 12:cells12081191. [PMID: 37190100 DOI: 10.3390/cells12081191] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 03/07/2023] [Revised: 04/07/2023] [Accepted: 04/14/2023] [Indexed: 05/17/2023] Open
Abstract
Super-enhancers (SEs) are cis-regulatory elements of the human genome that have been widely discussed since the term was introduced. Super-enhancers have been shown to be strongly associated with the expression of genes crucial for cell differentiation, cell stability maintenance, and tumorigenesis. Our goal was to systematize research studies dedicated to the investigation of the structure and functions of super-enhancers, as well as to define further perspectives of the field in various applications, such as drug development and clinical use. We reviewed the fundamental studies that provided experimental data on various pathologies and their associations with particular super-enhancers. The analysis of mainstream approaches for SE search and prediction allowed us to accumulate existing data and propose directions for further algorithmic improvements of SEs' reliability levels and efficiency. Thus, here we describe the most robust algorithms, such as ROSE, imPROSE, and DEEPSEN, and suggest their further use for various research and development tasks. The most promising research directions, judged by topic and number of published studies, are cancer-associated super-enhancers and prospective SE-targeted therapy strategies, most of which are discussed in this review.
Affiliation(s)
- Ekaterina V Kravchuk
  - Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
  - Faculty of Biology, Lomonosov Moscow State University, Leninskiye Gory, MSU, 1-12, 119991 Moscow, Russia
- German A Ashniev
  - Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
  - Faculty of Biology, Lomonosov Moscow State University, Leninskiye Gory, MSU, 1-12, 119991 Moscow, Russia
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, GSP-1, Leninskiye Gory, MSU, 1-73, 119234 Moscow, Russia
- Marina G Gladkova
  - Faculty of Bioengineering and Bioinformatics, Lomonosov Moscow State University, GSP-1, Leninskiye Gory, MSU, 1-73, 119234 Moscow, Russia
- Alexey V Orlov
  - Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
- Anastasiia V Vasileva
  - Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
- Anna V Boldyreva
  - Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
- Alexandr G Burenin
  - Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
- Artemiy M Skirda
  - Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
- Petr I Nikitin
  - Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
- Natalia N Orlova
  - Prokhorov General Physics Institute of the Russian Academy of Sciences, 38 Vavilov St., 119991 Moscow, Russia
|
23
|
Latyshev P, Pavlov F, Herbert A, Poptsova M. Unsupervised domain adaptation methods for cross-species transfer of regulatory code signals. Front Big Data 2023; 6:1140663. [PMID: 37063486 PMCID: PMC10101332 DOI: 10.3389/fdata.2023.1140663] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Accepted: 03/14/2023] [Indexed: 04/03/2023] Open
Abstract
Advances in NGS technologies have produced whole-genome maps of various functional genomic elements for about a dozen species; however, experiments are still expensive and are not available for many species of interest. Deep learning methods have become the state-of-the-art computational methods for analyzing the available data, but the focus is often only on the species studied. Here we take advantage of progress in transfer learning in the area of Unsupervised Domain Adaptation (UDA) and test nine UDA methods for predicting regulatory code signals in the genomes of other species. We tested each deep learning implementation by training the model on experimental data from one species, then refining the model using the genome sequence of the target species for which we wanted to make predictions. Among the nine tested domain adaptation architectures, the non-adversarial methods Minimum Class Confusion (MCC) and Deep Adaptation Network (DAN) significantly outperformed the others. Conditional Domain Adversarial Network (CDAN) was the third-best architecture. Here we provide an empirical assessment of each approach using real-world data. The approaches were tested on ChIP-seq data for transcription factor binding sites and histone marks in the human and mouse genomes, but they are generalizable to any cross-species transfer of interest. We tested the efficiency of each method using species for which experimental data were available for both. The results allow us to assess how well each implementation will work for species for which only limited experimental data are available and will inform the design of future experiments in these understudied organisms. Overall, our results demonstrated the validity of UDA methods for generating missing experimental data for histone marks and transcription factor binding sites in various genomes and highlight how robust the various approaches are to data that are incomplete, noisy and susceptible to analytic bias.
Affiliation(s)
- Pavel Latyshev
  - Laboratory of Bioinformatics, Faculty of Computer Science, HSE University, Moscow, Russia
- Fedor Pavlov
  - Laboratory of Bioinformatics, Faculty of Computer Science, HSE University, Moscow, Russia
- Alan Herbert
  - Laboratory of Bioinformatics, Faculty of Computer Science, HSE University, Moscow, Russia
  - InsideOutBio, Charlestown, MA, United States
- Maria Poptsova
  - Laboratory of Bioinformatics, Faculty of Computer Science, HSE University, Moscow, Russia
|
24
|
Yang TH, Yu YH, Wu SH, Zhang FY. CFA: An explainable deep learning model for annotating the transcriptional roles of cis-regulatory modules based on epigenetic codes. Comput Biol Med 2023; 152:106375. [PMID: 36502693 DOI: 10.1016/j.compbiomed.2022.106375] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 06/23/2022] [Revised: 11/07/2022] [Accepted: 11/27/2022] [Indexed: 11/30/2022]
Abstract
Metazoan gene expression is controlled by modular DNA segments called cis-regulatory modules (CRMs). CRMs can convey promoter/enhancer/insulator roles, generating additional regulation layers in transcription. Experiments for understanding CRM roles are low-throughput and costly. Large-scale CRM function investigation still depends on computational methods. However, existing in silico tools recognize only enhancers or only promoters, and thus accumulate errors when CRM promoter/enhancer/insulator roles are considered together. Currently, no algorithm can concurrently consider these CRM roles. In this research, we developed the CRM Function Annotator (CFA) model. CFA provides complete CRM transcriptional role labeling based on epigenetic profiling interpretation. We demonstrated that CFA achieves high performance (test macro auROC/auPRC = 94.1%/90.3%) and outperforms existing tools in promoter/enhancer/insulator identification. Inspection of CFA also shows that it recognizes explainable epigenetic codes consistent with previous findings when labeling CRM roles. By considering the higher-order combinations of the epigenetic codes, CFA significantly reduces false-positive rates in CRM transcriptional role annotation. CFA is available at https://github.com/cobisLab/CFA/.
Affiliation(s)
- Tzu-Hsien Yang
  - Department of Biomedical Engineering, National Cheng Kung University, No. 1, University Road, Tainan 701, Taiwan.
- Yu-Huai Yu
  - Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan.
- Sheng-Hang Wu
  - Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan.
- Fang-Yuan Zhang
  - Department of Information Management, National University of Kaohsiung, Kaohsiung University Rd, 811 Kaohsiung, Taiwan.
|
25
|
Wrightsman T, Marand AP, Crisp PA, Springer NM, Buckler ES. Modeling chromatin state from sequence across angiosperms using recurrent convolutional neural networks. Plant Genome 2022; 15:e20249. [PMID: 35924336 DOI: 10.1002/tpg2.20249] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/08/2021] [Accepted: 06/20/2022] [Indexed: 06/06/2024]
Abstract
Accessible chromatin regions are critical components of gene regulation but modeling them directly from sequence remains challenging, especially within plants, whose mechanisms of chromatin remodeling are less understood than in animals. We trained an existing deep-learning architecture, DanQ, on data from 12 angiosperm species to predict the chromatin accessibility in leaf of sequence windows within and across species. We also trained DanQ on DNA methylation data from 10 angiosperms because unmethylated regions have been shown to overlap significantly with ACRs in some plants. The across-species models have comparable or even superior performance to a model trained within species, suggesting strong conservation of chromatin mechanisms across angiosperms. Testing a maize (Zea mays L.) held-out model on a multi-tissue chromatin accessibility panel revealed our models are best at predicting constitutively accessible chromatin regions, with diminishing performance as cell-type specificity increases. Using a combination of interpretation methods, we ranked JASPAR motifs by their importance to each model and saw that the TCP and AP2/ERF transcription factor (TF) families consistently ranked highly. We embedded the top three JASPAR motifs for each model at all possible positions on both strands in our sequence window and observed position- and strand-specific patterns in their importance to the model. With our publicly available across-species 'a2z' model it is now feasible to predict the chromatin accessibility and methylation landscape of any angiosperm genome.
Affiliation(s)
- Travis Wrightsman
  - Section of Plant Breeding and Genetics, Cornell Univ., Ithaca, NY, 14853, USA
- Peter A Crisp
  - School of Agriculture and Food Sciences, Univ. of Queensland, Brisbane, QLD, 4072, Australia
- Nathan M Springer
  - Dep. of Plant and Microbial Biology, Univ. of Minnesota, Saint Paul, MN, 55108, USA
- Edward S Buckler
  - Section of Plant Breeding and Genetics, Cornell Univ., Ithaca, NY, 14853, USA
  - Institute for Genomic Diversity, Cornell Univ., Ithaca, NY, 14853, USA
  - USDA-ARS, Ithaca, NY, 14853, USA
|
26
|
Cross-species enhancer prediction using machine learning. Genomics 2022; 114:110454. [PMID: 36030022 DOI: 10.1016/j.ygeno.2022.110454] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 05/30/2022] [Revised: 07/28/2022] [Accepted: 08/16/2022] [Indexed: 11/21/2022]
Abstract
Cis-regulatory elements (CREs) are non-coding parts of the genome that play a critical role in gene expression regulation. Enhancers, as an important example of CREs, interact with genes to influence complex traits like disease, heat tolerance and growth rate. Much of what is known about enhancers comes from studies of humans and a few model organisms like mouse, with little known about other mammalian species. Previous studies have attempted to identify enhancers in less studied mammals using comparative genomics, but with limited success. Recently, machine learning (ML) techniques have shown promising results in predicting enhancer regions. Here, we investigated the ability of ML methods to identify enhancers in three non-model mammalian species (cattle, pig and dog) using human and mouse enhancer data from VISTA and publicly available ChIP-seq. We tested nine models, using four different representations of the DNA sequences, in cross-species prediction using both the VISTA dataset and species-specific ChIP-seq data. We identified between 809,399 and 877,278 enhancer-like regions (ELRs) in the study species (11.6-13.7% of each genome). These predictions were close to the ~8% of the human genome covered by ELRs. We propose that our ML methods have predictive ability for identifying enhancers in non-model mammalian species. We have provided a list of high-confidence enhancers at https://github.com/DaviesCentreInformatics/Cross-species-enhancer-prediction and believe these enhancers will be of great use to the community.
|
27
|
Song W, Ovcharenko I. Heterogeneity of enhancers embodies shared and representative functional groups underlying developmental and cell type-specific gene regulation. Gene 2022; 834:146640. [PMID: 35680026 PMCID: PMC9235925 DOI: 10.1016/j.gene.2022.146640] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/10/2022] [Revised: 04/20/2022] [Accepted: 06/02/2022] [Indexed: 11/04/2022]
Abstract
While enhancers in a particular tissue coordinately fulfill regulatory functions, these functions are heterogeneous in nature and comprise multiple enhancer subclasses and the associated regulatory mechanisms. In this work, we used multiple cell lines to identify enhancer subclasses linked to development, differentiation, and cellular identity. We found that enhancer functional heterogeneity during development encompasses subclasses of ubiquitous functions (11%), development-specific regulatory activity (62%), and chromatin interactions (12%). In differentiated cell lines, ubiquitous enhancers (10%) stay active across multiple cell lines. They are accompanied by a large enhancer subclass (ranging from 33% to 63%) with functions specific to the corresponding lineage. The remaining enhancers (27-40%) establish regulatory chromatin structure and facilitate interactions of cell type-specific enhancers with their target promoters. In addition to specialized functions of cell type-specific enhancers, we show that proper accounting of enhancer heterogeneity leads to a 10% increase in accuracy of enhancer classification, which significantly improves the modeling of enhancers and identification of underlying regulatory mechanisms. In summary, our observations suggest that although cell type-specific enhancers are heterogeneous and coordinate different regulatory programs, enhancers from different cell lines maintain common categories of functional groups across developmental and differentiation stages, indicating a higher-order rule followed by enhancer-gene regulation.
Affiliation(s)
- Wei Song
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
- Ivan Ovcharenko
  - Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
|
28
|
Pizzollo J, Zintel TM, Babbitt CC. Differentially Active and Conserved Neural Enhancers Define Two Forms of Adaptive Noncoding Evolution in Humans. Genome Biol Evol 2022; 14:evac108. [PMID: 35866592 PMCID: PMC9348619 DOI: 10.1093/gbe/evac108] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Accepted: 07/11/2022] [Indexed: 11/28/2022] Open
Abstract
The human and chimpanzee genomes are strikingly similar, but our neural phenotypes are very different. Many of these differences are likely driven by changes in gene expression, and some of those changes may have been adaptive during human evolution. Yet, the relative contributions of positive selection on regulatory regions or other functional regulatory changes are unclear. Where are these changes located throughout the human genome? Are functional regulatory changes near genes or are they in distal enhancer regions? In this study, we experimentally combined both human and chimpanzee cis-regulatory elements (CREs) that either (1) showed signs of accelerated evolution in humans or (2) have been shown to be active in the human brain. Using a massively parallel reporter assay, we tested the ability of orthologous human and chimpanzee CREs to activate transcription in induced pluripotent stem-cell-derived neural progenitor cells and neurons. With this assay, we identified 179 CREs with differential activity between human and chimpanzee; in contrast, we found 722 CREs with signs of positive selection in humans. Selection and differentially expressed CREs strikingly differ in level of expression, size, and genomic location. We found a subset of 69 CREs in loci with genetic variants associated with neuropsychiatric diseases, which underscores the consequence of regulatory activity in these loci for proper neural development and function. Combining CREs that either experienced recent selection in humans or are functional brain enhancers presents a novel way of studying the evolution of noncoding elements that contribute to human neural phenotypes.
Affiliation(s)
- Jason Pizzollo
  - Molecular and Cellular Biology Graduate Program, University of Massachusetts Amherst, Amherst, MA 01003, USA
  - Department of Biology, University of Massachusetts Amherst, Amherst, MA 01003, USA
- Trisha M Zintel
  - Molecular and Cellular Biology Graduate Program, University of Massachusetts Amherst, Amherst, MA 01003, USA
  - Department of Biology, University of Massachusetts Amherst, Amherst, MA 01003, USA
- Courtney C Babbitt
  - Department of Biology, University of Massachusetts Amherst, Amherst, MA 01003, USA
|
29
|
Long K, Li X, Su D, Zeng S, Li H, Zhang Y, Zhang B, Yang W, Li P, Li X, Wang X, Tang Q, Lu L, Jin L, Ma J, Li M. Exploring high-resolution chromatin interaction changes and functional enhancers of myogenic marker genes during myogenic differentiation. J Biol Chem 2022; 298:102149. [PMID: 35787372 PMCID: PMC9352921 DOI: 10.1016/j.jbc.2022.102149] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/07/2022] [Revised: 06/07/2022] [Accepted: 06/09/2022] [Indexed: 11/25/2022] Open
Abstract
Skeletal muscle differentiation (myogenesis) is a complex and highly coordinated biological process regulated by a series of myogenic marker genes. Chromatin interactions between gene promoters and their enhancers have an important role in transcriptional control. However, the high-resolution chromatin interactions of myogenic genes and their functional enhancers during myogenesis remain largely unclear. Here, we used circularized chromosome conformation capture coupled with next-generation sequencing (4C-seq) to investigate eight myogenic marker genes in C2C12 myoblasts (C2C12-MBs) and C2C12 myotubes (C2C12-MTs). We revealed dynamic chromatin interactions of these marker genes during differentiation and identified 163 and 314 significant interaction sites (SISs) in C2C12-MBs and C2C12-MTs, respectively. The interacting genes of SISs in C2C12-MTs were mainly involved in muscle development, and histone modifications of the SISs changed during differentiation. Through functional genomic screening, we also identified 25 and 41 putative active enhancers in C2C12-MBs and C2C12-MTs, respectively. Using luciferase reporter assays for putative enhancers of Myog and Myh3, we identified eight activating enhancers. Furthermore, dCas9-KRAB epigenome editing and RNA-Seq revealed a role for Myog enhancers in the regulation of Myog expression and myogenic differentiation in the native genomic context. Taken together, this study lays the groundwork for understanding 3D chromatin interaction changes of myogenic genes during myogenesis and provides insights that contribute to our understanding of the role of enhancers in regulating myogenesis.
Affiliation(s)
- Keren Long
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Xiaokai Li
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Duo Su
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Sha Zeng
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Hengkuan Li
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Yu Zhang
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Biwei Zhang
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Wenying Yang
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Penghao Li
  - Jinxin Research Institute for Reproductive Medicine and Genetics, Chengdu Xi'nan Gynecology Hospital Co, Ltd, Chengdu, Sichuan, China
- Xuemin Li
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Xun Wang
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Qianzi Tang
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Lu Lu
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Long Jin
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Jideng Ma
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China
- Mingzhou Li
  - Institute of Animal Genetics and Breeding, College of Animal Science and Technology, Sichuan Agricultural University, Chengdu, China.
|
30
|
Abstract
The tremendous amount of biological sequence data available, combined with the recent methodological breakthrough in deep learning in domains such as computer vision or natural language processing, is leading today to the transformation of bioinformatics through the emergence of deep genomics, the application of deep learning to genomic sequences. We review here the new applications that the use of deep learning enables in the field, focusing on three aspects: the functional annotation of genomes, the sequence determinants of the genome functions and the possibility to write synthetic genomic sequences.
|
31
|
Lawler AJ, Ramamurthy E, Brown AR, Shin N, Kim Y, Toong N, Kaplow IM, Wirthlin M, Zhang X, Phan BN, Fox GA, Wade K, He J, Ozturk BE, Byrne LC, Stauffer WR, Fish KN, Pfenning AR. Machine learning sequence prioritization for cell type-specific enhancer design. eLife 2022; 11:e69571. [PMID: 35576146 PMCID: PMC9110026 DOI: 10.7554/elife.69571] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 04/20/2021] [Accepted: 04/25/2022] [Indexed: 11/22/2022]
Abstract
Recent discoveries of extreme cellular diversity in the brain warrant rapid development of technologies to access specific cell populations within heterogeneous tissue. Available approaches for engineering-targeted technologies for new neuron subtypes are low yield, involving intensive transgenic strain or virus screening. Here, we present Specific Nuclear-Anchored Independent Labeling (SNAIL), an improved virus-based strategy for cell labeling and nuclear isolation from heterogeneous tissue. SNAIL works by leveraging machine learning and other computational approaches to identify DNA sequence features that confer cell type-specific gene activation and then make a probe that drives an affinity purification-compatible reporter gene. As a proof of concept, we designed and validated two novel SNAIL probes that target parvalbumin-expressing (PV+) neurons. Nuclear isolation using SNAIL in wild-type mice is sufficient to capture characteristic open chromatin features of PV+ neurons in the cortex, striatum, and external globus pallidus. The SNAIL framework also has high utility for multispecies cell probe engineering; expression from a mouse PV+ SNAIL enhancer sequence was enriched in PV+ neurons of the macaque cortex. Expansion of this technology has broad applications in cell type-specific observation, manipulation, and therapeutics across species and disease models.
Affiliation(s)
- Alyssa J Lawler
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Biological Sciences Department, Mellon College of Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Easwaran Ramamurthy
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Ashley R Brown
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Naomi Shin
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Yeonju Kim
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Noelle Toong
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Irene M Kaplow
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Morgan Wirthlin
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Xiaoyu Zhang
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - BaDoi N Phan
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
- Medical Scientist Training Program, University of Pittsburgh, Pittsburgh, United States
| | - Grant A Fox
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| | - Kirsten Wade
- Department of Psychiatry, Translational Neuroscience Program, University of Pittsburgh, Pittsburgh, United States
| | - Jing He
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, United States
- Systems Neuroscience Center, Brain Institute, Center for Neuroscience, Center for the Neural Basis of Cognition, Pittsburgh, United States
| | - Bilge Esin Ozturk
- Department of Ophthalmology, University of Pittsburgh, Pittsburgh, United States
| | - Leah C Byrne
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, United States
- Department of Ophthalmology, University of Pittsburgh, Pittsburgh, United States
- Division of Experimental Retinal Therapies, Department of Clinical Sciences & Advanced Medicine, School of Veterinary Medicine, University of Pennsylvania, Philadelphia, United States
- Department of Bioengineering, University of Pittsburgh, Pittsburgh, United States
| | - William R Stauffer
- Department of Neurobiology, University of Pittsburgh, Pittsburgh, United States
| | - Kenneth N Fish
- Department of Psychiatry, Translational Neuroscience Program, University of Pittsburgh, Pittsburgh, United States
| | - Andreas R Pfenning
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, United States
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, United States
| |
|
32
|
Kaplow IM, Schäffer DE, Wirthlin ME, Lawler AJ, Brown AR, Kleyman M, Pfenning AR. Inferring mammalian tissue-specific regulatory conservation by predicting tissue-specific differences in open chromatin. BMC Genomics 2022; 23:291. [PMID: 35410163 PMCID: PMC8996547 DOI: 10.1186/s12864-022-08450-7] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 08/17/2021] [Accepted: 03/07/2022] [Indexed: 11/10/2022]
Abstract
BACKGROUND: Evolutionary conservation is an invaluable tool for inferring functional significance in the genome, including regions that are crucial across many species and those that have undergone convergent evolution. Computational methods to test for sequence conservation are dominated by algorithms that examine the ability of one or more nucleotides to align across large evolutionary distances. While these nucleotide alignment-based approaches have proven powerful for protein-coding genes and some non-coding elements, they fail to capture conservation of many enhancers, distal regulatory elements that control spatial and temporal patterns of gene expression. The function of enhancers is governed by a complex, often tissue- and cell type-specific code that links combinations of transcription factor binding sites and other regulation-related sequence patterns to regulatory activity. Thus, function of orthologous enhancer regions can be conserved across large evolutionary distances, even when nucleotide turnover is high. RESULTS: We present a new machine learning-based approach for evaluating enhancer conservation that leverages the combinatorial sequence code of enhancer activity rather than relying on the alignment of individual nucleotides. We first train a convolutional neural network model that can predict tissue-specific open chromatin, a proxy for enhancer activity, across mammals. Next, we apply that model to distinguish instances where the genome sequence would predict conserved function versus a loss of regulatory activity in that tissue. We present criteria for systematically evaluating model performance for this task and use them to demonstrate that our models accurately predict tissue-specific conservation and divergence in open chromatin between primate and rodent species, vastly outperforming leading nucleotide alignment-based approaches. We then apply our models to predict open chromatin at orthologs of brain and liver open chromatin regions across hundreds of mammals and find that brain enhancers associated with neuron activity have a stronger tendency than the general population to have predicted lineage-specific open chromatin. CONCLUSION: The framework presented here provides a mechanism to annotate tissue-specific regulatory function across hundreds of genomes and to study enhancer evolution using predicted regulatory differences rather than nucleotide-level conservation measurements.
Affiliation(s)
- Irene M Kaplow
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA.
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA.
| | - Daniel E Schäffer
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Morgan E Wirthlin
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Alyssa J Lawler
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Department of Biology, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Ashley R Brown
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Michael Kleyman
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Andreas R Pfenning
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA.
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA.
- Department of Biology, Carnegie Mellon University, Pittsburgh, PA, USA.
| |
|
33
|
Li H, Guan Y. Asymmetric predictive relationships across histone modifications. Nat Mach Intell 2022; 4:288-299. [DOI: 10.1038/s42256-022-00455-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
|
34
|
Yocca AE, Edger PP. Machine learning approaches to identify core and dispensable genes in pangenomes. Plant Genome 2022; 15:e20135. [PMID: 34533282 DOI: 10.1002/tpg2.20135] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/18/2021] [Accepted: 06/16/2021] [Indexed: 05/25/2023]
Abstract
A gene in a given taxonomic group is either present in every individual (core) or absent in at least a single individual (dispensable). Previous pangenomic studies have identified certain functional differences between core and dispensable genes. However, identifying if a gene belongs to the core or dispensable portion of the genome requires the construction of a pangenome, which involves sequencing the genomes of many individuals. Here we aim to leverage the previously characterized core and dispensable gene content for two grass species [Brachypodium distachyon (L.) P. Beauv. and Oryza sativa L.] to construct a machine learning model capable of accurately classifying genes as core or dispensable using only a single annotated reference genome. Such a model may mitigate the need for pangenome construction, an expensive hurdle especially in orphan crops, which often lack the adequate genomic resources.
Affiliation(s)
- Alan E Yocca
- Dep. of Plant Biology, Michigan State Univ., East Lansing, MI, 48824, USA
- Dep. of Horticulture, Michigan State Univ., East Lansing, MI, 48824, USA
| | - Patrick P Edger
- Dep. of Horticulture, Michigan State Univ., East Lansing, MI, 48824, USA
- Genetics and Genome Sciences Program, Michigan State Univ., East Lansing, MI, 48824, USA
| |
|
35
|
Jankovic B, Gojobori T. From shallow to deep: some lessons learned from application of machine learning for recognition of functional genomic elements in human genome. Hum Genomics 2022; 16:7. [PMID: 35180894 PMCID: PMC8855580 DOI: 10.1186/s40246-022-00376-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/26/2021] [Accepted: 01/02/2022] [Indexed: 11/25/2022]
Abstract
Identification of genomic signals as indicators for functional genomic elements is one of the areas that received early and widespread application of machine learning methods. With time, the methods applied grew in variety and generally exhibited a tendency to improve their ability to identify some major genomic and transcriptomics signals. The evolution of machine learning in genomics followed a similar path to applications of machine learning in other fields. These were impacted in a major way by three dominant developments, namely an enormous increase in availability and quality of data, a significant increase in computational power available to machine learning applications, and finally, new machine learning paradigms, of which deep learning is the most well-known example. It is not easy in general to distinguish factors leading to improvements in results of applications of machine learning. This is even more so in the field of genomics, where the advent of next-generation sequencing and the increased ability to perform functional analysis of raw data have had a major effect on the applicability of machine learning in OMICS fields. In this paper, we survey the results from a subset of published work in application of machine learning in the recognition of genomic signals and regions in human genome and summarize some lessons learnt from this endeavor. There is no doubt that significant progress has been made both in terms of accuracy and reliability of models. Questions remain, however, whether the progress has been sufficient and what these developments bring to the field of genomics in general and human genomics in particular. Improving usability, interpretability and accuracy of models remains an important open challenge for current and future research in application of machine learning and more generally of artificial intelligence methods in genomics.
Affiliation(s)
- Boris Jankovic
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
| | - Takashi Gojobori
- Computational Bioscience Research Center, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
- Division of Biological and Environmental Sciences and Engineering, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia.
| |
|
36
|
Cochran K, Srivastava D, Shrikumar A, Balsubramani A, Hardison RC, Kundaje A, Mahony S. Domain adaptive neural networks improve cross-species prediction of transcription factor binding. Genome Res 2022; 32:512-523. [PMID: 35042722 PMCID: PMC8896468 DOI: 10.1101/gr.275394.121] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 02/15/2021] [Accepted: 01/10/2022] [Indexed: 11/29/2022]
Abstract
The intrinsic DNA sequence preferences and cell type–specific cooperative partners of transcription factors (TFs) are typically highly conserved. Hence, despite the rapid evolutionary turnover of individual TF binding sites, predictive sequence models of cell type–specific genomic occupancy of a TF in one species should generalize to closely matched cell types in a related species. To assess the viability of cross-species TF binding prediction, we train neural networks to discriminate ChIP-seq peak locations from genomic background and evaluate their performance within and across species. Cross-species predictive performance is consistently worse than within-species performance, which we show is caused in part by species-specific repeats. To account for this domain shift, we use an augmented network architecture to automatically discourage learning of training species–specific sequence features. This domain adaptation approach corrects for prediction errors on species-specific repeats and improves overall cross-species model performance. Our results show that cross-species TF binding prediction is feasible when models account for domain shifts driven by species-specific repeats.
|
37
|
Claussnitzer M, Susztak K. Gaining insight into metabolic diseases from human genetic discoveries. Trends Genet 2021; 37:1081-1094. [PMID: 34315631 PMCID: PMC8578350 DOI: 10.1016/j.tig.2021.07.005] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 05/23/2021] [Revised: 06/29/2021] [Accepted: 07/05/2021] [Indexed: 12/30/2022]
Abstract
Human large-scale genetic association studies have identified sequence variations at thousands of genetic risk loci that are more common in patients with diverse metabolic disease compared with healthy controls. While these genetic associations have been replicated in multiple large cohorts and sometimes can explain up to 50% of heritability, the molecular and cellular mechanisms affected by common genetic variation associated with metabolic disease remain mostly unknown. A variety of new genome-wide data types, in conjunction with novel biostatistical and computational analytical methodologies and foundational experimental technologies, are paving the way for a principled approach to systematic variant-to-function (V2F) studies for metabolic diseases, turning associated regions into causal variants, cell types and states of action, effector genes, and cellular and physiological mechanisms. Identification of new target genes and cellular programs for metabolic risk loci will improve mechanistic understanding of disease biology and identification of novel therapeutic strategies.
Affiliation(s)
- Melina Claussnitzer
- Beth Israel Deaconess Medical Center, The Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
| | - Katalin Susztak
- Department of Medicine and Genetics, University of Pennsylvania, Philadelphia, PA 19104, USA.
| |
|
38
|
Srinivasan C, Phan BN, Lawler AJ, Ramamurthy E, Kleyman M, Brown AR, Kaplow IM, Wirthlin ME, Pfenning AR. Addiction-Associated Genetic Variants Implicate Brain Cell Type- and Region-Specific Cis-Regulatory Elements in Addiction Neurobiology. J Neurosci 2021; 41:9008-9030. [PMID: 34462306 PMCID: PMC8549541 DOI: 10.1523/jneurosci.2534-20.2021] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 09/28/2020] [Revised: 06/18/2021] [Accepted: 07/10/2021] [Indexed: 12/14/2022]
Abstract
Recent large genome-wide association studies have identified multiple confident risk loci linked to addiction-associated behavioral traits. Most genetic variants linked to addiction-associated traits lie in noncoding regions of the genome, likely disrupting cis-regulatory element (CRE) function. CREs tend to be highly cell type-specific and may contribute to the functional development of the neural circuits underlying addiction. Yet, a systematic approach for predicting the impact of risk variants on the CREs of specific cell populations is lacking. To dissect the cell types and brain regions underlying addiction-associated traits, we applied stratified linkage disequilibrium score regression to compare genome-wide association studies to genomic regions collected from human and mouse assays for open chromatin, which is associated with CRE activity. We found enrichment of addiction-associated variants in putative CREs marked by open chromatin in neuronal (NeuN+) nuclei collected from multiple prefrontal cortical areas and striatal regions known to play major roles in reward and addiction. To further dissect the cell type-specific basis of addiction-associated traits, we also identified enrichments in human orthologs of open chromatin regions of female and male mouse neuronal subtypes: cortical excitatory, D1, D2, and PV. Last, we developed machine learning models to predict mouse cell type-specific open chromatin, enabling us to further categorize human NeuN+ open chromatin regions into cortical excitatory or striatal D1 and D2 neurons and predict the functional impact of addiction-associated genetic variants. 
Our results suggest that different neuronal subtypes within the reward system play distinct roles in the variety of traits that contribute to addiction. SIGNIFICANCE STATEMENT: We combine statistical genetic and machine learning techniques to find that the predisposition for nicotine, alcohol, and cannabis use behaviors can be partially explained by genetic variants in conserved regulatory elements within specific brain regions and neuronal subtypes of the reward system. Our computational framework can flexibly integrate open chromatin data across species to screen for putative causal variants in a cell type- and tissue-specific manner for numerous complex traits.
Affiliation(s)
- Chaitanya Srinivasan
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - BaDoi N Phan
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
- Medical Scientist Training Program, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania 15213
| | - Alyssa J Lawler
- Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - Easwaran Ramamurthy
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - Michael Kleyman
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - Ashley R Brown
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - Irene M Kaplow
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - Morgan E Wirthlin
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| | - Andreas R Pfenning
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
- Department of Biological Sciences, Mellon College of Science, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
- Neuroscience Institute, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213
| |
|
39
|
Patel ZM, Hughes TR. Global properties of regulatory sequences are predicted by transcription factor recognition mechanisms. Genome Biol 2021; 22:285. [PMID: 34620190 PMCID: PMC8496038 DOI: 10.1186/s13059-021-02503-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/31/2020] [Accepted: 09/16/2021] [Indexed: 01/07/2023]
Abstract
Background: Mammalian genomes contain millions of putative regulatory sequences, which are delineated by binding of multiple transcription factors. The degree to which spacing and orientation constraints among transcription factor binding sites contribute to the recognition and identity of regulatory sequence is an unresolved but important question that impacts our understanding of genome function and evolution. Global mechanisms that underlie phenomena including the size of regulatory sequences, their uniqueness, and their evolutionary turnover remain poorly described. Results: Here, we ask whether models incorporating different degrees of spacing and orientation constraints among transcription factor binding sites are broadly consistent with several global properties of regulatory sequence. These properties include length, sequence diversity, turnover rate, and dominance of specific TFs in regulatory site identity and cell type specification. Models with and without spacing and orientation constraints are generally consistent with all observed properties of regulatory sequence, and with regulatory sequences being fundamentally small (~1 nucleosome). Uniqueness of regulatory regions and their rapid evolutionary turnover are expected under all models examined. An intriguing issue we identify is that the complexity of eukaryotic regulatory sites must scale with the number of active transcription factors, in order to accomplish observed specificity. Conclusions: Models of transcription factor binding with or without spacing and orientation constraints predict that regulatory sequences should be fundamentally short, unique, and turn over rapidly. We posit that the existence of master regulators may be, in part, a consequence of evolutionary pressure to limit the complexity and increase evolvability of regulatory sites.
Affiliation(s)
- Zain M Patel
- Donnelly Centre for Cellular and Biomolecular Research and Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 3E1, Canada
| | - Timothy R Hughes
- Donnelly Centre for Cellular and Biomolecular Research and Department of Molecular Genetics, University of Toronto, Toronto, ON, M5S 3E1, Canada.
| |
|
40
|
MacPhillamy C, Pitchford WS, Alinejad-Rokny H, Low WY. Opportunity to improve livestock traits using 3D genomics. Anim Genet 2021; 52:785-798. [PMID: 34494283 DOI: 10.1111/age.13135] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Accepted: 08/24/2021] [Indexed: 11/30/2022]
Abstract
The advent of high-throughput chromosome conformation capture and sequencing (Hi-C) has enabled researchers to probe the 3D architecture of the mammalian genome in a genome-wide manner. Simultaneously, advances in epigenomic assays, such as chromatin immunoprecipitation and sequencing (ChIP-seq) and DNase-seq, have enabled researchers to study cis-regulatory interactions and chromatin accessibility across the same genome-wide scale. The use of these data has revealed many unique insights into gene regulation and disease pathomechanisms in several model organisms. With the advent of these high-throughput sequencing technologies, there has been an ever-increasing number of datasets available for study; however, this is often limited to model organisms. Livestock species play critical roles in the economies of developing and developed nations alike. Despite this, they are greatly underrepresented in the 3D genomics space; Hi-C and related technologies have the potential to revolutionise livestock breeding by enabling a more comprehensive understanding of how production traits are controlled. The growth in human and model organism Hi-C data has seen a surge in the availability of computational tools for use in 3D genomics, with some tools using machine learning techniques to predict features and improve dataset quality. In this review, we provide an overview of the 3D genome and discuss the status of 3D genomics in livestock before delving into advancing the field by drawing inspiration from research in human and mouse. We end by offering future directions for livestock research in the field of 3D genomics.
Affiliation(s)
- C MacPhillamy
- Davies Livestock Research Centre, The University of Adelaide, Roseworthy Campus, Mudla Wirra Rd, Roseworthy, SA, 5371, Australia
| | - W S Pitchford
- Davies Livestock Research Centre, The University of Adelaide, Roseworthy Campus, Mudla Wirra Rd, Roseworthy, SA, 5371, Australia
| | - H Alinejad-Rokny
- Biological & Medical Machine Learning Lab, The Graduate School of Biomedical Engineering, UNSW Sydney, Sydney, NSW, 2052, Australia
- School of Computer Science and Engineering, The University of New South Wales (UNSW Sydney), Sydney, NSW, 2052, Australia
| | - W Y Low
- Davies Livestock Research Centre, The University of Adelaide, Roseworthy Campus, Mudla Wirra Rd, Roseworthy, SA, 5371, Australia
| |
|
41
|
Vaz JM, Balaji S. Convolutional neural networks (CNNs): concepts and applications in pharmacogenomics. Mol Divers 2021; 25:1569-1584. [PMID: 34031788 PMCID: PMC8342355 DOI: 10.1007/s11030-021-10225-3] [Citation(s) in RCA: 14] [Impact Index Per Article: 3.5] [Received: 03/06/2021] [Accepted: 04/21/2021] [Indexed: 12/17/2022]
Abstract
Convolutional neural networks (CNNs) have been used to extract information from various datasets of different dimensions. This approach has led to accurate interpretations in several subfields of biological research, like pharmacogenomics, addressing issues previously faced by other computational methods. With the rising attention for personalized and precision medicine, scientists and clinicians have now turned to artificial intelligence systems to provide them with solutions for therapeutics development. CNNs have already provided valuable insights into biological data transformation. Due to the rise of interest in precision and personalized medicine, in this review, we have provided a brief overview of the possibilities of implementing CNNs as an effective tool for analyzing one-dimensional biological data, such as nucleotide and protein sequences, as well as small molecular data, e.g., simplified molecular-input line-entry specification, InChI, binary fingerprints, etc., to categorize the models based on their objective and also highlight various challenges. The review is organized into specific research domains that participate in pharmacogenomics for a more comprehensive understanding. Furthermore, the future intentions of deep learning are outlined.
Affiliation(s)
- Joel Markus Vaz
- Department of Biotechnology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, 576104, India
| | - S Balaji
- Department of Biotechnology, Manipal Institute of Technology, Manipal Academy of Higher Education, Manipal, Karnataka, 576104, India.
| |
|
42
|
Asma H, Halfon MS. Annotating the Insect Regulatory Genome. Insects 2021; 12:591. [PMID: 34209769 PMCID: PMC8305585 DOI: 10.3390/insects12070591] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 05/28/2021] [Revised: 06/23/2021] [Accepted: 06/25/2021] [Indexed: 11/17/2022]
Abstract
An ever-growing number of insect genomes is being sequenced across the evolutionary spectrum. Comprehensive annotation of not only genes but also regulatory regions is critical for reaping the full benefits of this sequencing. Driven by developments in sequencing technologies and in both empirical and computational discovery strategies, the past few decades have witnessed dramatic progress in our ability to identify cis-regulatory modules (CRMs), sequences such as enhancers that play a major role in regulating transcription. Nevertheless, providing a timely and comprehensive regulatory annotation of newly sequenced insect genomes is an ongoing challenge. We review here the methods being used to identify CRMs in both model and non-model insect species, and focus on two tools that we have developed, REDfly and SCRMshaw. These resources can be paired together in a powerful combination to facilitate insect regulatory annotation over a broad range of species, with an accuracy equal to or better than that of other state-of-the-art methods.
Affiliation(s)
- Hasiba Asma
- Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
| | - Marc S. Halfon
- Program in Genetics, Genomics, and Bioinformatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biochemistry, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biomedical Informatics, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- Department of Biological Sciences, University at Buffalo-State University of New York, Buffalo, NY 14203, USA
- NY State Center of Excellence in Bioinformatics & Life Sciences, Buffalo, NY 14203, USA
| |
|
43
|
Hong J, Gao R, Yang Y. CrepHAN: Cross-species prediction of enhancers by using hierarchical attention networks. Bioinformatics 2021; 37:3436-3443. [PMID: 33978703 DOI: 10.1093/bioinformatics/btab349] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/21/2020] [Revised: 04/21/2021] [Accepted: 05/06/2021] [Indexed: 01/17/2023]
Abstract
MOTIVATION: Enhancers are important functional elements in genome sequences. The identification of enhancers is a very challenging task due to the great diversity of enhancer sequences and their flexible localization on genomes. To date, the interactions between enhancers and genes are not fully understood. To speed up studies of the regulatory roles of enhancers, computational tools for enhancer prediction have emerged in recent years. In particular, thanks to the ENCODE project and advances in high-throughput experimental techniques, a large number of experimentally verified enhancers have been annotated on the human genome, which allows large-scale prediction of unknown enhancers using data-driven methods. However, except for human and some model organisms, validated enhancer annotations are scarce for most species, making the computational identification of enhancers in their genomes more difficult. RESULTS: In this study, we propose a deep learning-based predictor for enhancers, named CrepHAN, which features a hierarchical attention neural network and word embedding-based representations of DNA sequences. We use experimentally supported data from the human genome to train the model, and perform experiments on human and other mammals, including mouse, cow, and dog. The experimental results show that CrepHAN is advantageous for cross-species prediction and outperforms existing models by a large margin. In particular, for human-mouse cross-predictions, the area under the ROC curve (AUC) increases by 0.033∼0.145 on the combined tissue dataset and 0.032∼0.109 on tissue-specific datasets. AVAILABILITY: bcmi.sjtu.edu.cn/~yangyang/CrepHAN.html. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jianwei Hong
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Rd., Shanghai 200240, China
- School of Agriculture and Biology, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Ruitian Gao
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai 200240, China
| | - Yang Yang
- Department of Computer Science and Engineering, Shanghai Jiao Tong University, 800 Dong Chuan Rd., Shanghai 200240, China
- Key Laboratory of Shanghai Education Commission for Intelligent Interaction and Cognitive Engineering, Shanghai, 200240, China
| |
|
44
|
Schonfeld E, Vendrow E, Vendrow J, Schonfeld E. On the relation of gene essentiality to intron structure: a computational and deep learning approach. Life Sci Alliance 2021; 4:e202000951. [PMID: 33906938 PMCID: PMC8127325 DOI: 10.26508/lsa.202000951] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/01/2020] [Revised: 04/12/2021] [Accepted: 04/15/2021] [Indexed: 11/24/2022]
Abstract
Essential genes have been studied by copy number variants and deletions, both associated with introns. The premise of our work is that introns of essential genes have distinct characteristic properties. We provide support for this by training a deep learning model and demonstrating that introns alone can be used to classify essentiality. The model, limited to first introns, performs at an increased level, implicating first introns in essentiality. We identify unique properties of introns of essential genes, finding that their structure protects against deletion and intron-loss events, especially centered on the first intron. We show that GC density is increased in the first introns of essential genes, allowing for increased enhancer activity, protection against deletions, and improved splice site recognition. We find that first introns of essential genes are of remarkably smaller size than their nonessential counterparts, and to protect against common 3' end deletion events, essential genes carry an increased number of (smaller) introns. To demonstrate the importance of the seven features we identified, we train a feature-based model using only these features and achieve high performance.
Affiliation(s)
| | | | - Joshua Vendrow
- University of California, Los Angeles, Los Angeles, CA, USA
| | | |
|
45
|
Stearrett N, Dawson T, Rahnavard A, Bachali P, Bendall ML, Zeng C, Caricchio R, Pérez-Losada M, Grammer AC, Lipsky PE, Crandall KA. Expression of Human Endogenous Retroviruses in Systemic Lupus Erythematosus: Multiomic Integration With Gene Expression. Front Immunol 2021; 12:661437. [PMID: 33986751 PMCID: PMC8112243 DOI: 10.3389/fimmu.2021.661437] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 01/30/2021] [Accepted: 04/12/2021] [Indexed: 11/20/2022]
Abstract
Systemic lupus erythematosus (SLE) is a chronic autoimmune disease characterized by the production of autoantibodies predominantly to nuclear material. Many aspects of disease pathology are mediated by the deposition of nucleic acid-containing immune complexes, which also induce the type 1 interferon response, a characteristic feature of SLE. Notably, SLE is remarkably heterogeneous, with a variety of organs involved in different individuals, who also show variation in disease severity related to their ancestries. Here, we probed one potential contribution to disease heterogeneity as well as a possible source of immunoreactive nucleic acids by exploring the expression of human endogenous retroviruses (HERVs). We investigated the expression of HERVs in SLE and their potential relationship to SLE features and the expression of biochemical pathways, including the interferon gene signature (IGS). Towards this goal, we analyzed available and new RNA-Seq data from two independent whole blood studies using Telescope. We identified 481 locus-specific HERV encoding regions that are differentially expressed between case and control individuals with only 14% overlap of differentially expressed HERVs between these two datasets. We identified significant differences between differentially expressed HERVs and non-differentially expressed HERVs between the two datasets. We also characterized the host differentially expressed genes and tested their association with the differentially expressed HERVs. We found that differentially expressed HERVs were significantly more physically proximal to host differentially expressed genes than non-differentially expressed HERVs. Finally, we capitalized on locus-specific resolution of HERV mapping to identify key molecular pathways impacted by differential HERV expression in people with SLE.
Affiliation(s)
- Nathaniel Stearrett
- Computational Biology Institute, George Washington University, Washington, DC, United States
| | - Tyson Dawson
- Computational Biology Institute, George Washington University, Washington, DC, United States
| | - Ali Rahnavard
- Computational Biology Institute, George Washington University, Washington, DC, United States
- Department of Biostatistics & Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, United States
| | - Prathyusha Bachali
- RILITE Research Institute and AMPEL BioSolutions, Charlottesville, VA, United States
| | - Matthew L. Bendall
- Division of Infectious Diseases, Department of Medicine, Weill Cornell Medicine, New York, NY, United States
| | - Chen Zeng
- Department of Physics, The George Washington University, Washington, DC, United States
| | - Roberto Caricchio
- Lewis Katz School of Medicine, Temple University, Philadelphia, PA, United States
| | - Marcos Pérez-Losada
- Computational Biology Institute, George Washington University, Washington, DC, United States
- Department of Biostatistics & Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, United States
- CIBIO-InBIO, Centro de Investigação em Biodiversidade e Recursos Genéticos, Universidade do Porto, Vairão, Portugal
| | - Amrie C. Grammer
- RILITE Research Institute and AMPEL BioSolutions, Charlottesville, VA, United States
| | - Peter E. Lipsky
- RILITE Research Institute and AMPEL BioSolutions, Charlottesville, VA, United States
| | - Keith A. Crandall
- Computational Biology Institute, George Washington University, Washington, DC, United States
- Department of Biostatistics & Bioinformatics, Milken Institute School of Public Health, George Washington University, Washington, DC, United States
| |
|
46
|
Role of Non-Coding Regulatory Elements in the Control of GR-Dependent Gene Expression. Int J Mol Sci 2021; 22:4258. [PMID: 33923915 PMCID: PMC8073421 DOI: 10.3390/ijms22084258] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2021] [Revised: 04/06/2021] [Accepted: 04/08/2021] [Indexed: 11/17/2022]
Abstract
The glucocorticoid receptor (GR, also known as NR3C1) coordinates molecular responses to stress. It is a potent transcription activator and repressor that influences hundreds of genes. Enhancers are non-coding DNA regions outside of the core promoters that increase transcriptional activity via long-distance interactions. Active GR binds to pre-existing enhancer sites and recruits further factors, including EP300, a known transcriptional coactivator. However, it is not known how the timing of GR-binding-induced enhancer remodeling relates to transcriptional changes. Here we analyze data from the ENCODE project, which provides ChIP-Seq and RNA-Seq data at distinct time points after dexamethasone exposure of the human A549 epithelial-like cell line. This study aimed to investigate the temporal interplay between GR binding, enhancer remodeling, and gene expression. By investigating a single distal GR-binding site for each differentially upregulated gene, we show that transcriptional changes follow GR binding, and that the largest enhancer remodeling coincides in time with the highest gene expression changes. A detailed analysis of the time course showed that for upregulated genes, enhancer activation persists after gene expression changes settle. Moreover, genes with the largest change in EP300 binding showed the highest expression dynamics before the peak of EP300 recruitment. Overall, our results show that enhancer remodeling may not directly be driving gene expression dynamics but rather be a consequence of expression activation.
|
47
|
Parisi C, Vashisht S, Winata CL. Fish-Ing for Enhancers in the Heart. Int J Mol Sci 2021; 22:3914. [PMID: 33920121 PMCID: PMC8069060 DOI: 10.3390/ijms22083914] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 03/15/2021] [Revised: 04/07/2021] [Accepted: 04/08/2021] [Indexed: 12/19/2022]
Abstract
Precise control of gene expression is crucial to ensure proper development and biological functioning of an organism. Enhancers are non-coding DNA elements which play an essential role in regulating gene expression. They contain specific sequence motifs serving as binding sites for transcription factors which interact with the basal transcription machinery at their target genes. Heart development is regulated by an intricate gene regulatory network that ensures a precise spatiotemporal gene expression program. Mutations affecting enhancers have been shown to result in devastating forms of congenital heart defect. Therefore, identifying enhancers implicated in heart biology and understanding their mechanism is key to improving diagnosis and therapeutic options. Despite their crucial role, enhancers are poorly studied, mainly due to the lack of a reliable way to identify them and determine their function. Nevertheless, recent technological advances have allowed rapid progress in enhancer discovery. Model organisms such as the zebrafish have contributed significant insights into the genetics of heart development by enabling functional analyses of genes and their regulatory elements in vivo. Here, we summarize the current state of knowledge on heart enhancers gained through studies in model organisms, discuss various approaches to discover and study their function, and finally suggest methods that could further advance research in this field.
Affiliation(s)
- Costantino Parisi
- International Institute of Molecular and Cell Biology in Warsaw, 02-109 Warsaw, Poland
| | - Shikha Vashisht
- International Institute of Molecular and Cell Biology in Warsaw, 02-109 Warsaw, Poland
| | - Cecilia Lanny Winata
- International Institute of Molecular and Cell Biology in Warsaw, 02-109 Warsaw, Poland
- Max Planck Institute for Heart and Lung Research, 61231 Bad Nauheim, Germany
| |
|
48
|
Singh D, Yi SV. Enhancer pleiotropy, gene expression, and the architecture of human enhancer-gene interactions. Mol Biol Evol 2021; 38:3898-3909. [PMID: 33749795 PMCID: PMC8383896 DOI: 10.1093/molbev/msab085] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 10/31/2020] [Revised: 02/10/2021] [Accepted: 03/18/2021] [Indexed: 12/30/2022]
Abstract
Enhancers are often studied as noncoding regulatory elements that modulate the precise spatiotemporal expression of genes in a highly tissue-specific manner. This paradigm has been challenged by recent evidence of individual enhancers acting in multiple tissues or developmental contexts. However, the frequency of these enhancers with high degrees of “pleiotropy” out of all putative enhancers is not well understood. Consequently, it is unclear how the variation of enhancer pleiotropy corresponds to the variation in expression breadth of target genes. Here, we use multi-tissue chromatin maps from diverse human tissues to investigate the enhancer–gene interaction architecture while accounting for 1) the distribution of enhancer pleiotropy, 2) the variations of regulatory links from enhancers to target genes, and 3) the expression breadth of target genes. We show that most enhancers are tissue-specific and that highly pleiotropic enhancers account for <1% of all putative regulatory sequences in the human genome. Notably, several genomic features are indicative of increasing enhancer pleiotropy, including longer sequence length, greater number of links to genes, increasing abundance and diversity of encoded transcription factor motifs, and stronger evolutionary conservation. Intriguingly, the number of enhancers per gene remains remarkably consistent for all genes (∼14). However, enhancer pleiotropy does not directly translate to the expression breadth of target genes. We further present a series of Gaussian Mixture Models to represent this organization architecture. Consequently, we demonstrate that a modest trend of more pleiotropic enhancers targeting more broadly expressed genes can generate the observed diversity of expression breadths in the human genome.
Affiliation(s)
- Devika Singh
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, USA
| | - Soojin V Yi
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, USA
| |
|
49
|
Singh G, Mullany S, Moorthy SD, Zhang R, Mehdi T, Tian R, Duncan AG, Moses AM, Mitchell JA. A flexible repertoire of transcription factor binding sites and a diversity threshold determines enhancer activity in embryonic stem cells. Genome Res 2021; 31:564-575. [PMID: 33712417 PMCID: PMC8015845 DOI: 10.1101/gr.272468.120] [Citation(s) in RCA: 31] [Impact Index Per Article: 7.8] [Received: 10/01/2020] [Accepted: 02/19/2021] [Indexed: 12/28/2022]
Abstract
Transcriptional enhancers are critical for development and phenotype evolution and are often mutated in disease contexts; however, even in well-studied cell types, the sequence code conferring enhancer activity remains unknown. To examine the enhancer regulatory code for pluripotent stem cells, we identified genomic regions with conserved binding of multiple transcription factors in mouse and human embryonic stem cells (ESCs). Examination of these regions revealed that they contain on average 12.6 conserved transcription factor binding site (TFBS) sequences. Enriched TFBSs are a diverse repertoire of 70 different sequences representing the binding sequences of both known and novel ESC regulators. Using a diverse set of TFBSs from this repertoire was sufficient to construct short synthetic enhancers with activity comparable to native enhancers. Site-directed mutagenesis of conserved TFBSs in endogenous enhancers or TFBS deletion from synthetic sequences revealed a requirement for 10 or more different TFBSs. Furthermore, specific TFBSs, including the POU5F1:SOX2 comotif, are dispensable, despite cobinding the POU5F1 (also known as OCT4), SOX2, and NANOG master regulators of pluripotency. These findings reveal that a TFBS sequence diversity threshold overrides the need for optimized regulatory grammar and individual TFBSs that recruit specific master regulators.
Affiliation(s)
- Gurdeep Singh
- Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, M5S 3G5, Canada
| | - Shanelle Mullany
- Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, M5S 3G5, Canada
| | - Sakthi D Moorthy
- Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, M5S 3G5, Canada
| | - Richard Zhang
- Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, M5S 3G5, Canada
| | - Tahmid Mehdi
- Department of Computer Science, University of Toronto, Toronto, M5S 2E4, Canada
| | - Ruxiao Tian
- Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, M5S 3G5, Canada
| | - Andrew G Duncan
- Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, M5S 3G5, Canada
| | - Alan M Moses
- Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, M5S 3G5, Canada
- Department of Computer Science, University of Toronto, Toronto, M5S 2E4, Canada
- Department of Ecology and Evolutionary Biology, University of Toronto, Toronto, M5S 3B3, Canada
| | - Jennifer A Mitchell
- Department of Cell and Systems Biology, University of Toronto, Toronto, Ontario, M5S 3G5, Canada
| |
|
50
|
Krützfeldt LM, Schubach M, Kircher M. The impact of different negative training data on regulatory sequence predictions. PLoS One 2020; 15:e0237412. [PMID: 33259518 PMCID: PMC7707526 DOI: 10.1371/journal.pone.0237412] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 07/24/2020] [Accepted: 11/12/2020] [Indexed: 01/08/2023]
Abstract
Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding, and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training datasets, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell type, predicting cell-type-specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need for hyperparameter optimization.
Affiliation(s)
- Louisa-Marie Krützfeldt
- Charité–Universitätsmedizin Berlin, Berlin, Germany
- Berlin Institute of Health (BIH), Berlin, Germany
| | - Max Schubach
- Charité–Universitätsmedizin Berlin, Berlin, Germany
- Berlin Institute of Health (BIH), Berlin, Germany
| | - Martin Kircher
- Charité–Universitätsmedizin Berlin, Berlin, Germany
- Berlin Institute of Health (BIH), Berlin, Germany
| |
|