1
Cross-species enhancer prediction using machine learning. Genomics 2022; 114:110454. [PMID: 36030022 DOI: 10.1016/j.ygeno.2022.110454] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/30/2022] [Revised: 07/28/2022] [Accepted: 08/16/2022] [Indexed: 11/21/2022]
Abstract
Cis-regulatory elements (CREs) are non-coding parts of the genome that play a critical role in gene expression regulation. Enhancers, as an important example of CREs, interact with genes to influence complex traits like disease, heat tolerance and growth rate. Much of what is known about enhancers comes from studies of humans and a few model organisms like mouse, with little known about other mammalian species. Previous studies have attempted to identify enhancers in less-studied mammals using comparative genomics but with limited success. Recently, machine learning (ML) techniques have shown promising results in predicting enhancer regions. Here, we investigated the ability of ML methods to identify enhancers in three non-model mammalian species (cattle, pig and dog) using human and mouse enhancer data from VISTA and publicly available ChIP-seq. We tested nine models, using four different representations of the DNA sequences in cross-species prediction using both the VISTA dataset and species-specific ChIP-seq data. We identified between 809,399 and 877,278 enhancer-like regions (ELRs) in the study species (11.6-13.7% of each genome). These predictions were close to the ~8% proportion of ELRs that covered the human genome. We propose that our ML methods have predictive ability for identifying enhancers in non-model mammalian species. We have provided a list of high-confidence enhancers at https://github.com/DaviesCentreInformatics/Cross-species-enhancer-prediction and believe these enhancers will be of great use to the community.
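The abstract does not name the four DNA sequence representations that were tested. As an illustration only, one widely used way to turn a sequence into model input is a fixed-length k-mer frequency vector. The sketch below assumes that representation; the function names and the default `k=3` are hypothetical and not taken from the paper or its repository.

```python
from collections import Counter
from itertools import product


def kmer_counts(seq, k=3):
    """Count overlapping k-mers, skipping windows that contain non-ACGT characters."""
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if all(base in "ACGT" for base in kmer):
            counts[kmer] += 1
    return counts


def kmer_vector(seq, k=3):
    """Return k-mer frequencies over all 4**k k-mers in lexicographic order.

    This is an assumed, generic representation, not the authors' method.
    """
    counts = kmer_counts(seq, k)
    total = sum(counts.values())
    keys = ["".join(p) for p in product("ACGT", repeat=k)]
    # An all-zero vector is returned when no valid k-mer is found.
    return [counts[key] / total if total else 0.0 for key in keys]
```

A vector built this way has the same length for every sequence, which is what lets a classifier trained on human or mouse sequences be applied unchanged to sequences from another species.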
2
Kaplow IM, Schäffer DE, Wirthlin ME, Lawler AJ, Brown AR, Kleyman M, Pfenning AR. Inferring mammalian tissue-specific regulatory conservation by predicting tissue-specific differences in open chromatin. BMC Genomics 2022; 23:291. [PMID: 35410163 PMCID: PMC8996547 DOI: 10.1186/s12864-022-08450-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/17/2021] [Accepted: 03/07/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Evolutionary conservation is an invaluable tool for inferring functional significance in the genome, including regions that are crucial across many species and those that have undergone convergent evolution. Computational methods to test for sequence conservation are dominated by algorithms that examine the ability of one or more nucleotides to align across large evolutionary distances. While these nucleotide alignment-based approaches have proven powerful for protein-coding genes and some non-coding elements, they fail to capture conservation of many enhancers, distal regulatory elements that control spatial and temporal patterns of gene expression. The function of enhancers is governed by a complex, often tissue- and cell type-specific code that links combinations of transcription factor binding sites and other regulation-related sequence patterns to regulatory activity. Thus, function of orthologous enhancer regions can be conserved across large evolutionary distances, even when nucleotide turnover is high.
RESULTS: We present a new machine learning-based approach for evaluating enhancer conservation that leverages the combinatorial sequence code of enhancer activity rather than relying on the alignment of individual nucleotides. We first train a convolutional neural network model that can predict tissue-specific open chromatin, a proxy for enhancer activity, across mammals. Next, we apply that model to distinguish instances where the genome sequence would predict conserved function versus a loss of regulatory activity in that tissue. We present criteria for systematically evaluating model performance for this task and use them to demonstrate that our models accurately predict tissue-specific conservation and divergence in open chromatin between primate and rodent species, vastly out-performing leading nucleotide alignment-based approaches. We then apply our models to predict open chromatin at orthologs of brain and liver open chromatin regions across hundreds of mammals and find that brain enhancers associated with neuron activity have a stronger tendency than the general population to have predicted lineage-specific open chromatin.
CONCLUSION: The framework presented here provides a mechanism to annotate tissue-specific regulatory function across hundreds of genomes and to study enhancer evolution using predicted regulatory differences rather than nucleotide-level conservation measurements.
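Convolutional networks that operate on DNA usually take a one-hot encoded sequence as input. The abstract does not describe the input encoding, so the sketch below is an assumption about a typical setup rather than a description of the authors' pipeline; the function name and the A, C, G, T channel order are hypothetical.

```python
def one_hot(seq):
    """Encode a DNA sequence as a list of 4-element rows in A, C, G, T order.

    Ambiguous or unknown bases (for example N) become an all-zero row.
    This encoding is assumed for illustration, not taken from the paper.
    """
    mapping = {
        "A": [1, 0, 0, 0],
        "C": [0, 1, 0, 0],
        "G": [0, 0, 1, 0],
        "T": [0, 0, 0, 1],
    }
    # Copy each row so callers can modify the result without touching the mapping.
    return [list(mapping.get(base, [0, 0, 0, 0])) for base in seq.upper()]
```

Because the encoding depends only on the sequence itself, a model trained this way can score an orthologous region in any genome without a nucleotide-level alignment.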
Affiliation(s)
- Irene M Kaplow
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Daniel E Schäffer
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Morgan E Wirthlin
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Alyssa J Lawler
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Ashley R Brown
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Michael Kleyman
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
- Andreas R Pfenning
  - Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
  - Neuroscience Institute, Carnegie Mellon University, Pittsburgh, PA, USA
  - Department of Biology, Carnegie Mellon University, Pittsburgh, PA, USA
3
Powell G, Long H, Zolkiewski L, Dumbell R, Mallon AM, Lindgren CM, Simon MM. Modelling the genetic aetiology of complex disease: human-mouse conservation of noncoding features and disease-associated loci. Biol Lett 2022; 18:20210630. [PMID: 35317627 PMCID: PMC8941414 DOI: 10.1098/rsbl.2021.0630] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 12/21/2022] Open
Abstract
Understanding the genetic aetiology of loci associated with a disease is crucial for developing preventative measures and effective treatments. Mouse models are used extensively to understand human pathobiology and mechanistic functions of disease-associated loci. However, the utility of mouse models is limited in part by evolutionary divergence in transcription regulation for pathways of interest. Here, we summarize the alignment of genomic (exonic and multi-cell regulatory) annotations alongside Mendelian and complex disease-associated variant sites between humans and mice. Our results highlight the importance of understanding evolutionary divergence in transcription regulation when interpreting functional studies using mice as models for human disease variants.
Affiliation(s)
- George Powell
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
  - MRC Harwell Institute, Mammalian Genetics Unit, Oxfordshire OX11 0RD, UK
- Helen Long
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
  - MRC Harwell Institute, Mammalian Genetics Unit, Oxfordshire OX11 0RD, UK
- Louisa Zolkiewski
  - MRC Harwell Institute, Mammalian Genetics Unit, Oxfordshire OX11 0RD, UK
  - Department of Physiology, Anatomy and Genetics, University of Oxford, Oxford OX3 7BN, UK
- Rebecca Dumbell
  - Nottingham Trent University, Clifton Lane, Nottingham NG11 8NS, UK
- Ann-Marie Mallon
  - MRC Harwell Institute, Mammalian Genetics Unit, Oxfordshire OX11 0RD, UK
- Cecilia M Lindgren
  - Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford OX3 7LF, UK
  - Wellcome Centre for Human Genetics, University of Oxford, Oxford OX3 7BN, UK
  - Nottingham Trent University, Clifton Lane, Nottingham NG11 8NS, UK
  - Medical and Population Genetics Program, Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Michelle M Simon
  - MRC Harwell Institute, Mammalian Genetics Unit, Oxfordshire OX11 0RD, UK
4
Cochran K, Srivastava D, Shrikumar A, Balsubramani A, Hardison RC, Kundaje A, Mahony S. Domain adaptive neural networks improve cross-species prediction of transcription factor binding. Genome Res 2022; 32:512-523. [PMID: 35042722 PMCID: PMC8896468 DOI: 10.1101/gr.275394.121] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 02/15/2021] [Accepted: 01/10/2022] [Indexed: 11/29/2022]
Abstract
The intrinsic DNA sequence preferences and cell type–specific cooperative partners of transcription factors (TFs) are typically highly conserved. Hence, despite the rapid evolutionary turnover of individual TF binding sites, predictive sequence models of cell type–specific genomic occupancy of a TF in one species should generalize to closely matched cell types in a related species. To assess the viability of cross-species TF binding prediction, we train neural networks to discriminate ChIP-seq peak locations from genomic background and evaluate their performance within and across species. Cross-species predictive performance is consistently worse than within-species performance, which we show is caused in part by species-specific repeats. To account for this domain shift, we use an augmented network architecture to automatically discourage learning of training species–specific sequence features. This domain adaptation approach corrects for prediction errors on species-specific repeats and improves overall cross-species model performance. Our results show that cross-species TF binding prediction is feasible when models account for domain shifts driven by species-specific repeats.
5
Kiser JN, Wang Z, Zanella R, Scraggs E, Neupane M, Cantrell B, Van Tassell CP, White SN, Taylor JF, Neibergs HL. Functional Variants Surrounding Endothelin 2 Are Associated With Mycobacterium avium Subspecies paratuberculosis Infection. Front Vet Sci 2021; 8:625323. [PMID: 34026885 PMCID: PMC8131860 DOI: 10.3389/fvets.2021.625323] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 11/02/2020] [Accepted: 03/04/2021] [Indexed: 02/04/2023] Open
Abstract
Bovine paratuberculosis, caused by Mycobacterium avium subspecies paratuberculosis (MAP), continues to impact the dairy industry through increased morbidity, mortality, and lost production. Although genome-wide association analyses (GWAAs) have identified loci associated with susceptibility to MAP, limited progress has been made in identifying mutations that cause disease susceptibility. A 235-kb region on Bos taurus chromosome 3 (BTA3), containing a 70-kb haplotype block surrounding endothelin 2 (EDN2), has previously been associated with the risk of MAP infection. EDN2 is highly expressed in the gut and is involved in intracellular calcium signaling and a wide array of biological processes. The objective of this study was to identify putative causal mutations for disease susceptibility in the region surrounding EDN2 in Holstein and Jersey cattle. Using sequence data from 10 Holstein and 10 Jersey cattle, common variants within the 70-kb region containing EDN2 were identified. A custom SNP genotyping array fine-mapped the region using 221 Holstein and 51 Jersey cattle and identified 17 putative causal variants (P < 0.01) located in the 5′ region of EDN2 and a SNP in the 3′ UTR (P = 0.00009) associated with MAP infection. MicroRNA interference assays, mRNA stability assays, and electrophoretic mobility shift assays were performed to determine if allelic changes at each SNP resulted in differences in EDN2 stability or expression. Two SNPs [rs109651404 (G/A) and rs110287192 (G/T)] located within the promoter region of EDN2 displayed differential binding affinity for transcription factors in binding sequences harboring the alternate SNP alleles. The luciferase reporter assay revealed that the transcriptional activity of the EDN2 promoter was increased (P < 0.05) with the A allele for rs109651404 and the G allele for rs110287192. These results suggest that the variants rs109651404 and rs110287192 are mutations that alter transcription and thus may alter susceptibility to MAP infection in Holstein and Jersey cattle.
Affiliation(s)
- Jennifer N Kiser
  - Department of Animal Sciences, Washington State University, Pullman, WA, United States
- Zeping Wang
  - Department of Animal Sciences, Washington State University, Pullman, WA, United States
- Ricardo Zanella
  - Department of Animal Sciences, Washington State University, Pullman, WA, United States
- Erik Scraggs
  - Department of Animal Sciences, Washington State University, Pullman, WA, United States
- Mahesh Neupane
  - Department of Animal Sciences, Washington State University, Pullman, WA, United States
- Bonnie Cantrell
  - Department of Animal Sciences, Washington State University, Pullman, WA, United States
- Curtis P Van Tassell
  - Animal Genomics and Improvement Laboratory, United States Department of Agriculture, Agricultural Research Service, Beltsville, MD, United States
- Stephen N White
  - Animal Disease Research, United States Department of Agriculture, Agricultural Research Service, Pullman, WA, United States
  - Department of Veterinary Microbiology and Pathology, Washington State University, Pullman, WA, United States
  - Center for Reproductive Biology, Washington State University, Pullman, WA, United States
- Jeremy F Taylor
  - Division of Animal Sciences, University of Missouri, Columbia, MO, United States
- Holly L Neibergs
  - Department of Animal Sciences, Washington State University, Pullman, WA, United States
6
Singh D, Yi SV. Enhancer pleiotropy, gene expression, and the architecture of human enhancer-gene interactions. Mol Biol Evol 2021; 38:3898-3909. [PMID: 33749795 PMCID: PMC8383896 DOI: 10.1093/molbev/msab085] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 10/31/2020] [Revised: 02/10/2021] [Accepted: 03/18/2021] [Indexed: 12/30/2022] Open
Abstract
Enhancers are often studied as noncoding regulatory elements that modulate the precise spatiotemporal expression of genes in a highly tissue-specific manner. This paradigm has been challenged by recent evidence of individual enhancers acting in multiple tissues or developmental contexts. However, the frequency of these enhancers with high degrees of “pleiotropy” out of all putative enhancers is not well understood. Consequently, it is unclear how the variation of enhancer pleiotropy corresponds to the variation in expression breadth of target genes. Here, we use multi-tissue chromatin maps from diverse human tissues to investigate the enhancer–gene interaction architecture while accounting for 1) the distribution of enhancer pleiotropy, 2) the variations of regulatory links from enhancers to target genes, and 3) the expression breadth of target genes. We show that most enhancers are tissue-specific and that highly pleiotropic enhancers account for <1% of all putative regulatory sequences in the human genome. Notably, several genomic features are indicative of increasing enhancer pleiotropy, including longer sequence length, greater number of links to genes, increasing abundance and diversity of encoded transcription factor motifs, and stronger evolutionary conservation. Intriguingly, the number of enhancers per gene remains remarkably consistent for all genes (∼14). However, enhancer pleiotropy does not directly translate to the expression breadth of target genes. We further present a series of Gaussian Mixture Models to represent this organization architecture. Consequently, we demonstrate that a modest trend of more pleiotropic enhancers targeting more broadly expressed genes can generate the observed diversity of expression breadths in the human genome.
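The abstract mentions Gaussian Mixture Models without giving their form. As a generic illustration only, and not a reconstruction of the authors' models, a one-dimensional Gaussian mixture can be fitted with expectation-maximization as sketched below. The function names, the quantile-based initialization, and the variance floor are assumptions made for this example.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def fit_gmm_1d(data, n_components=2, n_iter=100):
    """Fit a 1D Gaussian mixture by EM; returns (weights, means, variances).

    Generic textbook sketch with hypothetical defaults.
    """
    n = len(data)
    ordered = sorted(data)
    # Initialize means at evenly spaced quantiles so the start is deterministic.
    means = [ordered[int((k + 0.5) * n / n_components)] for k in range(n_components)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [max(overall_var, 1e-6)] * n_components
    weights = [1.0 / n_components] * n_components
    for _ in range(n_iter):
        # E-step: responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens) or 1e-300  # guard against division by zero
            resp.append([d / total for d in dens])
        # M-step: re-estimate weights, means, and variances.
        for k in range(n_components):
            nk = sum(r[k] for r in resp)
            if nk < 1e-12:
                continue  # leave an empty component unchanged
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            variances[k] = max(var_k, 1e-6)  # floor avoids a collapsed component
    return weights, means, variances
```

For example, calling `fit_gmm_1d(values, n_components=2)` on a list of per-gene measurements would return the mixing weights, means, and variances of two components; which quantity to model and how many components to use are choices the abstract leaves unspecified.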
Affiliation(s)
- Devika Singh
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, USA
- Soojin V Yi
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, Georgia, USA