1
|
Baumgarten N, Rumpf L, Kessler T, Schulz MH. A statistical approach for identifying single nucleotide variants that affect transcription factor binding. iScience 2024; 27:109765. [PMID: 38736546 PMCID: PMC11088338 DOI: 10.1016/j.isci.2024.109765] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/14/2023] [Revised: 01/30/2024] [Accepted: 04/15/2024] [Indexed: 05/14/2024] Open
Abstract
Non-coding variants located within regulatory elements may alter gene expression by modifying transcription factor (TF) binding sites, thereby leading to functional consequences. Different TF models are being used to assess the effect of DNA sequence variants, such as single nucleotide variants (SNVs). Often existing methods are slow and do not assess statistical significance of results. We investigated the distribution of absolute maximal differential TF binding scores for general computational models that affect TF binding. We find that a modified Laplace distribution can adequately approximate the empirical distributions. A benchmark on in vitro and in vivo datasets showed that our approach improves upon an existing method in terms of performance and speed. Applications on eQTLs and on a genome-wide association study illustrate the usefulness of our statistics by highlighting cell type-specific regulators and target genes. An implementation of our approach is freely available on GitHub and as bioconda package.
Collapse
Affiliation(s)
- Nina Baumgarten
- Institute of Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computational Genomic Medicine, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computer Science, Goethe University, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590 Frankfurt am Main, Germany
| | - Laura Rumpf
- Institute of Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computational Genomic Medicine, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computer Science, Goethe University, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590 Frankfurt am Main, Germany
| | - Thorsten Kessler
- German Heart Centre Munich, Department of Cardiology, School of Medicine and Health, Technical University of Munich, 80636 Munich, Germany
- German Centre for Cardiovascular Research, Partner Site Munich Heart Alliance, 80636 Munich, Germany
| | - Marcel H. Schulz
- Institute of Cardiovascular Regeneration, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computational Genomic Medicine, Goethe University, 60590 Frankfurt am Main, Germany
- Institute for Computer Science, Goethe University, 60590 Frankfurt am Main, Germany
- German Center for Cardiovascular Research, Partner Site Rhein-Main, 60590 Frankfurt am Main, Germany
| |
Collapse
|
2
|
Li Y, Zhang XO, Liu Y, Lu A. Allele-specific binding (ASB) analyzer for annotation of allele-specific binding SNPs. BMC Bioinformatics 2023; 24:464. [PMID: 38066439 PMCID: PMC10709849 DOI: 10.1186/s12859-023-05604-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/13/2022] [Accepted: 12/05/2023] [Indexed: 12/18/2023] Open
Abstract
BACKGROUND Allele-specific binding (ASB) events occur when transcription factors (TFs) bind more favorably to one of the two parental alleles at heterozygous single nucleotide polymorphisms (SNPs). Evidence suggests that ASB events could reveal the impact of sequence variations on TF binding and may have implications for the risk of diseases. RESULTS Here we present ASB-analyzer, a software platform that enables the users to quickly and efficiently input raw sequencing data to generate individual reports containing the cytogenetic map of ASB SNPs and their associated phenotypes. This interactive tool thereby combines ASB SNP identification, biological annotation, motif analysis, phenotype associations and report summary in one pipeline. With this pipeline, we identified 3772 ASB SNPs from thirty GM12878 ChIP-seq datasets and demonstrated that the ASB SNPs were more likely to be enriched at important sites in TF-binding domains. CONCLUSIONS ASB-analyzer is a user-friendly tool that enables the detection, characterization and visualization of ASB SNPs. It is implemented in Python, R and bash shell and packaged in the Conda environment. It is available as an open-source tool on GitHub at https://github.com/Liying1996/ASBanalyzer .
Collapse
Affiliation(s)
- Ying Li
- Research Center for Translational Medicine at East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, 200010, China
| | - Xiao-Ou Zhang
- Research Center for Translational Medicine at East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, 200010, China
| | - Yan Liu
- Research Center for Translational Medicine at East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, 200010, China
| | - Aiping Lu
- Research Center for Translational Medicine at East Hospital, School of Life Sciences and Technology, Tongji University, Shanghai, 200010, China.
| |
Collapse
|
3
|
Tahara S, Tsuchiya T, Matsumoto H, Ozaki H. Transcription factor-binding k-mer analysis clarifies the cell type dependency of binding specificities and cis-regulatory SNPs in humans. BMC Genomics 2023; 24:597. [PMID: 37805453 PMCID: PMC10560430 DOI: 10.1186/s12864-023-09692-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2023] [Accepted: 09/21/2023] [Indexed: 10/09/2023] Open
Abstract
BACKGROUND Transcription factors (TFs) exhibit heterogeneous DNA-binding specificities in individual cells and whole organisms under natural conditions, and de novo motif discovery usually provides multiple motifs, even from a single chromatin immunoprecipitation-sequencing (ChIP-seq) sample. Despite the accumulation of ChIP-seq data and ChIP-seq-derived motifs, the diversity of DNA-binding specificities across different TFs and cell types remains largely unexplored. RESULTS Here, we applied MOCCS2, our k-mer-based motif discovery method, to a collection of human TF ChIP-seq samples across diverse TFs and cell types, and systematically computed profiles of TF-binding specificity scores for all k-mers. After quality control, we compiled a set of TF-binding specificity score profiles for 2,976 high-quality ChIP-seq samples, comprising 473 TFs and 398 cell types. Using these high-quality samples, we confirmed that the k-mer-based TF-binding specificity profiles reflected TF- or TF-family dependent DNA-binding specificities. We then compared the binding specificity scores of ChIP-seq samples with the same TFs but with different cell type classes and found that half of the analyzed TFs exhibited differences in DNA-binding specificities across cell type classes. Additionally, we devised a method to detect differentially bound k-mers between two ChIP-seq samples and detected k-mers exhibiting statistically significant differences in binding specificity scores. Moreover, we demonstrated that differences in the binding specificity scores between k-mers on the reference and alternative alleles could be used to predict the effect of variants on TF binding, as validated by in vitro and in vivo assay datasets. Finally, we demonstrated that binding specificity score differences can be used to interpret disease-associated non-coding single-nucleotide polymorphisms (SNPs) as TF-affecting SNPs and provide candidates responsible for TFs and cell types. CONCLUSIONS Our study provides a basis for investigating the regulation of gene expression in a TF-, TF family-, or cell-type-dependent manner. Furthermore, our differential analysis of binding-specificity scores highlights noncoding disease-associated variants in humans.
Collapse
Affiliation(s)
- Saeko Tahara
- Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
- School of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
| | - Takaho Tsuchiya
- Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
- Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan
| | - Hirotaka Matsumoto
- School of Information and Data Sciences, Nagasaki University, 1-14, Bunkyo-Machi, Nagasaki City, Nagasaki, 852-8521, Japan
- Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics, Wako, Saitama, 351-0198, Japan
| | - Haruka Ozaki
- Bioinformatics Laboratory, Institute of Medicine, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan.
- Center for Artificial Intelligence Research, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, Ibaraki, 305-8577, Japan.
- Laboratory for Bioinformatics Research, RIKEN Center for Biosystems Dynamics, Wako, Saitama, 351-0198, Japan.
| |
Collapse
|
4
|
Yang M, Ali O, Bjørås M, Wang J. Identifying functional regulatory mutation blocks by integrating genome sequencing and transcriptome data. iScience 2023; 26:107266. [PMID: 37520692 PMCID: PMC10371843 DOI: 10.1016/j.isci.2023.107266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2022] [Revised: 04/05/2023] [Accepted: 06/28/2023] [Indexed: 08/01/2023] Open
Abstract
Millions of single nucleotide variants (SNVs) exist in the human genome; however, it remains challenging to identify functional SNVs associated with diseases. We propose a non-encoding SNVs analysis tool bpb3, BayesPI-BAR version 3, aiming to identify the functional mutation blocks (FMBs) by integrating genome sequencing and transcriptome data. The identified FMBs display high frequency SNVs, significant changes in transcription factors (TFs) binding affinity and are nearby the regulatory regions of differentially expressed genes. A two-level Bayesian approach with a biophysical model for protein-DNA interactions is implemented, to compute TF-DNA binding affinity changes based on clustered position weight matrices (PWMs) from over 1700 TF-motifs. The epigenetic data, such as the DNA methylome can also be integrated to scan FMBs. By testing the datasets from follicular lymphoma and melanoma, bpb3 automatically and robustly identifies FMBs, demonstrating that bpb3 can provide insight into patho-mechanisms, and therapeutic targets from transcriptomic and genomic data.
Collapse
Affiliation(s)
- Mingyi Yang
- Department of Microbiology, Oslo University Hospital and University of Oslo, Oslo, Norway
- Department of Medical Biochemistry, Oslo University Hospital and University of Oslo, Oslo, Norway
| | - Omer Ali
- Department of Pathology, Oslo University Hospital - Norwegian Radium Hospital, Oslo, Norway
- Faculty of Medicine, University of Oslo, Oslo, Norway
| | - Magnar Bjørås
- Department of Microbiology, Oslo University Hospital and University of Oslo, Oslo, Norway
- Department of Clinical and Molecular Medicine, Norwegian University of Science and Technology, Trondheim, Norway
| | - Junbai Wang
- Department of Clinical Molecular Biology (EpiGen), Akershus University Hospital and University of Oslo, Lørenskog, Norway
| |
Collapse
|
5
|
Luo K, Zhong J, Safi A, Hong LK, Tewari AK, Song L, Reddy TE, Ma L, Crawford GE, Hartemink AJ. Profiling the quantitative occupancy of myriad transcription factors across conditions by modeling chromatin accessibility data. Genome Res 2022; 32:1183-1198. [PMID: 35609992 PMCID: PMC9248881 DOI: 10.1101/gr.272203.120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/30/2020] [Accepted: 05/06/2022] [Indexed: 11/24/2022]
Abstract
Over a thousand different transcription factors (TFs) bind with varying occupancy across the human genome. Chromatin immunoprecipitation (ChIP) can assay occupancy genome-wide, but only one TF at a time, limiting our ability to comprehensively observe the TF occupancy landscape, let alone quantify how it changes across conditions. We developed TF occupancy profiler (TOP), a Bayesian hierarchical regression framework, to profile genome-wide quantitative occupancy of numerous TFs using data from a single chromatin accessibility experiment (DNase- or ATAC-seq). TOP is supervised, and its hierarchical structure allows it to predict the occupancy of any sequence-specific TF, even those never assayed with ChIP. We used TOP to profile the quantitative occupancy of hundreds of sequence-specific TFs at sites throughout the genome and examined how their occupancies changed in multiple contexts: in approximately 200 human cell types, through 12 h of exposure to different hormones, and across the genetic backgrounds of 70 individuals. TOP enables cost-effective exploration of quantitative changes in the landscape of TF binding.
Collapse
Affiliation(s)
- Kaixuan Luo
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
- Department of Human Genetics, The University of Chicago, Chicago, Illinois 60637, USA
| | - Jianling Zhong
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
| | - Alexias Safi
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
| | - Linda K Hong
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
| | - Alok K Tewari
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
| | - Lingyun Song
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
| | - Timothy E Reddy
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Biostatistics and Bioinformatics, Durham, North Carolina 27710, USA
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, North Carolina 27710, USA
- Department of Biomedical Engineering, Duke University, Durham, North Carolina 27708, USA
| | - Li Ma
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Department of Statistical Science, Duke University, Durham, North Carolina 27708, USA
| | - Gregory E Crawford
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
| | - Alexander J Hartemink
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
- Department of Biology, Duke University, Durham, North Carolina 27708, USA
| |
Collapse
|
6
|
Abdalla M, Abdalla M. A general framework for predicting the transcriptomic consequences of non-coding variation and small molecules. PLoS Comput Biol 2022; 18:e1010028. [PMID: 35421087 PMCID: PMC9041867 DOI: 10.1371/journal.pcbi.1010028] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2021] [Revised: 04/26/2022] [Accepted: 03/16/2022] [Indexed: 11/18/2022] Open
Abstract
Genome wide association studies (GWASs) for complex traits have implicated thousands of genetic loci. Most GWAS-nominated variants lie in noncoding regions, complicating the systematic translation of these findings into functional understanding. Here, we leverage convolutional neural networks to assist in this challenge. Our computational framework, peaBrain, models the transcriptional machinery of a tissue as a two-stage process: first, predicting the mean tissue specific abundance of all genes and second, incorporating the transcriptomic consequences of genotype variation to predict individual abundance on a subject-by-subject basis. We demonstrate that peaBrain accounts for the majority (>50%) of variance observed in mean transcript abundance across most tissues and outperforms regularized linear models in predicting the consequences of individual genotype variation. We highlight the validity of the peaBrain model by calculating non-coding impact scores that correlate with nucleotide evolutionary constraint that are also predictive of disease-associated variation and allele-specific transcription factor binding. We further show how these tissue-specific peaBrain scores can be leveraged to pinpoint functional tissues underlying complex traits, outperforming methods that depend on colocalization of eQTL and GWAS signals. We subsequently: (a) derive continuous dense embeddings of genes for downstream applications; (b) highlight the utility of the model in predicting transcriptomic impact of small molecules and shRNA (on par with in vitro experimental replication of external test sets); (c) explore how peaBrain can be used to model difficult-to-study processes (such as neural induction); and (d) identify putatively functional eQTLs that are missed by high-throughput experimental approaches.
Collapse
Affiliation(s)
- Moustafa Abdalla
- Wellcome Trust Centre for Human Genetics, Nuffield Department of Medicine, University of Oxford, Oxford, United Kingdom
- Oxford Centre for Diabetes, Endocrinology and Metabolism, Radcliffe Department of Medicine, University of Oxford, Oxford, United Kingdom
- Computational Statistics and Machine Learning, Department of Statistics, University of Oxford, Oxford, United Kingdom
- Department of Surgery, Harvard Medical School, Boston, Massachusetts, United States of America
- * E-mail: (MA); (MA)
| | - Mohamed Abdalla
- Vector Institute for Artificial Intelligence, Toronto, Canada
- Department of Computer Science, University of Toronto, Toronto, Canada
- * E-mail: (MA); (MA)
| |
Collapse
|
7
|
Jiang L, Guo F, Tang J, Yu H, Ness S, Duan M, Mao P, Zhao YY, Guo Y. SBSA: an online service for somatic binding sequence annotation. Nucleic Acids Res 2022; 50:e4. [PMID: 34606615 PMCID: PMC8500130 DOI: 10.1093/nar/gkab877] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2021] [Revised: 09/10/2021] [Accepted: 09/17/2021] [Indexed: 12/11/2022] Open
Abstract
Efficient annotation of alterations in binding sequences of molecular regulators can help identify novel candidates for mechanisms study and offer original therapeutic hypotheses. In this work, we developed Somatic Binding Sequence Annotator (SBSA) as a full-capacity online tool to annotate altered binding motifs/sequences, addressing diverse types of genomic variants and molecular regulators. The genomic variants can be somatic mutation, single nucleotide polymorphism, RNA editing, etc. The binding motifs/sequences involve transcription factors (TFs), RNA-binding proteins, miRNA seeds, miRNA-mRNA 3'-UTR binding target, or can be any custom motifs/sequences. Compared to similar tools, SBSA is the first to support miRNA seeds and miRNA-mRNA 3'-UTR binding target, and it unprecedentedly implements a personalized genome approach that accommodates joint adjacent variants. SBSA is empowered to support an indefinite species, including preloaded reference genomes for SARS-Cov-2 and 25 other common organisms. We demonstrated SBSA by annotating multi-omics data from over 30,890 human subjects. Of the millions of somatic binding sequences identified, many are with known severe biological repercussions, such as the somatic mutation in TERT promoter region which causes a gained binding sequence for E26 transformation-specific factor (ETS1). We further validated the function of this TERT mutation using experimental data in cancer cells. Availability:http://innovebioinfo.com/Annotation/SBSA/SBSA.php.
Collapse
Affiliation(s)
- Limin Jiang
- Faculty of Life Science & Medicine, Northwest University, No. 229 Taibai North Road, Xi’an 710069, China
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Fei Guo
- School of Computer Science and Technology, College of Intelligence and Computing, Tianjin University, Tianjin 300350, China
| | - Jijun Tang
- Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
| | - Hui Yu
- Comprehensive cancer center, Department of Internal Medicine, University of New Mexico, Albuquerque, NM 87109, USA
| | - Scott Ness
- Comprehensive cancer center, Department of Internal Medicine, University of New Mexico, Albuquerque, NM 87109, USA
| | - Mingrui Duan
- Comprehensive cancer center, Department of Internal Medicine, University of New Mexico, Albuquerque, NM 87109, USA
| | - Peng Mao
- Comprehensive cancer center, Department of Internal Medicine, University of New Mexico, Albuquerque, NM 87109, USA
| | - Ying-Yong Zhao
- Faculty of Life Science & Medicine, Northwest University, No. 229 Taibai North Road, Xi’an 710069, China
| | - Yan Guo
- Comprehensive cancer center, Department of Internal Medicine, University of New Mexico, Albuquerque, NM 87109, USA
| |
Collapse
|
8
|
Li R, Li L, Xu Y, Yang J. Machine learning meets omics: applications and perspectives. Brief Bioinform 2021; 23:6425809. [PMID: 34791021 DOI: 10.1093/bib/bbab460] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/23/2021] [Revised: 09/29/2021] [Accepted: 10/07/2021] [Indexed: 02/07/2023] Open
Abstract
The innovation of biotechnologies has allowed the accumulation of omics data at an alarming rate, thus introducing the era of 'big data'. Extracting inherent valuable knowledge from various omics data remains a daunting problem in bioinformatics. Better solutions often need some kind of more innovative methods for efficient handlings and effective results. Recent advancements in integrated analysis and computational modeling of multi-omics data helped address such needs in an increasingly harmonious manner. The development and application of machine learning have largely advanced our insights into biology and biomedicine and greatly promoted the development of therapeutic strategies, especially for precision medicine. Here, we propose a comprehensive survey and discussion on what happened, is happening and will happen when machine learning meets omics. Specifically, we describe how artificial intelligence can be applied to omics studies and review recent advancements at the interface between machine learning and the ever-widest range of omics including genomics, transcriptomics, proteomics, metabolomics, radiomics, as well as those at the single-cell resolution. We also discuss and provide a synthesis of ideas, new insights, current challenges and perspectives of machine learning in omics.
Collapse
Affiliation(s)
- Rufeng Li
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
| | - Lixin Li
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
| | - Yungang Xu
- School of Electronics and Information, Northwestern Polytechnical University, Xi'an, 710129, China
| | - Juan Yang
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China.,Key Laboratory of Environment and Genes Related to Diseases (Xi'an Jiaotong University), Ministry of Education of China, Xi'an 710061, P. R. China
| |
Collapse
|
9
|
Yao Q, Ferragina P, Reshef Y, Lettre G, Bauer DE, Pinello L. Motif-Raptor: a cell type-specific and transcription factor centric approach for post-GWAS prioritization of causal regulators. Bioinformatics 2021; 37:2103-2111. [PMID: 33532840 PMCID: PMC11025460 DOI: 10.1093/bioinformatics/btab072] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/27/2020] [Revised: 11/30/2020] [Accepted: 01/28/2021] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Genome-wide association studies (GWASs) have identified thousands of common trait-associated genetic variants but interpretation of their function remains challenging. These genetic variants can overlap the binding sites of transcription factors (TFs) and therefore could alter gene expression. However, we currently lack a systematic understanding on how this mechanism contributes to phenotype. RESULTS We present Motif-Raptor, a TF-centric computational tool that integrates sequence-based predictive models, chromatin accessibility, gene expression datasets and GWAS summary statistics to systematically investigate how TF function is affected by genetic variants. Given trait-associated non-coding variants, Motif-Raptor can recover relevant cell types and critical TFs to drive hypotheses regarding their mechanism of action. We tested Motif-Raptor on complex traits such as rheumatoid arthritis and red blood cell count and demonstrated its ability to prioritize relevant cell types, potential regulatory TFs and non-coding SNPs which have been previously characterized and validated. AVAILABILITY AND IMPLEMENTATION Motif-Raptor is freely available as a Python package at: https://github.com/pinellolab/MotifRaptor. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Qiuming Yao
- Department of Pathology, Massachusetts General Hospital, Charlestown, MA 02129, USA
- Division of Hematology/Oncology, Boston Children’s Hospital, Boston, MA 02115, USA
- Harvard Medical School, Boston, MA 02115, USA
| | - Paolo Ferragina
- Department of Computer Science, University of Pisa, Pisa 56128, Italy
| | - Yakir Reshef
- Department of Computer Science, Harvard University, Cambridge, MA 02138, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Guillaume Lettre
- Faculty of Medicine, Université de Montréal, Montreal, Quebec H3C3J7, Canada
- Montreal Heart Institute, Montreal, Quebec H1T1C8, Canada
| | - Daniel E Bauer
- Division of Hematology/Oncology, Boston Children’s Hospital, Boston, MA 02115, USA
- Harvard Medical School, Boston, MA 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Department of Pediatric Oncology, Dana-Farber Cancer Institute, Boston, MA 02115, USA
| | - Luca Pinello
- Department of Pathology, Massachusetts General Hospital, Charlestown, MA 02129, USA
- Harvard Medical School, Boston, MA 02115, USA
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| |
Collapse
|
10
|
Zheng A, Lamkin M, Zhao H, Wu C, Su H, Gymrek M. Deep neural networks identify sequence context features predictive of transcription factor binding. NAT MACH INTELL 2021; 3:172-180. [PMID: 33796819 PMCID: PMC8009085 DOI: 10.1038/s42256-020-00282-y] [Citation(s) in RCA: 44] [Impact Index Per Article: 11.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2019] [Accepted: 12/10/2020] [Indexed: 12/11/2022]
Abstract
Transcription factors (TFs) bind DNA by recognizing specific sequence motifs, typically of length 6-12bp. A motif can occur many thousands of times in the human genome, but only a subset of those sites are actually bound. Here we present a machine learning framework leveraging existing convolutional neural network architectures and model interpretation techniques to identify and interpret sequence context features most important for predicting whether a particular motif instance will be bound. We apply our framework to predict binding at motifs for 38 TFs in a lymphoblastoid cell line, score the importance of context sequences at base-pair resolution, and characterize context features most predictive of binding. We find that the choice of training data heavily influences classification accuracy and the relative importance of features such as open chromatin. Overall, our framework enables novel insights into features predictive of TF binding and is likely to inform future deep learning applications to interpret non-coding genetic variants.
Collapse
Affiliation(s)
- An Zheng
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA USA
| | - Michael Lamkin
- Department of Bioengineering, University of California San Diego, La Jolla, CA USA
| | - Hanqing Zhao
- Department of Biology, University of California San Diego, La Jolla, CA, USA
| | - Cynthia Wu
- Bioinformatics and Systems Biology Program, University of California San Diego, La Jolla, CA USA
| | - Hao Su
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA USA
| | - Melissa Gymrek
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA USA
- Department of Medicine, University of California San Diego, La Jolla, CA USA
| |
Collapse
|
11
|
Krützfeldt LM, Schubach M, Kircher M. The impact of different negative training data on regulatory sequence predictions. PLoS One 2020; 15:e0237412. [PMID: 33259518 PMCID: PMC7707526 DOI: 10.1371/journal.pone.0237412] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2020] [Accepted: 11/12/2020] [Indexed: 01/08/2023] Open
Abstract
Regulatory regions, like promoters and enhancers, cover an estimated 5–15% of the human genome. Changes to these sequences are thought to underlie much of human phenotypic variation and a substantial proportion of genetic causes of disease. However, our understanding of their functional encoding in DNA is still very limited. Applying machine or deep learning methods can shed light on this encoding and gapped k-mer support vector machines (gkm-SVMs) or convolutional neural networks (CNNs) are commonly trained on putative regulatory sequences. Here, we investigate the impact of negative sequence selection on model performance. By training gkm-SVM and CNN models on open chromatin data and corresponding negative training dataset, both learners and two approaches for negative training data are compared. Negative sets use either genomic background sequences or sequence shuffles of the positive sequences. Model performance was evaluated on three different tasks: predicting elements active in a cell-type, predicting cell-type specific elements, and predicting elements' relative activity as measured from independent experimental data. Our results indicate strong effects of the negative training data, with genomic backgrounds showing overall best results. Specifically, models trained on highly shuffled sequences perform worse on the complex tasks of tissue-specific activity and quantitative activity prediction, and seem to learn features of artificial sequences rather than regulatory activity. Further, we observe that insufficient matching of genomic background sequences results in model biases. While CNNs achieved and exceeded the performance of gkm-SVMs for larger training datasets, gkm-SVMs gave robust and best results for typical training dataset sizes without the need of hyperparameter optimization.
Collapse
Affiliation(s)
- Louisa-Marie Krützfeldt
- Charité–Universitätsmedizin Berlin, Berlin, Germany
- Berlin Institute of Health (BIH), Berlin, Germany
| | - Max Schubach
- Charité–Universitätsmedizin Berlin, Berlin, Germany
- Berlin Institute of Health (BIH), Berlin, Germany
| | - Martin Kircher
- Charité–Universitätsmedizin Berlin, Berlin, Germany
- Berlin Institute of Health (BIH), Berlin, Germany
- * E-mail:
| |
Collapse
|
12
|
Wang M, Wang D, Zhang K, Ngo V, Fan S, Wang W. Motto: Representing Motifs in Consensus Sequences with Minimum Information Loss. Genetics 2020; 216:353-358. [PMID: 32816922 PMCID: PMC7536857 DOI: 10.1534/genetics.120.303597] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2020] [Accepted: 08/17/2020] [Indexed: 11/21/2022] Open
Abstract
Sequence analysis frequently requires intuitive understanding and convenient representation of motifs. Typically, motifs are represented as position weight matrices (PWMs) and visualized using sequence logos. However, in many scenarios, in order to interpret the motif information or search for motif matches, it is compact and sufficient to represent motifs by wildcard-style consensus sequences (such as [GC][AT]GATAAG[GAC]). Based on mutual information theory and Jensen-Shannon divergence, we propose a mathematical framework to minimize the information loss in converting PWMs to consensus sequences. We name this representation as sequence Motto and have implemented an efficient algorithm with flexible options for converting motif PWMs into Motto from nucleotides, amino acids, and customized characters. We show that this representation provides a simple and efficient way to identify the binding sites of 1156 common transcription factors (TFs) in the human genome. The effectiveness of the method was benchmarked by comparing sequence matches found by Motto with PWM scanning results found by FIMO. On average, our method achieves a 0.81 area under the precision-recall curve, significantly (P-value < 0.01) outperforming all existing methods, including maximal positional weight, Cavener's method, and minimal mean square error. We believe this representation provides a distilled summary of a motif, as well as the statistical justification.
Collapse
Affiliation(s)
- Mengchi Wang
- Bioinformatics and Systems Biology, University of California at San Diego, La Jolla, California 92093
| | - David Wang
- Department of Chemistry and Biochemistry, University of California at San Diego, La Jolla, California 92093
| | - Kai Zhang
- Bioinformatics and Systems Biology, University of California at San Diego, La Jolla, California 92093
| | - Vu Ngo
- Bioinformatics and Systems Biology, University of California at San Diego, La Jolla, California 92093
| | - Shicai Fan
- Department of Chemistry and Biochemistry, University of California at San Diego, La Jolla, California 92093
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, China 610054
| | - Wei Wang
- Bioinformatics and Systems Biology, University of California at San Diego, La Jolla, California 92093
- Department of Chemistry and Biochemistry, University of California at San Diego, La Jolla, California 92093
- Department of Cellular and Molecular Medicine, University of California at San Diego, La Jolla, California 92093
| |
Collapse
|
13
|
Kreimer A, Yan Z, Ahituv N, Yosef N. Meta-analysis of massively parallel reporter assays enables prediction of regulatory function across cell types. Hum Mutat 2019; 40:1299-1313. [PMID: 31131957 PMCID: PMC6771677 DOI: 10.1002/humu.23820] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/16/2019] [Revised: 05/18/2019] [Accepted: 05/24/2019] [Indexed: 01/01/2023]
Abstract
Deciphering the potential of noncoding loci to influence gene regulation has been the subject of intense research, with important implications in understanding genetic underpinnings of human diseases. Massively parallel reporter assays (MPRAs) can measure regulatory activity of thousands of DNA sequences and their variants in a single experiment. With increasing number of publically available MPRA data sets, one can now develop data-driven models which, given a DNA sequence, predict its regulatory activity. Here, we performed a comprehensive meta-analysis of several MPRA data sets in a variety of cellular contexts. We first applied an ensemble of methods to predict MPRA output in each context and observed that the most predictive features are consistent across data sets. We then demonstrate that predictive models trained in one cellular context can be used to predict MPRA output in another, with loss of accuracy attributed to cell-type-specific features. Finally, we show that our approach achieves top performance in the Fifth Critical Assessment of Genome Interpretation "Regulation Saturation" Challenge for predicting effects of single-nucleotide variants. Overall, our analysis provides insights into how MPRA data can be leveraged to highlight functional regulatory regions throughout the genome and can guide effective design of future experiments by better prioritizing regions of interest.
Collapse
Affiliation(s)
- Anat Kreimer
- Department of Electrical Engineering and Computer Sciences, Center for Computational BiologyUniversity of CaliforniaBerkeleyCalifornia
- Department of Bioengineering and Therapeutic SciencesUniversity of California, San FranciscoSan FranciscoCalifornia
| | - Zhongxia Yan
- Department of Electrical Engineering and Computer Sciences, Center for Computational BiologyUniversity of CaliforniaBerkeleyCalifornia
| | - Nadav Ahituv
- Department of Bioengineering and Therapeutic SciencesUniversity of California, San FranciscoSan FranciscoCalifornia
| | - Nir Yosef
- Department of Electrical Engineering and Computer Sciences, Center for Computational BiologyUniversity of CaliforniaBerkeleyCalifornia
- Ragon Institute of MGH MIT and HarvardCambridgeMassachusetts
- Chan Zuckerberg BiohubSan FranciscoCalifornia
| |
Collapse
|
14
|
Xie X, Hanson C, Sinha S. Mechanistic interpretation of non-coding variants for discovering transcriptional regulators of drug response. BMC Biol 2019; 17:62. [PMID: 31362726 PMCID: PMC6664756 DOI: 10.1186/s12915-019-0679-8] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/14/2019] [Accepted: 07/09/2019] [Indexed: 12/21/2022] Open
Abstract
BACKGROUND Identification of functional non-coding variants and their mechanistic interpretation is a major challenge of modern genomics, especially for precision medicine. Transcription factor (TF) binding profiles and epigenomic landscapes in reference samples allow functional annotation of the genome, but do not provide ready answers regarding the effects of non-coding variants on phenotypes. A promising computational approach is to build models that predict TF-DNA binding from sequence, and use such models to score a variant's impact on TF binding strength. Here, we asked if this mechanistic approach to variant interpretation can be combined with information on genotype-phenotype associations to discover transcription factors regulating phenotypic variation among individuals. RESULTS We developed a statistical approach that integrates phenotype, genotype, gene expression, TF ChIP-seq, and Hi-C chromatin interaction data to answer this question. Using drug sensitivity of lymphoblastoid cell lines as the phenotype of interest, we tested if non-coding variants statistically linked to the phenotype are enriched for strong predicted impact on DNA binding strength of a TF and thus identified TFs regulating individual differences in the phenotype. Our approach relies on a new method for predicting variant impact on TF-DNA binding that uses a combination of biophysical modeling and machine learning. We report statistical and literature-based support for many of the TFs discovered here as regulators of drug response variation. We show that the use of mechanistically driven variant impact predictors can identify TF-drug associations that would otherwise be missed. We examined in depth one reported association-that of the transcription factor ELF1 with the drug doxorubicin-and identified several genes that may mediate this regulatory relationship. CONCLUSION Our work represents initial steps in utilizing predictions of variant impact on TF binding sites for discovery of regulatory mechanisms underlying phenotypic variation. Future advances on this topic will be greatly beneficial to the reconstruction of phenotype-associated gene regulatory networks.
Collapse
Affiliation(s)
- Xiaoman Xie
- Center for Biophysics and Quantitative Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
| | - Casey Hanson
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA
| | - Saurabh Sinha
- Department of Computer Science, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA. .,Institute of Genomic Biology, University of Illinois Urbana-Champaign, Urbana, IL, 61801, USA.
| |
Collapse
|
15
|
Li S, Kvon EZ, Visel A, Pennacchio LA, Ovcharenko I. Stable enhancers are active in development, and fragile enhancers are associated with evolutionary adaptation. Genome Biol 2019; 20:140. [PMID: 31307522 PMCID: PMC6631995 DOI: 10.1186/s13059-019-1750-z] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/25/2019] [Accepted: 06/28/2019] [Indexed: 12/13/2022] Open
Abstract
Background Despite continual progress in the identification and characterization of trait- and disease-associated variants that disrupt transcription factor (TF)-DNA binding, little is known about the distribution of TF binding deactivating mutations (deMs) in enhancer sequences. Here, we focus on elucidating the mechanism underlying the different densities of deMs in human enhancers. Results We identify two classes of enhancers based on the density of nucleotides prone to deMs. Firstly, fragile enhancers with abundant deM nucleotides are associated with the immune system and regular cellular maintenance. Secondly, stable enhancers with only a few deM nucleotides are associated with the development and regulation of TFs and are evolutionarily conserved. These two classes of enhancers feature different regulatory programs: the binding sites of pioneer TFs of FOX family are specifically enriched in stable enhancers, while tissue-specific TFs are enriched in fragile enhancers. Moreover, stable enhancers are more tolerant of deMs due to their dominant employment of homotypic TF binding site (TFBS) clusters, as opposed to the larger-extent usage of heterotypic TFBS clusters in fragile enhancers. Notably, the sequence environment and chromatin context of the cognate motif, other than the motif itself, contribute more to the susceptibility to deMs of TF binding. Conclusions This dichotomy of enhancer activity is conserved across different tissues, has a specific footprint in epigenetic profiles, and argues for a bimodal evolution of gene regulatory programs in vertebrates. Specifically encoded stable enhancers are evolutionarily conserved and associated with development, while differently encoded fragile enhancers are associated with the adaptation of species. Electronic supplementary material The online version of this article (10.1186/s13059-019-1750-z) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Shan Li
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20892, USA
| | - Evgeny Z Kvon
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
| | - Axel Visel
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.,United States Department of Energy Joint Genome Institute, Walnut Creek, CA, 94598, USA.,School of Natural Sciences, University of California, Merced, CA, 95343, USA
| | - Len A Pennacchio
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA.,United States Department of Energy Joint Genome Institute, Walnut Creek, CA, 94598, USA.,Comparative Biochemistry Program, University of California, Berkeley, CA, 94720, USA
| | - Ivan Ovcharenko
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 20892, USA.
| |
Collapse
|
16
|
Batmanov K, Delabie J, Wang J. BayesPI-BAR2: A New Python Package for Predicting Functional Non-coding Mutations in Cancer Patient Cohorts. Front Genet 2019; 10:282. [PMID: 31001324 PMCID: PMC6454009 DOI: 10.3389/fgene.2019.00282] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2018] [Accepted: 03/15/2019] [Indexed: 12/21/2022] Open
Abstract
Most of somatic mutations in cancer occur outside of gene coding regions. These mutations may disrupt the gene regulation by affecting protein-DNA interaction. A study of these disruptions is important in understanding tumorigenesis. However, current computational tools process DNA sequence variants individually, when predicting the effect on protein-DNA binding. Thus, it is a daunting task to identify functional regulatory disturbances among thousands of mutations in a patient. Previously, we have reported and validated a pipeline for identifying functional non-coding somatic mutations in cancer patient cohorts, by integrating diverse information such as gene expression, spatial distribution of the mutations, and a biophysical model for estimating protein binding affinity. Here, we present a new user-friendly Python package BayesPI-BAR2 based on the proposed pipeline for integrative whole-genome sequence analysis. This may be the first prediction package that considers information from both multiple mutations and multiple patients. It is evaluated in follicular lymphoma and skin cancer patients, by focusing on sequence variants in gene promoter regions. BayesPI-BAR2 is a useful tool for predicting functional non-coding mutations in whole genome sequencing data: it allows identification of novel transcription factors (TFs) whose binding is altered by non-coding mutations in cancer. BayesPI-BAR2 program can analyze multiple datasets of genome-wide mutations at once and generate concise, easily interpretable reports for potentially affected gene regulatory sites. The package is freely available at http://folk.uio.no/junbaiw/BayesPI-BAR2/.
Collapse
Affiliation(s)
- Kirill Batmanov
- Department of Pathology, Norwegian Radium Hospital, Oslo University Hospital, Oslo, Norway
| | - Jan Delabie
- Department of Pathology, University Health Network, Toronto, ON, Canada
| | - Junbai Wang
- Department of Pathology, Norwegian Radium Hospital, Oslo University Hospital, Oslo, Norway
| |
Collapse
|
17
|
Recurrent Neural Network for Predicting Transcription Factor Binding Sites. Sci Rep 2018; 8:15270. [PMID: 30323198 PMCID: PMC6189047 DOI: 10.1038/s41598-018-33321-1] [Citation(s) in RCA: 111] [Impact Index Per Article: 15.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/27/2018] [Accepted: 09/25/2018] [Indexed: 12/23/2022] Open
Abstract
It is well known that DNA sequence contains a certain amount of transcription factors (TF) binding sites, and only part of them are identified through biological experiments. However, these experiments are expensive and time-consuming. To overcome these problems, some computational methods, based on k-mer features or convolutional neural networks, have been proposed to identify TF binding sites from DNA sequences. Although these methods have good performance, the context information that relates to TF binding sites is still lacking. Research indicates that standard recurrent neural networks (RNN) and its variants have better performance in time-series data compared with other models. In this study, we propose a model, named KEGRU, to identify TF binding sites by combining Bidirectional Gated Recurrent Unit (GRU) network with k-mer embedding. Firstly, DNA sequences are divided into k-mer sequences with a specified length and stride window. And then, we treat each k-mer as a word and pre-trained word representation model though word2vec algorithm. Thirdly, we construct a deep bidirectional GRU model for feature learning and classification. Experimental results have shown that our method has better performance compared with some state-of-the-art methods. Additional experiments about embedding strategy show that k-mer embedding will be helpful to enhance model performance. The robustness of KEGRU is proved by experiments with different k-mer length, stride window and embedding vector dimension.
Collapse
|
18
|
Reshef YA, Finucane HK, Kelley DR, Gusev A, Kotliar D, Ulirsch JC, Hormozdiari F, Nasser J, O'Connor L, van de Geijn B, Loh PR, Grossman SR, Bhatia G, Gazal S, Palamara PF, Pinello L, Patterson N, Adams RP, Price AL. Detecting genome-wide directional effects of transcription factor binding on polygenic disease risk. Nat Genet 2018; 50:1483-1493. [PMID: 30177862 PMCID: PMC6202062 DOI: 10.1038/s41588-018-0196-7] [Citation(s) in RCA: 45] [Impact Index Per Article: 6.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2017] [Accepted: 07/11/2018] [Indexed: 12/19/2022]
Abstract
Biological interpretation of genome-wide association study data frequently involves assessing whether SNPs linked to a biological process, for example, binding of a transcription factor, show unsigned enrichment for disease signal. However, signed annotations quantifying whether each SNP allele promotes or hinders the biological process can enable stronger statements about disease mechanism. We introduce a method, signed linkage disequilibrium profile regression, for detecting genome-wide directional effects of signed functional annotations on disease risk. We validate the method via simulations and application to molecular quantitative trait loci in blood, recovering known transcriptional regulators. We apply the method to expression quantitative trait loci in 48 Genotype-Tissue Expression tissues, identifying 651 transcription factor-tissue associations including 30 with robust evidence of tissue specificity. We apply the method to 46 diseases and complex traits (average n = 290 K), identifying 77 annotation-trait associations representing 12 independent transcription factor-trait associations, and characterize the underlying transcriptional programs using gene-set enrichment analyses. Our results implicate new causal disease genes and new disease mechanisms.
Collapse
Affiliation(s)
- Yakir A Reshef
- Department of Computer Science, Harvard University, Cambridge, MA, USA.
- Harvard/MIT MD/PhD Program, Boston, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
| | | | - David R Kelley
- California Life Sciences LLC, South San Francisco, CA, USA
| | | | - Dylan Kotliar
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Jacob C Ulirsch
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Dana Farber Cancer Institute, Boston, MA, USA
- Boston Children's Hospital, Boston, MA, USA
| | - Farhad Hormozdiari
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Joseph Nasser
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Luke O'Connor
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Program in Bioinformatics and Integrative Genomics, Harvard University, Cambridge, MA, USA
| | - Bryce van de Geijn
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Po-Ru Loh
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
| | - Sharon R Grossman
- Harvard/MIT MD/PhD Program, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Gaurav Bhatia
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Steven Gazal
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Pier Francesco Palamara
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Department of Statistics, University of Oxford, Oxford, UK
| | - Luca Pinello
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Massachusetts General Hospital, Charlestown, MA, USA
- Department of Pathology, Harvard Medical School, Boston, MA, USA
| | | | - Ryan P Adams
- Google Brain, New York, NY, USA
- Department of Computer Science, Princeton University, Princeton, NJ, USA
| | - Alkes L Price
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
| |
Collapse
|
19
|
Wong KC. DNA Motif Recognition Modeling from Protein Sequences. iScience 2018; 7:198-211. [PMID: 30267681 PMCID: PMC6153143 DOI: 10.1016/j.isci.2018.09.003] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2018] [Revised: 08/08/2018] [Accepted: 09/04/2018] [Indexed: 12/31/2022] Open
Abstract
Although the existing works on DNA motif discovery on DNA sequences are plethoric, mechanistic knowledge to infer DNA motifs from protein sequences across multiple DNA-binding domain families without conducting any wet-lab experiments is still lacking. Therefore, the k-spectrum recognition modeling is proposed to address the issues at the highest possible resolutions. The k-spectrum model can capture DNA motif patterns from protein sequences at the resolution in which local sequence context and nucleotide dependency can be taken into account completely. Multiple evaluation metrics are adopted and measured on millions of k-mer binding intensities from 92 proteins across 5 DNA-binding families (i.e., bHLH, bZIP, ETS, Forkhead, and Homeodomain), demonstrating its competitive edges. In addition, it not only can contribute to DNA motif recognition modeling but also can help prioritize the observed or even unobserved binding of single nucleotide variants on transcription factor binding sites in a genome-wide manner. DNA motif modeling from protein is fundamental for understanding gene regulation A framework is proposed at the highest possible sequence resolution for the first time It is validated on millions of k-mer intensities from 92 proteins across 5 families It can prioritize the unobserved regulatory single nucleotide variants on DNA motifs
Collapse
Affiliation(s)
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong.
| |
Collapse
|
20
|
Abstract
Transcription is regulated by transcription factor (TF) binding at promoters and distal regulatory elements and histone modifications that control the accessibility of these elements. Chromatin immunoprecipitation followed by sequencing (ChIP-seq) has become the standard assay for identifying genome-wide protein-DNA interactions in vitro and in vivo. As large-scale ChIP-seq data sets have been collected for different TFs and histone modifications, their potential to predict gene expression can be used to test hypotheses about the mechanisms of gene regulation. In addition, complementary functional genomics assays provide a global view of chromatin accessibility and long-range cis-regulatory interactions that are being combined with TF binding and histone remodeling to study the regulation of gene expression. Thus, ChIP-seq analysis is now widely integrated with other functional genomics assays to better understand gene regulatory mechanisms. In this review, we discuss advances and challenges in integrating ChIP-seq data to identify context-specific chromatin states associated with gene activity. We describe the overall computational design of integrating ChIP-seq data with other functional genomics assays. We also discuss the challenges of extending these methods to low-input ChIP-seq assays and related single-cell assays.
Collapse
Affiliation(s)
| | - Ali Mortazavi
- Corresponding author: Ali Mortazavi, Department of Developmental and Cell Biology, 2300 Biological Sciences 3, University of California, Irvine, CA 92697, USA. Tel: (949)824-6762; E-mail:
| |
Collapse
|
21
|
Lee W, Park B, Han K. Sequence-based prediction of putative transcription factor binding sites in DNA sequences of any length. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2017; 15:1461-1469. [PMID: 29990126 DOI: 10.1109/tcbb.2017.2773075] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
A transcription factor (TF) is a protein that regulates gene expression by binding to specific DNA sequences. Despite the recent advances in experimental techniques for identifying transcription factor binding sites (TFBS) in DNA sequences, a large number of TFBS are to be unveiled in many species. Several computational methods developed for predicting TFBS in DNA are tissue- or species-specific methods, so cannot be used without prior knowledge of tissue or species. Some computational methods are applicable to finding TFBS in short DNA sequences only. In this paper we propose a new learning method for predicting TFBS in DNA of any length using the composition, transition and distribution of nucleotides and amino acids in DNA and TF sequences. In independent testing of the method on datasets that were not used in training the method, its accuracy and MCC were as high as 81.84% and 0.634, respectively. The proposed method can be a useful aid for selecting potential TFBS in a large amount of DNA sequences before conducting biochemical experiments to empirically determine TFBS. The program and data sets are available at http://bclab.inha.ac.kr/TFbinding.
Collapse
|
22
|
Schwessinger R, Suciu MC, McGowan SJ, Telenius J, Taylor S, Higgs DR, Hughes JR. Sasquatch: predicting the impact of regulatory SNPs on transcription factor binding from cell- and tissue-specific DNase footprints. Genome Res 2017; 27:1730-1742. [PMID: 28904015 PMCID: PMC5630036 DOI: 10.1101/gr.220202.117] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2017] [Accepted: 08/07/2017] [Indexed: 12/22/2022]
Abstract
In the era of genome-wide association studies (GWAS) and personalized medicine, predicting the impact of single nucleotide polymorphisms (SNPs) in regulatory elements is an important goal. Current approaches to determine the potential of regulatory SNPs depend on inadequate knowledge of cell-specific DNA binding motifs. Here, we present Sasquatch, a new computational approach that uses DNase footprint data to estimate and visualize the effects of noncoding variants on transcription factor binding. Sasquatch performs a comprehensive k-mer-based analysis of DNase footprints to determine any k-mer's potential for protein binding in a specific cell type and how this may be changed by sequence variants. Therefore, Sasquatch uses an unbiased approach, independent of known transcription factor binding sites and motifs. Sasquatch only requires a single DNase-seq data set per cell type, from any genotype, and produces consistent predictions from data generated by different experimental procedures and at different sequence depths. Here we demonstrate the effectiveness of Sasquatch using previously validated functional SNPs and benchmark its performance against existing approaches. Sasquatch is available as a versatile webtool incorporating publicly available data, including the human ENCODE collection. Thus, Sasquatch provides a powerful tool and repository for prioritizing likely regulatory SNPs in the noncoding genome.
Collapse
Affiliation(s)
- Ron Schwessinger
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| | - Maria C Suciu
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| | - Simon J McGowan
- Computational Biology Research Group, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| | - Jelena Telenius
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| | - Stephen Taylor
- Computational Biology Research Group, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| | - Doug R Higgs
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| | - Jim R Hughes
- MRC Molecular Haematology Unit, MRC Weatherall Institute of Molecular Medicine, Oxford OX3 9DS, United Kingdom
| |
Collapse
|
23
|
Kreimer A, Zeng H, Edwards MD, Guo Y, Tian K, Shin S, Welch R, Wainberg M, Mohan R, Sinnott-Armstrong NA, Li Y, Eraslan G, AMIN TB, Goke J, Mueller NS, Kellis M, Kundaje A, Beer MA, Keles S, Gifford DK, Yosef N. Predicting gene expression in massively parallel reporter assays: A comparative study. Hum Mutat 2017; 38:1240-1250. [PMID: 28220625 PMCID: PMC5560998 DOI: 10.1002/humu.23197] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2016] [Revised: 01/19/2017] [Accepted: 02/12/2017] [Indexed: 02/03/2023]
Abstract
In many human diseases, associated genetic changes tend to occur within noncoding regions, whose effect might be related to transcriptional control. A central goal in human genetics is to understand the function of such noncoding regions: given a region that is statistically associated with changes in gene expression (expression quantitative trait locus [eQTL]), does it in fact play a regulatory role? And if so, how is this role "coded" in its sequence? These questions were the subject of the Critical Assessment of Genome Interpretation eQTL challenge. Participants were given a set of sequences that flank eQTLs in humans and were asked to predict whether these are capable of regulating transcription (as evaluated by massively parallel reporter assays), and whether this capability changes between alternative alleles. Here, we report lessons learned from this community effort. By inspecting predictive properties in isolation, and conducting meta-analysis over the competing methods, we find that using chromatin accessibility and transcription factor binding as features in an ensemble of classifiers or regression models leads to the most accurate results. We then characterize the loci that are harder to predict, putting the spotlight on areas of weakness, which we expect to be the subject of future studies.
Collapse
Affiliation(s)
- Anat Kreimer
- Department of Electrical Engineering and Computer Science and Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Department of Bioengineering and Therapeutic Sciences, Institute for Human Genetics, University of California, San Francisco, San Francisco, California, USA
| | - Haoyang Zeng
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
| | - Matthew D. Edwards
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
| | - Yuchun Guo
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
| | - Kevin Tian
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
| | - Sunyoung Shin
- Department of Statistics, Department of Biostatistics and Medical Informatics University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - Rene Welch
- Department of Statistics, Department of Biostatistics and Medical Informatics University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - Michael Wainberg
- Department of Genetics, Stanford University School of Medicine, Department of Computer Science, Stanford, California 94305, USA
| | - Rahul Mohan
- Department of Genetics, Stanford University School of Medicine, Department of Computer Science, Stanford, California 94305, USA
| | - Nicholas A. Sinnott-Armstrong
- Department of Genetics, Stanford University School of Medicine, Department of Computer Science, Stanford, California 94305, USA
| | - Yue Li
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, Massachusetts 02139, USA
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, Massachusetts 02139, USA
| | - Gökcen Eraslan
- Computational Cell Maps, Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstr. 1 85764 Neuherberg, Germany
| | - Talal Bin AMIN
- Computational and Systems Biology, Genome Institute of Singapore, Singapore 138672, Singapore
| | - Jonathan Goke
- Computational and Systems Biology, Genome Institute of Singapore, Singapore 138672, Singapore
| | - Nikola S. Mueller
- Computational Cell Maps, Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstr. 1 85764 Neuherberg, Germany
| | - Manolis Kellis
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, Massachusetts 02139, USA
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, 32 Vassar St, Cambridge, Massachusetts 02139, USA
| | - Anshul Kundaje
- Department of Genetics, Stanford University School of Medicine, Department of Computer Science, Stanford, California 94305, USA
| | - Michael A Beer
- McKusick-Nathans Institute of Genetic Medicine, Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Sunduz Keles
- Department of Statistics, Department of Biostatistics and Medical Informatics University of Wisconsin-Madison, Madison, Wisconsin, USA
| | - David K. Gifford
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02142, USA
| | - Nir Yosef
- Department of Electrical Engineering and Computer Science and Center for Computational Biology, University of California, Berkeley, Berkeley, CA 94720, USA
- Ragon Institute of Massachusetts General Hospital, MIT and Harvard, Cambridge, MA, 02139
| |
Collapse
|
24
|
Nikolic M, Papantonis A, Rada-Iglesias A. GARLIC: a bioinformatic toolkit for aetiologically connecting diseases and cell type-specific regulatory maps. Hum Mol Genet 2017; 26:742-752. [PMID: 28007912 PMCID: PMC5409087 DOI: 10.1093/hmg/ddw423] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/03/2016] [Accepted: 12/12/2016] [Indexed: 12/28/2022] Open
Abstract
Genome-wide association studies (GWAS) have emerged as a powerful tool to uncover the genetic basis of human common diseases, which often show a complex, polygenic and multi-factorial aetiology. These studies have revealed that 70–90% of all single nucleotide polymorphisms (SNPs) associated with common complex diseases do not occur within genes (i.e. they are non-coding), making the discovery of disease-causative genetic variants and the elucidation of the underlying pathological mechanisms far from straightforward. Based on emerging evidences suggesting that disease-associated SNPs are frequently found within cell type-specific regulatory sequences, here we present GARLIC (GWAS-based Prediction Toolkit for Connecting Diseases and Cell Types), a user-friendly, multi-purpose software with an associated database and online viewer that, using global maps of cis-regulatory elements, can aetiologically connect human diseases with relevant cell types. Additionally, GARLIC can be used to retrieve potential disease-causative genetic variants overlapping regulatory sequences of interest. Overall, GARLIC can satisfy several important needs within the field of medical genetics, thus potentially assisting in the ultimate goal of uncovering the elusive and complex genetic basis of common human disorders.
Collapse
Affiliation(s)
- Miloš Nikolic
- Center for Molecular Medicine Cologne (CMMC), Robert-Koch-Str. 21, 50931 Cologne, Germany.,The Cologne Cluster of Excellence in Cellular Stress Responses in Aging-associated Diseases (CECAD), Joseph-Stelzmann-Straße 26, 50931 Cologne, Germany
| | - Argyris Papantonis
- Center for Molecular Medicine Cologne (CMMC), Robert-Koch-Str. 21, 50931 Cologne, Germany
| | - Alvaro Rada-Iglesias
- Center for Molecular Medicine Cologne (CMMC), Robert-Koch-Str. 21, 50931 Cologne, Germany.,The Cologne Cluster of Excellence in Cellular Stress Responses in Aging-associated Diseases (CECAD), Joseph-Stelzmann-Straße 26, 50931 Cologne, Germany
| |
Collapse
|
25
|
Zhang H, Zhu L, Huang DS. WSMD: weakly-supervised motif discovery in transcription factor ChIP-seq data. Sci Rep 2017; 7:3217. [PMID: 28607381 PMCID: PMC5468353 DOI: 10.1038/s41598-017-03554-7] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/15/2016] [Accepted: 05/02/2017] [Indexed: 01/24/2023] Open
Abstract
Although discriminative motif discovery (DMD) methods are promising for eliciting motifs from high-throughput experimental data, due to consideration of computational expense, most of existing DMD methods have to choose approximate schemes that greatly restrict the search space, leading to significant loss of predictive accuracy. In this paper, we propose Weakly-Supervised Motif Discovery (WSMD) to discover motifs from ChIP-seq datasets. In contrast to the learning strategies adopted by previous DMD methods, WSMD allows a "global" optimization scheme of the motif parameters in continuous space, thereby reducing the information loss of model representation and improving the quality of resultant motifs. Meanwhile, by exploiting the connection between DMD framework and existing weakly supervised learning (WSL) technologies, we also present highly scalable learning strategies for the proposed method. The experimental results on both real ChIP-seq datasets and synthetic datasets show that WSMD substantially outperforms former DMD methods (including DREME, HOMER, XXmotif, motifRG and DECOD) in terms of predictive accuracy, while also achieving a competitive computational speed.
Collapse
Affiliation(s)
- Hongbo Zhang
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - Lin Zhu
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, College of Electronics and Information Engineering, Tongji University, Shanghai, 201804, P.R. China.
| |
Collapse
|
26
|
Liu Q, Gan M, Jiang R. A sequence-based method to predict the impact of regulatory variants using random forest. BMC SYSTEMS BIOLOGY 2017; 11:7. [PMID: 28361702 PMCID: PMC5374684 DOI: 10.1186/s12918-017-0389-1] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/11/2022]
Abstract
Background Most disease-associated variants identified by genome-wide association studies (GWAS) exist in noncoding regions. In spite of the common agreement that such variants may disrupt biological functions of their hosting regulatory elements, it remains a great challenge to characterize the risk of a genetic variant within the implicated genome sequence. Therefore, it is essential to develop an effective computational model that is not only capable of predicting the potential risk of a genetic variant but also valid in interpreting how the function of the genome is affected with the occurrence of the variant. Results We developed a method named kmerForest that used a random forest classifier with k-mer counts to predict accessible chromatin regions purely based on DNA sequences. We demonstrated that our method outperforms existing methods in distinguishing known accessible chromatin regions from random genomic sequences. Furthermore, the performance of our method can further be improved with the incorporation of sequence conservation features. Based on this model, we assessed importance of the k-mer features by a series of permutation experiments, and we characterized the risk of a single nucleotide polymorphism (SNP) on the function of the genome using the difference between the importance of the k-mer features affected by the occurrence of the SNP. We conducted a series of experiments and showed that our model can well discriminate between pathogenic and normal SNPs. Particularly, our model correctly prioritized SNPs that are proved to be enriched for the binding sites of FOXA1 in breast cancer cell lines from previous studies. Conclusions We presented a novel method to interpret functional genetic variants purely base on DNA sequences. The proposed k-mer based score offers an effective means of measuring the impact of SNPs on the function of the genome, and thus shedding light on the identification of genetic risk factors underlying complex traits and diseases.
Collapse
Affiliation(s)
- Qiao Liu
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic and Systems Biology, TNLIST; Department of Automation, Tsinghua University, Beijing, 100084, China
| | - Mingxin Gan
- Department of Management Science and Engineering, Dongling School of Economics and Management, University of Science and Technology Beijing, Beijing, 100083, China
| | - Rui Jiang
- MOE Key Laboratory of Bioinformatics; Bioinformatics Division and Center for Synthetic and Systems Biology, TNLIST; Department of Automation, Tsinghua University, Beijing, 100084, China.
| |
Collapse
|
27
|
Qin Q, Feng J. Imputation for transcription factor binding predictions based on deep learning. PLoS Comput Biol 2017; 13:e1005403. [PMID: 28234893 PMCID: PMC5345877 DOI: 10.1371/journal.pcbi.1005403] [Citation(s) in RCA: 74] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2016] [Revised: 03/10/2017] [Accepted: 02/09/2017] [Indexed: 01/11/2023] Open
Abstract
Understanding the cell-specific binding patterns of transcription factors (TFs) is fundamental to studying gene regulatory networks in biological systems, for which ChIP-seq not only provides valuable data but is also considered as the gold standard. Despite tremendous efforts from the scientific community to conduct TF ChIP-seq experiments, the available data represent only a limited percentage of ChIP-seq experiments, considering all possible combinations of TFs and cell lines. In this study, we demonstrate a method for accurately predicting cell-specific TF binding for TF-cell line combinations based on only a small fraction (4%) of the combinations using available ChIP-seq data. The proposed model, termed TFImpute, is based on a deep neural network with a multi-task learning setting to borrow information across transcription factors and cell lines. Compared with existing methods, TFImpute achieves comparable accuracy on TF-cell line combinations with ChIP-seq data; moreover, TFImpute achieves better accuracy on TF-cell line combinations without ChIP-seq data. This approach can predict cell line specific enhancer activities in K562 and HepG2 cell lines, as measured by massively parallel reporter assays, and predicts the impact of SNPs on TF binding. Transcription factors play a central role in regulating various cellular processes. They bind to DNA in a cell-specific way. To study where a TF would bind to DNA, ChIP-seq experiment has been developed and widely adopted by the science community to study genome-wide in vivo protein-DNA interactions. However, for each TF, only a limited number of cell types have been explored by ChIP-seq experiments. To study the binding of a TF to a DNA sequence in a cell line without corresponding ChIP-seq data, researchers would check whether there is a motif for the TF in the sequence. However, motif alone contains only sequence information and therefore cannot reflect the cell specificity of TF binding. In this work, we demonstrate how to model the TF binding problem using deep learning and achieve cell specific binding prediction for TF-cell line combinations without ChIP-seq data.
Collapse
Affiliation(s)
- Qian Qin
- Department of Bioinformatics, School of Life Sciences and Technology, Tongji University, Shanghai, China
| | - Jianxing Feng
- Department of Bioinformatics, School of Life Sciences and Technology, Tongji University, Shanghai, China
- * E-mail:
| |
Collapse
|
28
|
Abundant contribution of short tandem repeats to gene expression variation in humans. Nat Genet 2015; 48:22-9. [PMID: 26642241 DOI: 10.1038/ng.3461] [Citation(s) in RCA: 263] [Impact Index Per Article: 26.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2015] [Accepted: 11/12/2015] [Indexed: 12/16/2022]
Abstract
The contribution of repetitive elements to quantitative human traits is largely unknown. Here we report a genome-wide survey of the contribution of short tandem repeats (STRs), which constitute one of the most polymorphic and abundant repeat classes, to gene expression in humans. Our survey identified 2,060 significant expression STRs (eSTRs). These eSTRs were replicable in orthogonal populations and expression assays. We used variance partitioning to disentangle the contribution of eSTRs from that of linked SNPs and indels and found that eSTRs contribute 10-15% of the cis heritability mediated by all common variants. Further functional genomic analyses showed that eSTRs are enriched in conserved regions, colocalize with regulatory elements and may modulate certain histone modifications. By analyzing known genome-wide association study (GWAS) signals and searching for new associations in 1,685 whole genomes from deeply phenotyped individuals, we found that eSTRs are enriched in various clinically relevant conditions. These results highlight the contribution of STRs to the genetic architecture of quantitative human traits.
Collapse
|