1
Chai L, Gao J, Li Z, Sun H, Liu J, Wang Y, Zhang L. Predicting CTCF cell type active binding sites in human genome. Sci Rep 2024; 14:31744. [PMID: 39738353 PMCID: PMC11686126 DOI: 10.1038/s41598-024-82238-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2024] [Accepted: 12/03/2024] [Indexed: 01/02/2025] Open
Abstract
The CCCTC-binding factor (CTCF) is pivotal in orchestrating diverse biological functions across the human genome, yet the mechanisms driving its cell type-active DNA binding affinity remain underexplored. Here, we collected ChIP-seq data from 67 cell lines in ENCODE, constructed a unique dataset of cell type-active CTCF binding sites (CBS), and trained convolutional neural networks (CNN) to dissect the patterns of CTCF binding activity. Our analysis reveals that transcription factors RAD21/SMC3 and chromatin accessibility are more predictive compared to sequence motifs and histone modifications. Integrating them together achieved AUPRC values consistently above 0.868, highlighting their utility in deciphering CTCF transcription factor binding dynamics. This study provides a deeper understanding of the regulatory functions of CTCF via machine learning framework.
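The AUPRC figure quoted in this abstract can be made concrete with a small sketch. The function below is an illustrative assumption, not the authors' evaluation code: it computes average precision, a common step-wise estimate of the area under the precision-recall curve, from binary labels and predicted scores. The name `average_precision` and the toy inputs are hypothetical.

```python
def average_precision(labels, scores):
    """Step-wise estimate of the area under the precision-recall curve.

    labels: iterable of 0/1 ground-truth values (1 = bound site).
    scores: iterable of predicted scores, higher = more likely bound.
    """
    # Rank examples from highest to lowest predicted score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(label for _, label in ranked)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")

    hits = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            # Precision at the rank where recall increases.
            ap += hits / rank
    return ap / total_pos


# Toy example (invented values): two true sites among four candidates.
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
print(average_precision(toy_labels, toy_scores))
```

Tied scores are not treated specially in this sketch, so a production evaluation would more likely rely on an established library routine.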
Affiliation(s)
- Lu Chai
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, People's Republic of China
- Jie Gao
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, People's Republic of China
- Zihan Li
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, People's Republic of China
- Hao Sun
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, People's Republic of China
- Junjie Liu
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, People's Republic of China
- Yong Wang
  - CEMS, NCMIS, HCMS, MDIS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, 100190, People's Republic of China
- Lirong Zhang
  - School of Physical Science and Technology, Inner Mongolia University, Hohhot, 010021, People's Republic of China
2
Yang Y, Pe’er D. REUNION: transcription factor binding prediction and regulatory association inference from single-cell multi-omics data. Bioinformatics 2024; 40:i567-i575. [PMID: 38940155 PMCID: PMC11211829 DOI: 10.1093/bioinformatics/btae234] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION: Profiling of gene expression and chromatin accessibility by single-cell multi-omics approaches can help to systematically decipher how transcription factors (TFs) regulate target gene expression via cis-region interactions. However, integrating information from different modalities to discover regulatory associations is challenging, in part because motif scanning approaches miss many likely TF binding sites.
RESULTS: We develop REUNION, a framework for predicting genome-wide TF binding and cis-region-TF-gene "triplet" regulatory associations using single-cell multi-omics data. The first component of REUNION, Unify, utilizes information theory-inspired complementary score functions that incorporate TF expression, chromatin accessibility, and target gene expression to identify regulatory associations. The second component, Rediscover, takes Unify estimates as input for pseudo semi-supervised learning to predict TF binding in accessible genomic regions that may or may not include detected TF motifs. Rediscover leverages latent chromatin accessibility and sequence feature spaces of the genomic regions, without requiring chromatin immunoprecipitation data for model training. Applied to peripheral blood mononuclear cell data, REUNION outperforms alternative methods in TF binding prediction on average performance. In particular, it recovers missing region-TF associations from regions lacking detected motifs, which circumvents the reliance on motif scanning and facilitates discovery of novel associations involving potential co-binding transcriptional regulators. Newly identified region-TF associations, even in regions lacking a detected motif, improve the prediction of target gene expression in regulatory triplets, and are thus likely to genuinely participate in the regulation.
AVAILABILITY AND IMPLEMENTATION: All source code is available at https://github.com/yangymargaret/REUNION.
Affiliation(s)
- Yang Yang
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, United States
  - Howard Hughes Medical Institute, Chevy Chase, MD 20815, United States
- Dana Pe’er
  - Computational and Systems Biology Program, Sloan Kettering Institute, Memorial Sloan Kettering Cancer Center, New York, NY 10065, United States
  - Howard Hughes Medical Institute, Chevy Chase, MD 20815, United States
3
Filipovic D, Qi W, Kana O, Marri D, LeCluyse EL, Andersen ME, Cuddapah S, Bhattacharya S. Interpretable predictive models of genome-wide aryl hydrocarbon receptor-DNA binding reveal tissue-specific binding determinants. Toxicol Sci 2023; 196:170-186. [PMID: 37707797 PMCID: PMC10682972 DOI: 10.1093/toxsci/kfad094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/15/2023] Open
Abstract
The aryl hydrocarbon receptor (AhR) is an inducible transcription factor whose ligands include the potent environmental contaminant 2,3,7,8-tetrachlorodibenzo-p-dioxin (TCDD). Ligand-activated AhR binds to DNA at dioxin response elements (DREs) containing the core motif 5'-GCGTG-3'. However, AhR binding is highly tissue specific. Most DREs in accessible chromatin are not bound by TCDD-activated AhR, and DREs accessible in multiple tissues can be bound in some and unbound in others. As such, AhR functions similarly to many nuclear receptors. Given that AhR possesses a strong core motif, it is suited for a motif-centered analysis of its binding. We developed interpretable machine learning models predicting the AhR binding status of DREs in MCF-7, GM17212, and HepG2 cells, as well as primary human hepatocytes. Cross-tissue models predicting transcription factor (TF)-DNA binding generally perform poorly. However, reasons for the low performance remain unexplored. By interpreting the results of individual within-tissue models and by examining the features leading to low cross-tissue performance, we identified sequence and chromatin context patterns correlated with AhR binding. We conclude that AhR binding is driven by a complex interplay of tissue-agnostic DRE flanking DNA sequence and tissue-specific local chromatin context. Additionally, we demonstrate that interpretable machine learning models can provide novel and experimentally testable mechanistic insights into DNA binding by inducible TFs.
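The motif-centered analysis described in this abstract starts from locating DRE core motifs. As a minimal sketch, assuming exact matching of the core 5'-GCGTG-3' on both strands (the helper names and the toy sequence are hypothetical and this is not the authors' pipeline), candidate DREs could be located as follows:

```python
CORE = "GCGTG"  # DRE core motif given in the abstract

_COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T, N only)."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def find_dres(seq, core=CORE):
    """List (0-based start, strand) for every exact core match in seq."""
    seq = seq.upper()
    rc_core = reverse_complement(core)
    hits = []
    for i in range(len(seq) - len(core) + 1):
        window = seq[i:i + len(core)]
        if window == core:
            hits.append((i, "+"))   # motif on the forward strand
        elif window == rc_core:
            hits.append((i, "-"))   # motif on the reverse strand
    return hits


# Toy sequence (invented) with one forward and one reverse-strand match.
print(find_dres("TTGCGTGAACACGCTT"))
```

Each hit would then be paired with flanking sequence and local chromatin features before any bound/unbound classification, which this sketch does not attempt.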
Affiliation(s)
- David Filipovic
  - Department of Biomedical Engineering, Michigan State University, East Lansing, Michigan 48824, USA
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, Michigan 48824, USA
  - Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Wenjie Qi
  - Department of Biomedical Engineering, Michigan State University, East Lansing, Michigan 48824, USA
  - Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, Michigan 48824, USA
  - Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Omar Kana
  - Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, Michigan 48824, USA
  - Department of Pharmacology & Toxicology, Michigan State University, East Lansing, Michigan 48824, USA
  - Institute for Integrative Toxicology, Michigan State University, East Lansing, Michigan 48824, USA
- Daniel Marri
  - Department of Biomedical Engineering, Michigan State University, East Lansing, Michigan 48824, USA
  - Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, Michigan 48824, USA
- Edward L LeCluyse
  - LifeSciences Division, LifeNet Health, Research Triangle Park, North Carolina 27709, USA
- Suresh Cuddapah
  - Division of Environmental Medicine, Department of Medicine, New York University School of Medicine, New York, New York 10010, USA
- Sudin Bhattacharya
  - Department of Biomedical Engineering, Michigan State University, East Lansing, Michigan 48824, USA
  - Institute for Quantitative Health Science & Engineering, Michigan State University, East Lansing, Michigan 48824, USA
  - Department of Pharmacology & Toxicology, Michigan State University, East Lansing, Michigan 48824, USA
  - Institute for Integrative Toxicology, Michigan State University, East Lansing, Michigan 48824, USA
  - Center for Research on Ingredient Safety, Michigan State University, East Lansing, Michigan 48824, USA
4
Zhang Z, Feng F, Qiu Y, Liu J. A generalizable framework to comprehensively predict epigenome, chromatin organization, and transcriptome. Nucleic Acids Res 2023; 51:5931-5947. [PMID: 37224527 PMCID: PMC10325920 DOI: 10.1093/nar/gkad436] [Citation(s) in RCA: 10] [Impact Index Per Article: 5.0] [Received: 10/20/2022] [Revised: 03/31/2023] [Accepted: 05/09/2023] [Indexed: 05/26/2023] Open
Abstract
Many deep learning approaches have been proposed to predict epigenetic profiles, chromatin organization, and transcription activity. While these approaches achieve satisfactory performance in predicting one modality from another, the learned representations are not generalizable across predictive tasks or across cell types. In this paper, we propose a deep learning approach named EPCOT which employs a pre-training and fine-tuning framework, and is able to accurately and comprehensively predict multiple modalities including epigenome, chromatin organization, transcriptome, and enhancer activity for new cell types, by only requiring cell-type specific chromatin accessibility profiles. Many of these predicted modalities, such as Micro-C and ChIA-PET, are quite expensive to get in practice, and the in silico prediction from EPCOT should be quite helpful. Furthermore, this pre-training and fine-tuning framework allows EPCOT to identify generic representations generalizable across different predictive tasks. Interpreting EPCOT models also provides biological insights including mapping between different genomic modalities, identifying TF sequence binding patterns, and analyzing cell-type specific TF impacts on enhancer activity.
Affiliation(s)
- Zhenhao Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
- Fan Feng
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
- Yiyang Qiu
  - Department of Computer Science and Engineering, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
- Jie Liu
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
  - Department of Computer Science and Engineering, University of Michigan, 500 S. State St, Ann Arbor, MI 48109, USA
5
Wolpe JB, Martins AL, Guertin MJ. Correction of transposase sequence bias in ATAC-seq data with rule ensemble modeling. NAR Genom Bioinform 2023; 5:lqad054. [PMID: 37274120 PMCID: PMC10236359 DOI: 10.1093/nargab/lqad054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Revised: 04/02/2023] [Accepted: 05/19/2023] [Indexed: 06/06/2023] Open
Abstract
Chromatin accessibility assays have revolutionized the field of transcription regulation by providing single-nucleotide resolution measurements of regulatory features such as promoters and transcription factor binding sites. ATAC-seq directly measures how well the Tn5 transposase accesses chromatinized DNA. Tn5 has a complex sequence bias that is not effectively scaled with traditional bias-correction methods. We model this complex bias using a rule ensemble machine learning approach that integrates information from many input k-mers proximal to the ATAC sequence reads. We effectively characterize and correct single-nucleotide sequence biases and regional sequence biases of the Tn5 enzyme. Correction of enzymatic sequence bias is an important step in interpreting chromatin accessibility assays that aim to infer transcription factor binding and regulatory activity of elements in the genome.
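To illustrate what sequence-bias scaling means in this context, the sketch below implements a simple single-k-mer baseline, not the rule ensemble model the abstract describes. It assumes that bias can be approximated by comparing how often a k-mer occurs at cut sites with how often it occurs in the sequence overall; the function names, the choice of `k`, and the toy inputs are all hypothetical.

```python
from collections import Counter


def kmer_counts(seq, positions, k):
    """Count the k-mer that starts at each of the given 0-based positions."""
    counts = Counter()
    for pos in positions:
        kmer = seq[pos:pos + k]
        if len(kmer) == k:  # skip positions too close to the sequence end
            counts[kmer] += 1
    return counts


def kmer_scale_factors(seq, cut_sites, k=3):
    """Per-k-mer weight = background frequency / frequency at cut sites.

    A factor above 1 means the k-mer is under-represented at cut sites,
    so reads starting with it would be up-weighted, and vice versa.
    """
    observed = kmer_counts(seq, cut_sites, k)
    background = kmer_counts(seq, range(len(seq) - k + 1), k)
    n_obs = sum(observed.values())
    n_bg = sum(background.values())
    factors = {}
    for kmer, obs in observed.items():
        expected_freq = background[kmer] / n_bg
        observed_freq = obs / n_obs
        factors[kmer] = expected_freq / observed_freq
    return factors


# Toy example (invented): cut sites favour "AA" over "CC".
print(kmer_scale_factors("AAAACCCC", [0, 1, 4], k=2))
```

A single fixed k-mer window like this is the kind of scaling the abstract contrasts with its approach, which integrates information from many k-mers proximal to each read.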
Affiliation(s)
- Jacob B Wolpe
  - Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA, USA
- André L Martins
  - Center for Cell Analysis and Modeling, University of Connecticut, Farmington, CT, USA
  - Department of Genetics and Genome Sciences, University of Connecticut, Farmington, CT, USA
- Michael J Guertin
  - Center for Cell Analysis and Modeling, University of Connecticut, Farmington, CT, USA
  - Department of Genetics and Genome Sciences, University of Connecticut, Farmington, CT, USA
6
Villaman C, Pollastri G, Saez M, Martin AJ. Benefiting from the intrinsic role of epigenetics to predict patterns of CTCF binding. Comput Struct Biotechnol J 2023; 21:3024-3031. [PMID: 37266407 PMCID: PMC10229758 DOI: 10.1016/j.csbj.2023.05.012] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/19/2022] [Revised: 05/11/2023] [Accepted: 05/11/2023] [Indexed: 06/03/2023] Open
Abstract
Motivation: One of the most relevant mechanisms involved in the determination of chromatin structure is the formation of structural loops that are also related with the conservation of chromatin states. Many of these loops are stabilized by CCCTC-binding factor (CTCF) proteins at their base. Despite the relevance of chromatin structure and the key role of CTCF, the role of the epigenetic factors that are involved in the regulation of CTCF binding, and thus, in the formation of structural loops in the chromatin, is not thoroughly understood.
Results: Here we describe a CTCF binding predictor based on Random Forest that employs different epigenetic data and genomic features. Importantly, given the ability of Random Forests to determine the relevance of features for the prediction, our approach also shows how the different types of descriptors impact the binding of CTCF, confirming previous knowledge on the relevance of chromatin accessibility and DNA methylation, but demonstrating the effect of epigenetic modifications on the activity of CTCF. We compared our approach against other predictors and found improved performance in terms of areas under PR and ROC curves (PRAUC-ROCAUC), outperforming current state-of-the-art methods.
Affiliation(s)
- Camilo Villaman
  - Programa de Doctorado en Genómica Integrativa, Vicerrectoría de Investigación, Universidad Mayor, Santiago, Chile
  - Laboratorio de Redes Biológicas, Centro Científico y Tecnológico de Excelencia Ciencia & Vida, Fundación Ciencia & Vida, Escuela de Ingeniería, Facultad de Ingeniería, Arquitectura y Diseño, Universidad San Sebastián, Santiago, Chile
- Mauricio Saez
  - Centro de Oncología de Precisión, Facultad de Medicina y Ciencias de la Salud, Universidad Mayor, Santiago, Chile
  - Laboratorio de Investigación en Salud de Precisión, Departamento de Procesos Diagnósticos y Evaluación, Facultad de Ciencias de la Salud, Universidad Católica de Temuco, Chile
- Alberto J.M. Martin
  - Laboratorio de Redes Biológicas, Centro Científico y Tecnológico de Excelencia Ciencia & Vida, Fundación Ciencia & Vida, Escuela de Ingeniería, Facultad de Ingeniería, Arquitectura y Diseño, Universidad San Sebastián, Santiago, Chile
7
Sindeeva M, Chekanov N, Avetisian M, Shashkova TI, Baranov N, Malkin E, Lapin A, Kardymon O, Fishman V. Cell type-specific interpretation of noncoding variants using deep learning-based methods. Gigascience 2023; 12:giad015. [PMID: 36971292 PMCID: PMC10041527 DOI: 10.1093/gigascience/giad015] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/01/2022] [Revised: 12/11/2022] [Accepted: 02/27/2023] [Indexed: 03/29/2023] Open
Abstract
Interpretation of noncoding genomic variants is one of the most important challenges in human genetics. Machine learning methods have emerged recently as a powerful tool to solve this problem. State-of-the-art approaches allow prediction of transcriptional and epigenetic effects caused by noncoding mutations. However, these approaches require specific experimental data for training and cannot generalize across cell types where required features were not experimentally measured. We show here that available epigenetic characteristics of human cell types are extremely sparse, limiting those approaches that rely on specific epigenetic input. We propose a new neural network architecture, DeepCT, which can learn complex interconnections of epigenetic features and infer unmeasured data from any available input. Furthermore, we show that DeepCT can learn cell type-specific properties, build biologically meaningful vector representations of cell types, and utilize these representations to generate cell type-specific predictions of the effects of noncoding variations in the human genome.
Affiliation(s)
- Veniamin Fishman
  - AIRI, Moscow, 121170, Russia
  - Institute of Cytology and Genetics, Novosibirsk, 630099, Russia
  - Novosibirsk State University, Novosibirsk, 630090, Russia
8
van der Sande M, Frölich S, van Heeringen SJ. Computational approaches to understand transcription regulation in development. Biochem Soc Trans 2023; 51:1-12. [PMID: 36695505 PMCID: PMC9988001 DOI: 10.1042/bst20210145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 01/07/2023] [Accepted: 01/13/2023] [Indexed: 01/26/2023]
Abstract
Gene regulatory networks (GRNs) serve as useful abstractions to understand transcriptional dynamics in developmental systems. Computational prediction of GRNs has been successfully applied to genome-wide gene expression measurements with the advent of microarrays and RNA-sequencing. However, these inferred networks are inaccurate and mostly based on correlative rather than causative interactions. In this review, we highlight three approaches that significantly impact GRN inference: (1) moving from one genome-wide functional modality, gene expression, to multi-omics, (2) single cell sequencing, to measure cell type-specific signals and predict context-specific GRNs, and (3) neural networks as flexible models. Together, these experimental and computational developments have the potential to significantly impact the quality of inferred GRNs. Ultimately, accurately modeling the regulatory interactions between transcription factors and their target genes will be essential to understand the role of transcription factors in driving developmental gene expression programs and to derive testable hypotheses for validation.
Affiliation(s)
- Simon J. van Heeringen
  - Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
9
Zhang Q, Teng P, Wang S, He Y, Cui Z, Guo Z, Liu Y, Yuan C, Liu Q, Huang DS. Computational prediction and characterization of cell-type-specific and shared binding sites. Bioinformatics 2022; 39:6885447. [PMID: 36484687 PMCID: PMC9825777 DOI: 10.1093/bioinformatics/btac798] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2022] [Revised: 11/24/2022] [Accepted: 12/08/2022] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION: Cell-type-specific gene expression is maintained in large part by transcription factors (TFs) selectively binding to distinct sets of sites in different cell types. Recent research works have provided evidence that such cell-type-specific binding is determined by TF's intrinsic sequence preferences, cooperative interactions with co-factors, cell-type-specific chromatin landscapes and 3D chromatin interactions. However, computational prediction and characterization of cell-type-specific and shared binding sites is rarely studied.
RESULTS: In this article, we propose two computational approaches for predicting and characterizing cell-type-specific and shared binding sites by integrating multiple types of features, in which one is based on XGBoost and another is based on convolutional neural network (CNN). To validate the performance of our proposed approaches, ChIP-seq datasets of 10 binding factors were collected from the GM12878 (lymphoblastoid) and K562 (erythroleukemic) human hematopoietic cell lines, each of which was further categorized into cell-type-specific (GM12878- and K562-specific) and shared binding sites. Then, multiple types of features for these binding sites were integrated to train the XGBoost- and CNN-based models. Experimental results show that our proposed approaches significantly outperform other competing methods on three classification tasks. Moreover, we identified independent feature contributions for cell-type-specific and shared sites through SHAP values and explored the ability of the CNN-based model to predict cell-type-specific and shared binding sites by excluding or including DNase signals. Furthermore, we investigated the generalization ability of our proposed approaches to different binding factors in the same cellular environment.
AVAILABILITY AND IMPLEMENTATION: The source code is available at: https://github.com/turningpoint1988/CSSBS.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
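The abstract does not state how peaks were assigned to the three categories. As a rough sketch under the assumption that a site is "shared" when peaks from the two cell lines overlap by at least one base (the overlap rule, the function names, and the toy coordinates are illustrative, not taken from the paper), the categorization step might look like this:

```python
def overlaps(a, b):
    """True if two (chrom, start, end) half-open intervals share a base."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def categorize(peaks_a, peaks_b):
    """Split peaks into (A-specific, shared, B-specific) lists.

    Shared peaks are reported using the coordinates from cell line A.
    This is a quadratic-time scan, fine for a toy example only.
    """
    shared, a_specific = [], []
    for peak in peaks_a:
        if any(overlaps(peak, other) for other in peaks_b):
            shared.append(peak)
        else:
            a_specific.append(peak)
    b_specific = [
        peak for peak in peaks_b
        if not any(overlaps(peak, other) for other in peaks_a)
    ]
    return a_specific, shared, b_specific


# Toy peak lists (invented coordinates) standing in for two cell lines.
gm12878 = [("chr1", 100, 200), ("chr1", 500, 600)]
k562 = [("chr1", 150, 250), ("chr2", 500, 600)]
print(categorize(gm12878, k562))
```

For genome-scale peak sets, an interval tree or a sorted sweep would replace the nested scan.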
Affiliation(s)
- Qinhu Zhang
  - Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Pengrui Teng
  - School of Information and Control Engineering, China University of Mining and Technology, Xuzhou 221116, China
- Siguo Wang
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
- Ying He
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
- Zhen Cui
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
- Zhenghao Guo
  - Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
- Yixin Liu
  - School of Health Science and Engineering, University of Shanghai for Science and Technology, Shanghai 200093, China
- Changan Yuan
  - Big Data and Intelligent Computing Research Center, Guangxi Academy of Science, Nanning 530007, China
- Qi Liu
10
Toneyan S, Tang Z, Koo PK. Evaluating deep learning for predicting epigenomic profiles. Nat Mach Intell 2022. [DOI: 10.1038/s42256-022-00570-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
11
Rivière Q, Corso M, Ciortan M, Noël G, Verbruggen N, Defrance M. Exploiting genomic features to improve the prediction of transcription factor-binding sites in plants. Plant Cell Physiol 2022; 63:1457-1473. [PMID: 35799371 DOI: 10.1093/pcp/pcac095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2021] [Revised: 06/07/2022] [Accepted: 07/06/2022] [Indexed: 06/15/2023]
Abstract
The identification of transcription factor (TF) target genes is central in biology. A popular approach is based on the location by pattern matching of potential cis-regulatory elements (CREs). During the last few years, tools integrating next-generation sequencing data have been developed to improve the performance of pattern matching. However, such tools have not yet been comprehensively evaluated in plants. Hence, we developed a new streamlined method aiming at predicting CREs and target genes of plant TFs in specific organs or conditions. Our approach implements a supervised machine learning strategy, which allows decision rule models to be learnt using TF ChIP-chip/seq experimental data. Different layers of genomic features were integrated in predictive models: the position on the gene, the DNA sequence conservation, the chromatin state and various CRE footprints. Among the tested features, the chromatin features were crucial for improving the accuracy of the method. Furthermore, we evaluated the transferability of predictive models across TFs, organs and species. Finally, we validated our method by correctly inferring the target genes of key TFs controlling metabolite biosynthesis at the organ level in Arabidopsis. We developed a tool-Wimtrap-to reproduce our approach in plant species and conditions/organs for which ChIP-chip/seq data are available. Wimtrap is a user-friendly R package that supports an R Shiny web interface and is provided with pre-built models that can be used to quickly get predictions of CREs and TF gene targets in different organs or conditions in Arabidopsis thaliana, Solanum lycopersicum, Oryza sativa and Zea mays.
Affiliation(s)
- Quentin Rivière
  - Brussels Bioengineering School, Laboratory of Plant Physiology and molecular Genetics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Massimiliano Corso
  - Brussels Bioengineering School, Laboratory of Plant Physiology and molecular Genetics, Université Libre de Bruxelles, Brussels 1050, Belgium
  - INRAE, AgroParisTech, Institut Jean-Pierre Bourgin (IJPB), Université Paris-Saclay, Versailles 78000, France
- Madalina Ciortan
  - Interuniversity Institute of Bioinformatics in Brussels, Machine Learning Group, Université Libre de Bruxelles, Brussels 1050, Belgium
- Grégoire Noël
  - Functional and Evolutionary Entomology, Gembloux Agro-Bio Tech, University of Liège, Passage des Déportés 2, Gembloux 5030, Belgium
- Nathalie Verbruggen
  - Brussels Bioengineering School, Laboratory of Plant Physiology and molecular Genetics, Université Libre de Bruxelles, Brussels 1050, Belgium
- Matthieu Defrance
  - Interuniversity Institute of Bioinformatics in Brussels, Machine Learning Group, Université Libre de Bruxelles, Brussels 1050, Belgium
12
Liao J, Wang Q, Wu F, Huang Z. In silico methods for identification of potential active sites of therapeutic targets. Molecules 2022; 27:7103. [PMID: 36296697 PMCID: PMC9609013 DOI: 10.3390/molecules27207103] [Citation(s) in RCA: 27] [Impact Index Per Article: 9.0] [Received: 06/30/2022] [Revised: 08/12/2022] [Accepted: 08/25/2022] [Indexed: 07/30/2023] Open
Abstract
Target identification is an important step in drug discovery, and computer-aided drug target identification methods are attracting more attention compared with traditional drug target identification methods, which are time-consuming and costly. Computer-aided drug target identification methods can greatly reduce the searching scope of experimental targets and associated costs by identifying the diseases-related targets and their binding sites and evaluating the druggability of the predicted active sites for clinical trials. In this review, we introduce the principles of computer-based active site identification methods, including the identification of binding sites and assessment of druggability. We provide some guidelines for selecting methods for the identification of binding sites and assessment of druggability. In addition, we list the databases and tools commonly used with these methods, present examples of individual and combined applications, and compare the methods and tools. Finally, we discuss the challenges and limitations of binding site identification and druggability assessment at the current stage and provide some recommendations and future perspectives.
Affiliation(s)
- Jianbo Liao
  - Key Laboratory of Big Data Mining and Precision Drug Design of Guangdong Medical University, Key Laboratory of Computer-Aided Drug Design of Dongguan City, Key Laboratory for Research and Development of Natural Drugs of Guangdong Province, School of Pharmacy, Guangdong Medical University, Dongguan 523808, China
  - The Second School of Clinical Medicine, Guangdong Medical University, Dongguan 523808, China
- Qinyu Wang
  - Key Laboratory of Big Data Mining and Precision Drug Design of Guangdong Medical University, Key Laboratory of Computer-Aided Drug Design of Dongguan City, Key Laboratory for Research and Development of Natural Drugs of Guangdong Province, School of Pharmacy, Guangdong Medical University, Dongguan 523808, China
- Fengxu Wu
  - Hubei Key Laboratory of Wudang Local Chinese Medicine Research, School of Pharmaceutical Sciences, Hubei University of Medicine, Shiyan 442000, China
- Zunnan Huang
  - Key Laboratory of Big Data Mining and Precision Drug Design of Guangdong Medical University, Key Laboratory of Computer-Aided Drug Design of Dongguan City, Key Laboratory for Research and Development of Natural Drugs of Guangdong Province, School of Pharmacy, Guangdong Medical University, Dongguan 523808, China
  - Marine Biomedical Research Institute of Guangdong Zhanjiang, Zhanjiang 524023, China
13
Lal A. Deciphering the regulatory syntax of genomic DNA with deep learning. J Biosci 2022. [DOI: 10.1007/s12038-022-00291-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/17/2022]
14
Yi R, Cho K, Bonneau R. NetTIME: a multitask and base-pair resolution framework for improved transcription factor binding site prediction. Bioinformatics 2022; 38:4762-4770. [PMID: 35997560 PMCID: PMC9563695 DOI: 10.1093/bioinformatics/btac569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/06/2021] [Revised: 08/16/2022] [Accepted: 08/20/2022] [Indexed: 12/05/2022] Open
Abstract
Motivation: Machine learning models for predicting cell-type-specific transcription factor (TF) binding sites have become increasingly more accurate thanks to the increased availability of next-generation sequencing data and more standardized model evaluation criteria. However, knowledge transfer from data-rich to data-limited TFs and cell types remains crucial for improving TF binding prediction models because available binding labels are highly skewed towards a small collection of TFs and cell types. Transfer prediction of TF binding sites can potentially benefit from a multitask learning approach; however, existing methods typically use shallow single-task models to generate low-resolution predictions. Here, we propose NetTIME, a multitask learning framework for predicting cell-type-specific TF binding sites with base-pair resolution. Results: We show that the multitask learning strategy for TF binding prediction is more efficient than the single-task approach due to the increased data availability. NetTIME trains high-dimensional embedding vectors to distinguish TF and cell-type identities. We show that this approach is critical for the success of the multitask learning strategy and allows our model to make accurate transfer predictions within and beyond the training panels of TFs and cell types. We additionally train a linear-chain conditional random field (CRF) to classify binding predictions and show that this CRF eliminates the need for setting a probability threshold and reduces classification noise. We compare our method’s predictive performance with two state-of-the-art methods, Catchitt and Leopard, and show that our method outperforms previous methods under both supervised and transfer learning settings. Availability and implementation: NetTIME is freely available at https://github.com/ryi06/NetTIME and the code is also archived at https://doi.org/10.5281/zenodo.6994897. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ren Yi
- Department of Computer Science, New York University, New York, NY, 10011, USA
| | - Kyunghyun Cho
- Department of Computer Science, New York University, New York, NY, 10011, USA; Center for Data Science, New York University, New York, NY, 10011, USA; Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
| | - Richard Bonneau
- Department of Computer Science, New York University, New York, NY, 10011, USA; Center for Data Science, New York University, New York, NY, 10011, USA; Department of Biology, New York University, New York, NY, 10003, USA; Prescient Design, a Genentech accelerator, New York, NY, 10010, USA
|
15
|
Ng JWK, Ong EHQ, Tucker-Kellogg L, Tucker-Kellogg G. Deep learning for de-convolution of Smad2 versus Smad3 binding sites. BMC Genomics 2022; 23:525. [PMID: 35858839 PMCID: PMC9297549 DOI: 10.1186/s12864-022-08565-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2022] [Accepted: 04/19/2022] [Indexed: 11/10/2022] Open
Abstract
Background: The transforming growth factor beta-1 (TGF β-1) cytokine exerts both pro-tumor and anti-tumor effects in carcinogenesis. An increasing body of literature suggests that TGF β-1 signaling outcome is partially dependent on the regulatory targets of downstream receptor-regulated Smad (R-Smad) proteins Smad2 and Smad3. However, the lack of Smad-specific antibodies for ChIP-seq hinders convenient identification of Smad-specific binding sites. Results: In this study, we use localization and affinity purification (LAP) tags to identify Smad-specific binding sites in a cancer cell line. Using ChIP-seq data obtained from LAP-tagged Smad proteins, we develop a convolutional neural network with long-short term memory (CNN-LSTM) as a deep learning approach to classify a pool of Smad-bound sites as being Smad2- or Smad3-bound. Our data showed that this approach is able to accurately classify Smad2- versus Smad3-bound sites. We use our model to dissect the role of each R-Smad in the progression of breast cancer using a previously published dataset. Conclusions: Our results suggest that deep learning approaches can be used to dissect binding site specificity of closely related transcription factors. Supplementary Information: The online version contains supplementary material available at (10.1186/s12864-022-08565-x).
Affiliation(s)
- Jeremy W K Ng
- Department of Biological Sciences, National University of Singapore, Singapore, Singapore
| | - Esther H Q Ong
- Department of Biological Sciences, National University of Singapore, Singapore, Singapore
| | - Lisa Tucker-Kellogg
- Cancer and Stem Cell Biology, and Centre for Computational Biology, Duke-NUS Medical School, Singapore, Singapore.
| | - Greg Tucker-Kellogg
- Department of Biological Sciences, National University of Singapore, Singapore, Singapore; Computational Biology Programme, Faculty of Science, National University of Singapore, Singapore, Singapore.
|
16
|
Luo K, Zhong J, Safi A, Hong LK, Tewari AK, Song L, Reddy TE, Ma L, Crawford GE, Hartemink AJ. Profiling the quantitative occupancy of myriad transcription factors across conditions by modeling chromatin accessibility data. Genome Res 2022; 32:1183-1198. [PMID: 35609992 PMCID: PMC9248881 DOI: 10.1101/gr.272203.120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2020] [Accepted: 05/06/2022] [Indexed: 11/24/2022]
Abstract
Over a thousand different transcription factors (TFs) bind with varying occupancy across the human genome. Chromatin immunoprecipitation (ChIP) can assay occupancy genome-wide, but only one TF at a time, limiting our ability to comprehensively observe the TF occupancy landscape, let alone quantify how it changes across conditions. We developed TF occupancy profiler (TOP), a Bayesian hierarchical regression framework, to profile genome-wide quantitative occupancy of numerous TFs using data from a single chromatin accessibility experiment (DNase- or ATAC-seq). TOP is supervised, and its hierarchical structure allows it to predict the occupancy of any sequence-specific TF, even those never assayed with ChIP. We used TOP to profile the quantitative occupancy of hundreds of sequence-specific TFs at sites throughout the genome and examined how their occupancies changed in multiple contexts: in approximately 200 human cell types, through 12 h of exposure to different hormones, and across the genetic backgrounds of 70 individuals. TOP enables cost-effective exploration of quantitative changes in the landscape of TF binding.
Affiliation(s)
- Kaixuan Luo
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
- Department of Human Genetics, The University of Chicago, Chicago, Illinois 60637, USA
| | - Jianling Zhong
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
| | - Alexias Safi
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
| | - Linda K Hong
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
| | - Alok K Tewari
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
| | - Lingyun Song
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
| | - Timothy E Reddy
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Biostatistics and Bioinformatics, Durham, North Carolina 27710, USA
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, North Carolina 27710, USA
- Department of Biomedical Engineering, Duke University, Durham, North Carolina 27708, USA
| | - Li Ma
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Department of Statistical Science, Duke University, Durham, North Carolina 27708, USA
| | - Gregory E Crawford
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
| | - Alexander J Hartemink
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
- Department of Biology, Duke University, Durham, North Carolina 27708, USA
|
17
|
Li H, Guan Y. Asymmetric predictive relationships across histone modifications. Nat Mach Intell 2022; 4:288-299. [DOI: 10.1038/s42256-022-00455-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
|
18
|
Morrow A, Hughes J, Singh J, Joseph A, Yosef N. Epitome: predicting epigenetic events in novel cell types with multi-cell deep ensemble learning. Nucleic Acids Res 2021; 49:e110. [PMID: 34379786 PMCID: PMC8565335 DOI: 10.1093/nar/gkab676] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/11/2021] [Revised: 07/19/2021] [Accepted: 07/25/2021] [Indexed: 01/04/2023] Open
Abstract
The accumulation of large epigenomics data consortiums provides us with the opportunity to extrapolate existing knowledge to new cell types and conditions. We propose Epitome, a deep neural network that learns similarities of chromatin accessibility between well characterized reference cell types and a query cellular context, and copies over signal of transcription factor binding and modification of histones from reference cell types when chromatin profiles are similar to the query. Epitome achieves state-of-the-art accuracy when predicting transcription factor binding sites on novel cellular contexts and can further improve predictions as more epigenetic signals are collected from both reference cell types and the query cellular context of interest.
Affiliation(s)
- Alyssa Kramer Morrow
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
| | - John Weston Hughes
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
- Computer Science Department, Stanford University, 353 Serra Mall, Stanford, CA 94305, USA
| | - Jahnavi Singh
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
| | - Anthony Douglas Joseph
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
- Center for Computational Biology, University of California-Berkeley 108 Stanley Hall, Berkeley, CA 94720-3220, USA
- Unite Genomics, Inc., 1301 Marina Village Pkwy, Suite 320, Alameda, CA 94501, USA
| | - Nir Yosef
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
- Center for Computational Biology, University of California-Berkeley 108 Stanley Hall, Berkeley, CA 94720-3220, USA
- Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology, and Harvard University, Boston, MA, 02139, USA
- Chan Zuckerberg Biohub, San Francisco, CA, 94158, USA
|
19
|
Wang H, Huang B, Wang J. Predict long-range enhancer regulation based on protein-protein interactions between transcription factors. Nucleic Acids Res 2021; 49:10347-10368. [PMID: 34570239 PMCID: PMC8501976 DOI: 10.1093/nar/gkab841] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/20/2021] [Revised: 08/10/2021] [Accepted: 09/10/2021] [Indexed: 12/18/2022] Open
Abstract
Long-range regulation by distal enhancers plays critical roles in cell-type specific transcriptional programs. Computational predictions of genome-wide enhancer-promoter interactions are still challenging due to limited accuracy and the lack of knowledge on the molecular mechanisms. Based on recent biological investigations, the protein-protein interactions (PPIs) between transcription factors (TFs) have been found to participate in the regulation of chromatin loops. Therefore, we developed a novel predictive model for cell-type specific enhancer-promoter interactions by leveraging the information of TF PPI signatures. Evaluated by a series of rigorous performance comparisons, the new model achieves superior performance over other methods. The model also identifies specific TF PPIs that may mediate long-range regulatory interactions, revealing new mechanistic understandings of enhancer regulation. The prioritized TF PPIs are associated with genes in distinct biological pathways, and the predicted enhancer-promoter interactions are strongly enriched with cis-eQTLs. Most interestingly, the model discovers enhancer-mediated trans-regulatory links between TFs and genes, which are significantly enriched with trans-eQTLs. The new predictive model, along with the genome-wide analyses, provides a platform to systematically delineate the complex interplay among TFs, enhancers and genes in long-range regulation. The novel predictions also lead to mechanistic interpretations of eQTLs to decode the genetic associations with gene expression.
Affiliation(s)
- Hao Wang
- Department of Computational Mathematics, Science and Engineering, Michigan State University, 428 S. Shaw Ln., East Lansing, MI 48824, USA
| | - Binbin Huang
- Department of Computational Mathematics, Science and Engineering, Michigan State University, 428 S. Shaw Ln., East Lansing, MI 48824, USA
| | - Jianrong Wang
- Department of Computational Mathematics, Science and Engineering, Michigan State University, 428 S. Shaw Ln., East Lansing, MI 48824, USA
|
20
|
Xu Q, Georgiou G, Frölich S, van der Sande M, Veenstra G, Zhou H, van Heeringen S. ANANSE: an enhancer network-based computational approach for predicting key transcription factors in cell fate determination. Nucleic Acids Res 2021; 49:7966-7985. [PMID: 34244796 PMCID: PMC8373078 DOI: 10.1093/nar/gkab598] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 11/04/2020] [Revised: 06/02/2021] [Accepted: 06/28/2021] [Indexed: 12/21/2022] Open
Abstract
Proper cell fate determination is largely orchestrated by complex gene regulatory networks centered around transcription factors. However, experimental elucidation of key transcription factors that drive cellular identity is currently often intractable. Here, we present ANANSE (ANalysis Algorithm for Networks Specified by Enhancers), a network-based method that exploits enhancer-encoded regulatory information to identify the key transcription factors in cell fate determination. As cell type-specific transcription factors predominantly bind to enhancers, we use regulatory networks based on enhancer properties to prioritize transcription factors. First, we predict genome-wide binding profiles of transcription factors in various cell types using enhancer activity and transcription factor binding motifs. Subsequently, applying these inferred binding profiles, we construct cell type-specific gene regulatory networks, and then predict key transcription factors controlling cell fate transitions using differential networks between cell types. This method outperforms existing approaches in correctly predicting major transcription factors previously identified to be sufficient for trans-differentiation. Finally, we apply ANANSE to define an atlas of key transcription factors in 18 normal human tissues. In conclusion, we present a ready-to-implement computational tool for efficient prediction of transcription factors in cell fate determination and to study transcription factor-mediated regulatory mechanisms. ANANSE is freely available at https://github.com/vanheeringen-lab/ANANSE.
Affiliation(s)
- Quan Xu
- Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
| | - Georgios Georgiou
- Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
| | - Siebren Frölich
- Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
| | - Maarten van der Sande
- Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
| | - Gert Jan C Veenstra
- Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
| | - Huiqing Zhou
- Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
- Radboud University Medical Center, Department of Human Genetics, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
| | - Simon J van Heeringen
- Radboud University, Department of Molecular Developmental Biology, Faculty of Science, Radboud Institute for Molecular Life Sciences, 6525GA Nijmegen, The Netherlands
|
21
|
Abstract
Interpreting the effects of genetic variants is key to understanding individual susceptibility to disease and designing personalized therapeutic approaches. Modern experimental technologies are enabling the generation of massive compendia of human genome sequence data and associated molecular and phenotypic traits, together with genome-scale expression, epigenomics and other functional genomic data. Integrative computational models can leverage these data to understand variant impact, elucidate the effect of dysregulated genes on biological pathways in specific disease and tissue contexts, and interpret disease risk beyond what is feasible with experiments alone. In this Review, we discuss recent developments in machine learning algorithms for genome interpretation and for integrative molecular-level modelling of cells, tissues and organs relevant to disease. More specifically, we highlight existing methods and key challenges and opportunities in identifying specific disease-causing genetic variants and linking them to molecular pathways and, ultimately, to disease phenotypes.
|
22
|
Murgas L, Contreras-Riquelme S, Martínez-Hernandez JE, Villaman C, Santibáñez R, Martin AJM. Automated generation of context-specific gene regulatory networks with a weighted approach in Drosophila melanogaster. Interface Focus 2021; 11:20200076. [PMID: 34123358 PMCID: PMC8193463 DOI: 10.1098/rsfs.2020.0076] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Accepted: 04/21/2021] [Indexed: 01/22/2023] Open
Abstract
The regulation of gene expression is a key factor in the development and maintenance of life in all organisms. Even so, little is known at whole genome scale for most genes and contexts. We propose a method, Tool for Weighted Epigenomic Networks in Drosophila melanogaster (Fly T-WEoN), to generate context-specific gene regulatory networks starting from a reference network that contains all known gene regulations in the fly. Unlikely regulations are removed by applying a series of knowledge-based filters. Each of these filters is implemented as an independent module that considers a type of experimental evidence, including DNA methylation, chromatin accessibility, histone modifications and gene expression. Fly T-WEoN is based on heuristic rules that reflect current knowledge on gene regulation in D. melanogaster obtained from the literature. Experimental data files can be generated with several standard procedures and used solely when and if available. Fly T-WEoN is available as a Cytoscape application that permits integration with other tools and facilitates downstream network analysis. In this work, we first demonstrate the reliability of our method to then provide a relevant application case of our tool: early development of D. melanogaster. Fly T-WEoN together with its step-by-step guide is available at https://weon.readthedocs.io.
Affiliation(s)
- Leandro Murgas
- Laboratorio de Biología de Redes, Centro de Genómica y Bioinformática, Facultad de Ciencias, Universidad Mayor, Santiago 8580745, Chile; Programa de Doctorado en Genómica Integrativa, Vicerrectoría de Investigación, Universidad Mayor, Santiago, Chile
| | - Sebastian Contreras-Riquelme
- Laboratorio de Biología de Redes, Centro de Genómica y Bioinformática, Facultad de Ciencias, Universidad Mayor, Santiago 8580745, Chile; Facultad de Ciencias de la Vida, Universidad Andres Bello, Santiago 8370146, Chile
| | - J Eduardo Martínez-Hernandez
- Laboratorio de Biología de Redes, Centro de Genómica y Bioinformática, Facultad de Ciencias, Universidad Mayor, Santiago 8580745, Chile; Centro de Modelamiento Molecular, Biofísica y Bioinformática-CM2B2, Facultad de Ciencias Químicas y Farmaceuticas, Universidad de Chile, Santiago 8380492, Chile; Programa de Doctorado en Genómica Integrativa, Vicerrectoría de Investigación, Universidad Mayor, Santiago, Chile
| | - Camilo Villaman
- Laboratorio de Biología de Redes, Centro de Genómica y Bioinformática, Facultad de Ciencias, Universidad Mayor, Santiago 8580745, Chile; Programa de Doctorado en Genómica Integrativa, Vicerrectoría de Investigación, Universidad Mayor, Santiago, Chile
| | - Rodrigo Santibáñez
- Laboratorio de Biología de Redes, Centro de Genómica y Bioinformática, Facultad de Ciencias, Universidad Mayor, Santiago 8580745, Chile
| | - Alberto J M Martin
- Laboratorio de Biología de Redes, Centro de Genómica y Bioinformática, Facultad de Ciencias, Universidad Mayor, Santiago 8580745, Chile
|
23
|
Schreiber J, Singh R. Machine learning for profile prediction in genomics. Curr Opin Chem Biol 2021; 65:35-41. [PMID: 34107341 DOI: 10.1016/j.cbpa.2021.04.008] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/31/2021] [Revised: 04/21/2021] [Accepted: 04/24/2021] [Indexed: 02/08/2023]
Abstract
A recent deluge of publicly available multi-omics data has fueled the development of machine learning methods aimed at investigating important questions in genomics. Although the motivations for these methods vary, a task that is commonly adopted is that of profile prediction, where predictions are made for one or more forms of biochemical activity along the genome, for example, histone modification, chromatin accessibility, or protein binding. In this review, we give an overview of the research works performing profile prediction, define two broad categories of profile prediction tasks, and discuss the types of scientific questions that can be answered in each.
Affiliation(s)
| | - Ritambhara Singh
- Department of Computer Science, Center for Computational Molecular Biology, Brown University, United States.
|
24
|
Meyer P, Saez-Rodriguez J. Advances in systems biology modeling: 10 years of crowdsourcing DREAM challenges. Cell Syst 2021; 12:636-653. [PMID: 34139170 DOI: 10.1016/j.cels.2021.05.015] [Citation(s) in RCA: 25] [Impact Index Per Article: 6.3] [Received: 12/17/2020] [Revised: 03/29/2021] [Accepted: 05/18/2021] [Indexed: 02/07/2023]
Abstract
Computational and mathematical models are key to obtain a system-level understanding of biological processes, but their limitations have to be clearly defined to allow their proper application and interpretation. Crowdsourced benchmarks in the form of challenges provide an unbiased assessment of methods, and for the past decade, the Dialogue for Reverse Engineering Assessment and Methods (DREAM) organized more than 15 systems biology challenges. From transcription factor binding to dynamical network models, from signaling networks to gene regulation, from whole-cell models to cell-lineage reconstruction, and from single-cell positioning in a tissue to drug combinations and cell survival, the breadth is broad. To celebrate the 5-year anniversary of Cell Systems, we review the genesis of these systems biology challenges and discuss how interlocking the forward- and reverse-modeling paradigms allows to push the rim of systems biology. This approach will persist for systems levels approaches in biology and medicine.
Affiliation(s)
- Pablo Meyer
- IBM T.J. Watson Research Center, Yorktown Heights, NY, USA.
| | - Julio Saez-Rodriguez
- Institute for Computational Biomedicine, Heidelberg University Hospital and Heidelberg University, Faculty of Medicine, Bioquant, Heidelberg 69120, Germany
|
25
|
Li H, Guan Y. Fast decoding cell type-specific transcription factor binding landscape at single-nucleotide resolution. Genome Res 2021; 31:721-731. [PMID: 33741685 PMCID: PMC8015851 DOI: 10.1101/gr.269613.120] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 07/30/2020] [Accepted: 02/17/2021] [Indexed: 01/22/2023]
Abstract
Decoding the cell type-specific transcription factor (TF) binding landscape at single-nucleotide resolution is crucial for understanding the regulatory mechanisms underlying many fundamental biological processes and human diseases. However, limits on time and resources restrict the high-resolution experimental measurements of TF binding profiles of all possible TF-cell type combinations. Previous computational approaches either cannot distinguish the cell context-dependent TF binding profiles across diverse cell types or can only provide a relatively low-resolution prediction. Here we present a novel deep learning approach, Leopard, for predicting TF binding sites at single-nucleotide resolution, achieving the average area under receiver operating characteristic curve (AUROC) of 0.982 and the average area under precision recall curve (AUPRC) of 0.208. Our method substantially outperformed the state-of-the-art methods Anchor and FactorNet, improving the predictive AUPRC by 19% and 27%, respectively, when evaluated at 200-bp resolution. Meanwhile, by leveraging a many-to-many neural network architecture, Leopard features a hundredfold to thousandfold speedup compared with current many-to-one machine learning methods.
Affiliation(s)
- Hongyang Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
| | - Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
|
26
|
Integrative analysis identifies bHLH transcription factors as contributors to Parkinson's disease risk mechanisms. Sci Rep 2021; 11:3502. [PMID: 33568722 PMCID: PMC7875985 DOI: 10.1038/s41598-021-83087-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/06/2019] [Accepted: 01/26/2021] [Indexed: 11/08/2022] Open
Abstract
Genome-wide association studies (GWAS) have identified multiple genetic risk signals for Parkinson’s disease (PD), however translation into underlying biological mechanisms remains scarce. Genomic functional annotations of neurons provide new resources that may be integrated into analyses of GWAS findings. Altered transcription factor binding plays an important role in human diseases. Insight into transcriptional networks involved in PD risk mechanisms may thus improve our understanding of pathogenesis. We analysed overlap between genome-wide association signals in PD and open chromatin in neurons across multiple brain regions, finding a significant enrichment in the superior temporal cortex. The involvement of transcriptional networks was explored in neurons of the superior temporal cortex based on the location of candidate transcription factor motifs identified by two de novo motif discovery methods. Analyses were performed in parallel, both finding that PD risk variants significantly overlap with open chromatin regions harboring motifs of basic Helix-Loop-Helix (bHLH) transcription factors. Our findings show that cortical neurons are likely mediators of genetic risk for PD. The concentration of PD risk variants at sites of open chromatin targeted by members of the bHLH transcription factor family points to an involvement of these transcriptional networks in PD risk mechanisms.
|
27
|
Chen C, Hou J, Shi X, Yang H, Birchler JA, Cheng J. DeepGRN: prediction of transcription factor binding site across cell-types using attention-based deep neural networks. BMC Bioinformatics 2021; 22:38. [PMID: 33522898 PMCID: PMC7852092 DOI: 10.1186/s12859-020-03952-1] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Received: 03/09/2020] [Accepted: 12/29/2020] [Indexed: 12/21/2022] Open
Abstract
Background: Due to the complexity of the biological systems, the prediction of the potential DNA binding sites for transcription factors remains a difficult problem in computational biology. Genomic DNA sequences and experimental results from parallel sequencing provide available information about the affinity and accessibility of genome and are commonly used features in binding sites prediction. The attention mechanism in deep learning has shown its capability to learn long-range dependencies from sequential data, such as sentences and voices. Until now, no study has applied this approach in binding site inference from massively parallel sequencing data. The successful applications of attention mechanism in similar input contexts motivate us to build and test new methods that can accurately determine the binding sites of transcription factors. Results: In this study, we propose a novel tool (named DeepGRN) for transcription factors binding site prediction based on the combination of two components: single attention module and pairwise attention module. The performance of our methods is evaluated on the ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge datasets. The results show that DeepGRN achieves higher unified scores in 6 of 13 targets than any of the top four methods in the DREAM challenge. We also demonstrate that the attention weights learned by the model are correlated with potential informative inputs, such as DNase-Seq coverage and motifs, which provide possible explanations for the predictive improvements in DeepGRN. Conclusions: DeepGRN can automatically and effectively predict transcription factor binding sites from DNA sequences and DNase-Seq coverage. Furthermore, the visualization techniques we developed for the attention modules help to interpret how critical patterns from different types of input features are recognized by our model.
Affiliation(s)
- Chen Chen
  - Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, 65211, USA
- Jie Hou
  - Department of Computer Science, Saint Louis University, St. Louis, MO, 63103, USA
- Xiaowen Shi
  - Division of Biological Sciences, University of Missouri, Columbia, MO, 65211, USA
- Hua Yang
  - Division of Biological Sciences, University of Missouri, Columbia, MO, 65211, USA
- James A Birchler
  - Division of Biological Sciences, University of Missouri, Columbia, MO, 65211, USA
- Jianlin Cheng
  - Electrical Engineering and Computer Science Department, University of Missouri, Columbia, MO, 65211, USA
|
28
|
Zhou M, Li H, Wang X, Guan Y. Evidence of widespread, independent sequence signature for transcription factor cobinding. Genome Res 2021; 31:265-278. [PMID: 33303494 PMCID: PMC7849410 DOI: 10.1101/gr.267310.120] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/12/2020] [Accepted: 12/03/2020] [Indexed: 01/03/2023]
Abstract
Transcription factors (TFs) are the vocabulary that genomes use to regulate gene expression and phenotypes. The interactions among TFs enrich this vocabulary and orchestrate diverse biological processes. Although simple models identify open chromatin and the presence of TF motifs as the two major contributors to TF binding patterns, it remains elusive what contributes to the in vivo TF cobinding landscape. In this study, we developed a machine learning algorithm to explore the contributors of the cobinding patterns. The algorithm substantially outperforms the state-of-the-field models for TF cobinding prediction. Game theory-based feature importance analysis reveals that, for most of the TF pairs we studied, independent motif sequences contribute one or more of the two TFs under investigation to their cobinding patterns. Such independent motif sequences include, but are not limited to, transcription initiation-related proteins and known TF complexes. We found the motif sequence signatures and the TFs are rarely mutual, corroborating a hierarchical and directional organization of the regulatory network and refuting the possibility of artifacts caused by shared sequence similarity with the TFs under investigation. We modeled such regulatory language with directed graphs, which reveal shared, global factors that are related to many binding and cobinding patterns.
Affiliation(s)
- Manqi Zhou
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Hongyang Li
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Xueqing Wang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, USA
|
29
|
Srivastava D, Aydin B, Mazzoni EO, Mahony S. An interpretable bimodal neural network characterizes the sequence and preexisting chromatin predictors of induced transcription factor binding. Genome Biol 2021; 22:20. [PMID: 33413545 PMCID: PMC7788824 DOI: 10.1186/s13059-020-02218-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 06/18/2019] [Accepted: 12/03/2020] [Indexed: 02/06/2023] Open
Abstract
BACKGROUND: Transcription factor (TF) binding specificity is determined via a complex interplay between the transcription factor's DNA binding preference and cell type-specific chromatin environments. The chromatin features that correlate with transcription factor binding in a given cell type have been well characterized. For instance, the binding sites for a majority of transcription factors display concurrent chromatin accessibility. However, concurrent chromatin features reflect the binding activities of the transcription factor itself and thus provide limited insight into how genome-wide TF-DNA binding patterns became established in the first place. To understand the determinants of transcription factor binding specificity, we therefore need to examine how newly activated transcription factors interact with sequence and preexisting chromatin landscapes. RESULTS: Here, we investigate the sequence and preexisting chromatin predictors of TF-DNA binding by examining the genome-wide occupancy of transcription factors that have been induced in well-characterized chromatin environments. We develop Bichrom, a bimodal neural network that jointly models sequence and preexisting chromatin data to interpret the genome-wide binding patterns of induced transcription factors. We find that the preexisting chromatin landscape is a differential global predictor of TF-DNA binding; incorporating preexisting chromatin features improves our ability to explain the binding specificity of some transcription factors substantially, but not others. Furthermore, by analyzing site-level predictors, we show that transcription factor binding in previously inaccessible chromatin tends to correspond to the presence of more favorable cognate DNA sequences. CONCLUSIONS: Bichrom thus provides a framework for modeling, interpreting, and visualizing the joint sequence and chromatin landscapes that determine TF-DNA binding dynamics.
Affiliation(s)
- Divyanshi Srivastava
  - Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, Pennsylvania State University, University Park, PA, USA
- Begüm Aydin
  - Department of Biology, New York University, New York, NY, USA
- Shaun Mahony
  - Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, Pennsylvania State University, University Park, PA, USA
|
30
|
Li H, Guan Y. DeepSleep convolutional neural network allows accurate and fast detection of sleep arousal. Commun Biol 2021; 4:18. [PMID: 33398048 PMCID: PMC7782826 DOI: 10.1038/s42003-020-01542-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 05/08/2020] [Accepted: 12/01/2020] [Indexed: 12/19/2022] Open
Abstract
Sleep arousals are transient periods of wakefulness punctuated into sleep. Excessive sleep arousals are associated with symptoms such as sympathetic activation, non-restorative sleep, and daytime sleepiness. Currently, sleep arousals are mainly annotated by human experts through looking at 30-second epochs (recorded pages) manually, which requires considerable time and effort. Here we present a deep learning approach for automatically segmenting sleep arousal regions based on polysomnographic recordings. Leveraging a specific architecture that 'translates' input polysomnographic signals to sleep arousal labels, this algorithm ranked first in the "You Snooze, You Win" PhysioNet Challenge. We created an augmentation strategy by randomly swapping similar physiological channels, which notably improved the prediction accuracy. Our algorithm enables fast and accurate delineation of sleep arousal events at the speed of 10 seconds per sleep recording. This computational tool would greatly empower the scoring process in clinical settings and accelerate studies on the impact of arousals.
Affiliation(s)
- Hongyang Li
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI, 48109, USA
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI, 48109, USA
|
31
|
Martin PC, Zabet NR. Dissecting the binding mechanisms of transcription factors to DNA using a statistical thermodynamics framework. Comput Struct Biotechnol J 2020; 18:3590-3605. [PMID: 33304457 PMCID: PMC7708957 DOI: 10.1016/j.csbj.2020.11.006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 08/18/2020] [Revised: 11/02/2020] [Accepted: 11/04/2020] [Indexed: 01/22/2023] Open
Abstract
Transcription Factors (TFs) bind to DNA and control the activity of target genes. Here, we present ChIPanalyser, a user-friendly, versatile and powerful R/Bioconductor package for predicting and modelling the binding of TFs to DNA. ChIPanalyser performs similarly to state-of-the-art tools, but is an explainable model and provides biological insights into the binding mechanisms of TFs. We focused on investigating the binding mechanisms of three TFs that are known architectural proteins, CTCF, BEAF-32 and su(Hw), in three Drosophila cell lines (BG3, Kc167 and S2). While CTCF preferentially binds only to a subset of high affinity sites located mainly in open chromatin, BEAF-32 binds to most of its high affinity binding sites available in open chromatin. In contrast, su(Hw) binds to both open chromatin and partially closed chromatin. Most importantly, differences in TF binding profiles between cell lines for these TFs are mainly driven by differences in DNA accessibility and not by differences in TF concentrations between cell lines. Finally, we investigated binding of Hox TFs in Drosophila and found that Ubx binds only in open chromatin, while Abd-B and Dfd are capable of binding in both open and partially closed chromatin. Overall, our results show that TFs display different binding mechanisms and that our model is able to recapitulate their specific binding behaviour.
Affiliation(s)
- Patrick C.N. Martin
  - School of Life Sciences, University of Essex, Colchester CO4 3SQ, UK
  - Biotech Research and Innovation Centre (BRIC), University of Copenhagen, DK-2200 Copenhagen, Denmark
- Nicolae Radu Zabet
  - School of Life Sciences, University of Essex, Colchester CO4 3SQ, UK
  - Blizard Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London E1 2AT, UK
|
32
|
Del Sol A, Jung S. The Importance of Computational Modeling in Stem Cell Research. Trends Biotechnol 2020; 39:126-136. [PMID: 32800604 DOI: 10.1016/j.tibtech.2020.07.006] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 04/17/2020] [Revised: 07/13/2020] [Accepted: 07/15/2020] [Indexed: 12/30/2022]
Abstract
The generation of large amounts of omics data is increasingly enabling not only the processing and analysis of large data sets but also the development of computational models in the field of stem cell research. Although computational models have been proposed in recent decades, we believe that the stem cell community is not fully aware of the potentiality of computational modeling in guiding their experimental research. In this regard, we discuss how single-cell technologies provide the right framework for computational modeling at different scales of biological organization in order to address challenges in the stem cell field and to guide experimentalists in the design of new strategies for stem cell therapies and treatment of congenital disorders.
Affiliation(s)
- Antonio Del Sol
  - Luxembourg Centre for Systems Biomedicine (LCSB), University of Luxembourg, 6 Avenue du Swing, Esch-sur-Alzette, L-4367 Belvaux, Luxembourg
  - CIC bioGUNE-BRTA (Basque Research and Technology Alliance), Bizkaia Technology Park, 801 Building, 48160 Derio, Spain
  - IKERBASQUE, Basque Foundation for Science, Bilbao 48013, Spain
- Sascha Jung
  - CIC bioGUNE-BRTA (Basque Research and Technology Alliance), Bizkaia Technology Park, 801 Building, 48160 Derio, Spain
|
33
|
Srivastava D, Mahony S. Sequence and chromatin determinants of transcription factor binding and the establishment of cell type-specific binding patterns. Biochim Biophys Acta Gene Regul Mech 2020; 1863:194443. [PMID: 31639474 PMCID: PMC7166147 DOI: 10.1016/j.bbagrm.2019.194443] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 06/30/2019] [Revised: 09/21/2019] [Accepted: 10/06/2019] [Indexed: 12/14/2022]
Abstract
Transcription factors (TFs) selectively bind distinct sets of sites in different cell types. Such cell type-specific binding specificity is expected to result from interplay between the TF's intrinsic sequence preferences, cooperative interactions with other regulatory proteins, and cell type-specific chromatin landscapes. Cell type-specific TF binding events are highly correlated with patterns of chromatin accessibility and active histone modifications in the same cell type. However, since concurrent chromatin may itself be a consequence of TF binding, chromatin landscapes measured prior to TF activation provide more useful insights into how cell type-specific TF binding events became established in the first place. Here, we review the various sequence and chromatin determinants of cell type-specific TF binding specificity. We identify the current challenges and opportunities associated with computational approaches to characterizing, imputing, and predicting cell type-specific TF binding patterns. We further focus on studies that characterize TF binding in dynamic regulatory settings, and we discuss how these studies are leading to a more complex and nuanced understanding of dynamic protein-DNA binding activities. We propose that TF binding activities at individual sites can be viewed along a two-dimensional continuum of local sequence and chromatin context. Under this view, cell type-specific TF binding activities may result from either strongly favorable sequence features or strongly favorable chromatin context.
Affiliation(s)
- Divyanshi Srivastava
  - Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America
- Shaun Mahony
  - Center for Eukaryotic Gene Regulation, Department of Biochemistry & Molecular Biology, The Pennsylvania State University, University Park, PA, United States of America
|
34
|
Schreiber J, Bilmes J, Noble WS. Completing the ENCODE3 compendium yields accurate imputations across a variety of assays and human biosamples. Genome Biol 2020; 21:82. [PMID: 32228713 PMCID: PMC7104481 DOI: 10.1186/s13059-020-01978-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 05/28/2019] [Accepted: 02/26/2020] [Indexed: 12/16/2022] Open
Abstract
Recent efforts to describe the human epigenome have yielded thousands of epigenomic and transcriptomic datasets. However, due primarily to cost, the total number of such assays that can be performed is limited. Accordingly, we applied an imputation approach, Avocado, to a dataset of 3814 tracks of data derived from the ENCODE compendium, including measurements of chromatin accessibility, histone modification, transcription, and protein binding. Avocado shows significant improvements in imputing protein binding compared to the top models in the ENCODE-DREAM challenge. Additionally, we show that the Avocado model allows for efficient addition of new assays and biosamples to a pre-trained model.
Affiliation(s)
- Jacob Schreiber
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
- Jeffrey Bilmes
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
  - Department of Electrical Engineering, University of Washington, Seattle, USA
- William Stafford Noble
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
  - Department of Genome Sciences, University of Washington, Seattle, USA
|
35
|
Wu S, Li H, Quang D, Guan Y. Three-Plane-assembled Deep Learning Segmentation of Gliomas. Radiol Artif Intell 2020; 2:e190011. [PMID: 32280947 PMCID: PMC7104789 DOI: 10.1148/ryai.2020190011] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 02/08/2019] [Revised: 10/09/2019] [Accepted: 10/18/2019] [Indexed: 12/15/2022]
Abstract
PURPOSE: To design a computational method for automatic brain glioma segmentation of multimodal MRI scans with high efficiency and accuracy. MATERIALS AND METHODS: The 2018 Multimodal Brain Tumor Segmentation Challenge (BraTS) dataset was used in this study, consisting of routine clinically acquired preoperative multimodal MRI scans. Three subregions of glioma (the necrotic and nonenhancing tumor core, the peritumoral edema, and the contrast-enhancing tumor) were manually labeled by experienced radiologists. Two-dimensional U-Net models were built using a three-plane-assembled approach to segment three subregions individually (three-region model) or to segment only the whole tumor (WT) region (WT-only model). The term three-plane-assembled means that coronal and sagittal images were generated by reformatting the original axial images. The model performance for each case was evaluated in three classes: enhancing tumor (ET), tumor core (TC), and WT. RESULTS: On the internal unseen testing dataset split from the 2018 BraTS training dataset, the proposed models achieved mean Sørensen-Dice scores of 0.80, 0.84, and 0.91, respectively, for ET, TC, and WT. On the BraTS validation dataset, the proposed models achieved mean 95% Hausdorff distances of 3.1 mm, 7.0 mm, and 5.0 mm, respectively, for ET, TC, and WT and mean Sørensen-Dice scores of 0.80, 0.83, and 0.91, respectively, for ET, TC, and WT. On the BraTS testing dataset, the proposed models ranked fourth out of 61 teams. The source code is available at https://github.com/GuanLab/Brain_Glioma. CONCLUSION: This deep learning method consistently segmented subregions of brain glioma with high accuracy, efficiency, reliability, and generalization ability on screening images from a large population, and it can be efficiently implemented in clinical practice to assist neuro-oncologists or radiologists. Supplemental material is available for this article. © RSNA, 2020.
Affiliation(s)
- Shaocheng Wu
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Ave, Ann Arbor, MI 48109
- Hongyang Li
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Ave, Ann Arbor, MI 48109
- Daniel Quang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Ave, Ann Arbor, MI 48109
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Ave, Ann Arbor, MI 48109
|
36
|
Qin Q, Fan J, Zheng R, Wan C, Mei S, Wu Q, Sun H, Brown M, Zhang J, Meyer CA, Liu XS. Lisa: inferring transcriptional regulators through integrative modeling of public chromatin accessibility and ChIP-seq data. Genome Biol 2020; 21:32. [PMID: 32033573 PMCID: PMC7007693 DOI: 10.1186/s13059-020-1934-6] [Citation(s) in RCA: 178] [Impact Index Per Article: 35.6] [Received: 06/05/2019] [Accepted: 01/13/2020] [Indexed: 12/21/2022] Open
Abstract
We developed Lisa (http://lisa.cistrome.org/) to predict the transcriptional regulators (TRs) of differentially expressed or co-expressed gene sets. Based on the input gene sets, Lisa first uses histone mark ChIP-seq and chromatin accessibility profiles to construct a chromatin model related to the regulation of these genes. Using TR ChIP-seq peaks or imputed TR binding sites, Lisa probes the chromatin models using in silico deletion to find the most relevant TRs. Applied to gene sets derived from targeted TF perturbation experiments, Lisa boosted the performance of imputed TR cistromes and outperformed alternative methods in identifying the perturbed TRs.
Affiliation(s)
- Qian Qin
  - Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Science and Technology, Tongji University, Shanghai, 200433, China
  - Center of Molecular Medicine, Children's Hospital of Fudan University, Shanghai, 201102, China
- Jingyu Fan
  - Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Science and Technology, Tongji University, Shanghai, 200433, China
- Rongbin Zheng
  - Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Science and Technology, Tongji University, Shanghai, 200433, China
- Changxin Wan
  - Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Science and Technology, Tongji University, Shanghai, 200433, China
- Shenglin Mei
  - Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Science and Technology, Tongji University, Shanghai, 200433, China
- Qiu Wu
  - Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Science and Technology, Tongji University, Shanghai, 200433, China
- Hanfei Sun
  - Clinical Translational Research Center, Shanghai Pulmonary Hospital, School of Life Science and Technology, Tongji University, Shanghai, 200433, China
- Myles Brown
  - Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, 02215, USA
  - Department of Data Sciences, Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health, Boston, MA, 02215, USA
- Jing Zhang
  - Stem Cell Translational Research Center, Tongji Hospital, School of Life Science and Technology, Tongji University, Shanghai, 200065, China
- Clifford A Meyer
  - Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute, Boston, MA, 02215, USA
  - Department of Data Sciences, Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health, Boston, MA, 02215, USA
- X Shirley Liu
  - Center for Functional Cancer Epigenetics, Dana-Farber Cancer Institute, Boston, MA, 02215, USA
  - Department of Data Sciences, Dana-Farber Cancer Institute and Harvard T.H. Chan School of Public Health, Boston, MA, 02215, USA
|
37
|
Koo PK, Ploenzke M. Deep learning for inferring transcription factor binding sites. Curr Opin Syst Biol 2020; 19:16-23. [PMID: 32905524 PMCID: PMC7469942 DOI: 10.1016/j.coisb.2020.04.001] [Citation(s) in RCA: 37] [Impact Index Per Article: 7.4] [Indexed: 01/28/2023]
Abstract
Deep learning is a powerful tool for predicting transcription factor binding sites from DNA sequence. Despite high predictive accuracy, there is no guarantee that a high-performing deep learning model will learn causal sequence-function relationships. Thus a move beyond performance comparisons on benchmark datasets is needed. Interpreting model predictions is a powerful approach to identify which features drive performance gains and ideally provide insight into the underlying biological mechanisms. Here we highlight timely advances in deep learning for genomics, with a focus on inferring transcription factor binding sites. We describe recent applications, model architectures, and advances in local and global model interpretability methods, then conclude with a discussion on future research directions.
Affiliation(s)
- Peter K Koo
  - Simons Center for Quantitative Biology, Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
- Matt Ploenzke
  - Department of Biostatistics, Harvard University, Cambridge, MA, USA
|
38
|
Li H, Siddiqui O, Zhang H, Guan Y. Joint learning improves protein abundance prediction in cancers. BMC Biol 2019; 17:107. [PMID: 31870366 PMCID: PMC6929375 DOI: 10.1186/s12915-019-0730-9] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 08/15/2019] [Accepted: 12/04/2019] [Indexed: 12/20/2022] Open
Abstract
BACKGROUND: The classic central dogma in biology is the information flow from DNA to mRNA to protein, yet complicated regulatory mechanisms underlying protein translation often lead to weak correlations between mRNA and protein abundances. This is particularly the case in cancer samples and when evaluating the same gene across multiple samples. RESULTS: Here, we report a method for predicting proteome from transcriptome, using a training dataset provided by NCI-CPTAC and TCGA, consisting of transcriptome and proteome data from 77 breast and 105 ovarian cancer samples. First, we establish a generic model capturing the correlation between mRNA and protein abundance of a single gene. Second, we build a gene-specific model capturing the interdependencies among multiple genes in a regulatory network. Third, we create a cross-tissue model by joint learning the information of shared regulatory networks and pathways across cancer tissues. Our method ranked first in the NCI-CPTAC DREAM Proteogenomics Challenge, and the predictive performance is close to the accuracy of experimental replicates. Key functional pathways and network modules controlling the proteomic abundance in cancers were revealed, in particular metabolism-related genes. CONCLUSIONS: We present a method to predict proteome from transcriptome, leveraging data from different cancer tissues to build a trans-tissue model, and suggest how to integrate information from multiple cancers to provide a foundation for further research.
Affiliation(s)
- Hongyang Li
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI, 48109, USA
- Omer Siddiqui
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI, 48109, USA
- Hongjiu Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI, 48109, USA
- Yuanfang Guan
  - Department of Computational Medicine and Bioinformatics, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI, 48109, USA
  - Department of Internal Medicine, University of Michigan, 100 Washtenaw Avenue, Ann Arbor, MI, 48109, USA
|
39
|
Li H, Guan Y. Machine learning empowers phosphoproteome prediction in cancers. Bioinformatics 2019; 36:859-864. [PMID: 31410451 PMCID: PMC7868059 DOI: 10.1093/bioinformatics/btz639] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 04/21/2019] [Revised: 07/25/2019] [Accepted: 08/12/2019] [Indexed: 02/06/2023] Open
Abstract
MOTIVATION: Reversible protein phosphorylation is an essential post-translational modification regulating protein functions and signaling pathways in many cellular processes. Aberrant activation of signaling pathways often contributes to cancer development and progression. The mass spectrometry-based phosphoproteomics technique is a powerful tool to investigate the site-level phosphorylation of the proteome in a global fashion, paving the way for understanding the regulatory mechanisms underlying cancers. However, this approach is time-consuming and requires expensive instruments, specialized expertise and a large amount of starting material. An alternative in silico approach is predicting the phosphoproteomic profiles of cancer patients from the available proteomic, transcriptomic and genomic data. RESULTS: Here, we present a winning algorithm in the 2017 NCI-CPTAC DREAM Proteogenomics Challenge for predicting phosphorylation levels of the proteome across cancer patients. We integrate four components into our algorithm, including (i) baseline correlations between protein and phosphoprotein abundances, (ii) universal protein-protein interactions, (iii) shareable regulatory information across cancer tissues and (iv) associations among multi-phosphorylation sites of the same protein. When tested on a large held-out testing dataset of 108 breast and 62 ovarian cancer samples, our method ranked first in both cancer tissues, demonstrating its robustness and generalization ability. AVAILABILITY AND IMPLEMENTATION: Our code and reproducible results are freely available on GitHub: https://github.com/GuanLab/phosphoproteome_prediction. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hongyang Li
  - To whom correspondence should be addressed.
|