1
Maciejewski E, Horvath S, Ernst J. CMImpute: cross-species and tissue imputation of species-level DNA methylation samples across mammalian species. Genome Biol 2025; 26:133. [PMID: 40394556 PMCID: PMC12090574 DOI: 10.1186/s13059-025-03561-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2024] [Accepted: 03/26/2025] [Indexed: 05/22/2025] Open
Abstract
The large-scale application of the mammalian methylation array has substantially expanded the availability of DNA methylation data in mammalian species. However, this data captures only a small portion of species-tissue combinations. To address this, we develop CMImpute (Cross-species Methylation Imputation), a method based on a conditional variational autoencoder, to impute DNA methylation representing species-tissue combinations. We demonstrate that CMImpute achieves strong sample-wise correlation between imputed and observed values. Using CMImpute and data from 348 species and 59 tissue types, we impute methylation data for 19,786 new species-tissue combinations. We expect CMImpute will be a useful resource for DNA methylation analyses.
Affiliation(s)
- Emily Maciejewski
  - Computer Science Department, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Steve Horvath
  - Dept. of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Dept. of Biostatistics, Fielding School of Public Health, University of California, Los Angeles, Los Angeles, CA, 90095, USA
  - Altos Labs, Cambridge, WA14 2DT, UK
- Jason Ernst
  - Computer Science Department, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
  - Department of Biological Chemistry, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
  - Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
  - Jonsson Comprehensive Cancer Center, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
  - Molecular Biology Institute, University of California, Los Angeles, Los Angeles, CA, 90095, USA.
2
López-Hernández L, Toolan-Kerr P, Bannister AJ, Millán-Zambrano G. Dynamic histone modification patterns coordinating DNA processes. Mol Cell 2025; 85:225-237. [PMID: 39824165 DOI: 10.1016/j.molcel.2024.10.034] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2024] [Revised: 10/02/2024] [Accepted: 10/25/2024] [Indexed: 01/20/2025]
Abstract
Significant effort has been spent attempting to unravel the causal relationship between histone post-translational modifications and fundamental DNA processes, including transcription, replication, and repair. However, less attention has been paid to understanding the reciprocal influence, that is, how DNA processes, in turn, shape the distribution and patterns of histone modifications and how these changes convey information, both temporally and spatially, from one process to another. Here, we review how histone modifications underpin the widespread bidirectional crosstalk between different DNA processes, which allows seemingly distinct phenomena to operate as a unified whole.
Affiliation(s)
- Laura López-Hernández
  - Centro Andaluz de Biología Molecular y Medicina Regenerativa-CABIMER, Universidad de Sevilla-CSIC-Universidad Pablo de Olavide, 41092 Seville, Spain; Departamento de Genética, Universidad de Sevilla, 41012 Seville, Spain
- Patrick Toolan-Kerr
  - Centro Andaluz de Biología Molecular y Medicina Regenerativa-CABIMER, Universidad de Sevilla-CSIC-Universidad Pablo de Olavide, 41092 Seville, Spain; Departamento de Genética, Universidad de Sevilla, 41012 Seville, Spain
- Andrew J Bannister
  - Gurdon Institute and Department of Pathology, University of Cambridge, Cambridge CB2 1QN, UK.
- Gonzalo Millán-Zambrano
  - Centro Andaluz de Biología Molecular y Medicina Regenerativa-CABIMER, Universidad de Sevilla-CSIC-Universidad Pablo de Olavide, 41092 Seville, Spain; Departamento de Genética, Universidad de Sevilla, 41012 Seville, Spain.
3
Murphy AE, Beardall W, Rei M, Phuycharoen M, Skene NG. Predicting cell type-specific epigenomic profiles accounting for distal genetic effects. Nat Commun 2024; 15:9951. [PMID: 39550354 PMCID: PMC11569248 DOI: 10.1038/s41467-024-54441-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/20/2024] [Accepted: 11/06/2024] [Indexed: 11/18/2024] Open
Abstract
Understanding how genetic variants affect the epigenome is key to interpreting GWAS, yet profiling these effects across the non-coding genome remains challenging due to experimental scalability. This necessitates accurate computational models. Existing machine learning approaches, while progressively improving, are confined to the cell types they were trained on, limiting their applicability. Here, we introduce Enformer Celltyping, a deep learning model which incorporates distal effects of DNA interactions, up to 100,000 base-pairs away, to predict epigenetic signals in previously unseen cell types. Using DNA and chromatin accessibility data for epigenetic imputation, Enformer Celltyping outperforms current best-in-class approaches and generalises across cell types and biological regions. Moreover, we propose a framework for evaluating models on genetic variant effect prediction using regulatory quantitative trait loci mapping studies, highlighting current limitations in genomic deep learning models. Despite this, Enformer Celltyping can also be used to study cell type-specific genetic enrichment of complex traits.
Affiliation(s)
- Alan E Murphy
  - UK Dementia Research Institute at Imperial College London, London, W12 0BZ, UK.
  - Department of Brain Sciences, Imperial College London, London, W12 0BZ, UK.
- William Beardall
  - Department of Bioengineering, Imperial College London, London, SW7 2AZ, UK
- Marek Rei
  - Department of Computing, Imperial College London, London, SW7 2RH, UK
- Mike Phuycharoen
  - Division of Informatics, Imaging & Data Sciences, University of Manchester, Manchester, M13 9PL, UK
- Nathan G Skene
  - UK Dementia Research Institute at Imperial College London, London, W12 0BZ, UK.
  - Department of Brain Sciences, Imperial College London, London, W12 0BZ, UK.
4
Wen W, Zhong J, Zhang Z, Jia L, Chu T, Wang N, Danko CG, Wang Z. dHICA: a deep transformer-based model enables accurate histone imputation from chromatin accessibility. Brief Bioinform 2024; 25:bbae459. [PMID: 39316943 PMCID: PMC11421843 DOI: 10.1093/bib/bbae459] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2024] [Revised: 07/13/2024] [Accepted: 09/04/2024] [Indexed: 09/26/2024] Open
Abstract
Histone modifications (HMs) are pivotal in various biological processes, including transcription, replication, and DNA repair, significantly impacting chromatin structure. These modifications underpin the molecular mechanisms of cell-type-specific gene expression and complex diseases. However, annotating HMs across different cell types solely using experimental approaches is impractical due to cost and time constraints. Herein, we present dHICA (deep histone imputation using chromatin accessibility), a novel deep learning framework that integrates DNA sequences and chromatin accessibility data to predict multiple HM tracks. Employing the transformer architecture alongside dilated convolutions, dHICA boasts an extensive receptive field and captures more cell-type-specific information. dHICA outperforms state-of-the-art baselines and achieves superior performance in cell-type-specific loci and gene elements, aligning with biological expectations. Furthermore, dHICA's imputations hold significant potential for downstream applications, including chromatin state segmentation and elucidating the functional implications of SNPs (Single Nucleotide Polymorphisms). In conclusion, dHICA serves as a valuable tool for advancing the understanding of chromatin dynamics, offering enhanced predictive capabilities and interpretability.
Affiliation(s)
- Wen Wen
  - School of Software Technology, Dalian University of Technology, Linggong Rd, Liaoning 116024, China
- Jiaxin Zhong
  - School of Software Technology, Dalian University of Technology, Linggong Rd, Liaoning 116024, China
- Zhaoxi Zhang
  - School of Software Technology, Dalian University of Technology, Linggong Rd, Liaoning 116024, China
- Lijuan Jia
  - School of Software Technology, Dalian University of Technology, Linggong Rd, Liaoning 116024, China
- Tinyi Chu
  - Meinig School of Biomedical Engineering, Cornell University, Weill Hall, Ithaca, NY 14853, United States
- Nating Wang
  - Department of Molecular Biology and Genetics, Cornell University, Biotechnology Building, Ithaca, NY 14853, United States
- Charles G Danko
  - Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Hungerford Hill Rd, Ithaca, NY 14853, United States
  - Department of Biomedical Sciences, College of Veterinary Medicine, Cornell University, Tower Rd, Ithaca, NY 14853, United States
- Zhong Wang
  - School of Software Technology, Dalian University of Technology, Linggong Rd, Liaoning 116024, China
5
Tan ZC, Meyer AS. The structure is the message: Preserving experimental context through tensor decomposition. Cell Syst 2024; 15:679-693. [PMID: 39173584 PMCID: PMC11366223 DOI: 10.1016/j.cels.2024.07.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/03/2024] [Revised: 06/25/2024] [Accepted: 07/22/2024] [Indexed: 08/24/2024]
Abstract
Recent biological studies have been revolutionized in scale and granularity by multiplex and high-throughput assays. Profiling cell responses across several experimental parameters, such as perturbations, time, and genetic contexts, leads to richer and more generalizable findings. However, these multidimensional datasets necessitate a reevaluation of the conventional methods for their representation and analysis. Traditionally, experimental parameters are merged to flatten the data into a two-dimensional matrix, sacrificing crucial experiment context reflected by the structure. As Marshall McLuhan famously stated, "the medium is the message." In this work, we propose that the experiment structure is the medium in which subsequent analysis is performed, and the optimal choice of data representation must reflect the experiment structure. We review how tensor-structured analyses and decompositions can preserve this information. We contend that tensor methods are poised to become integral to the biomedical data sciences toolkit.
Affiliation(s)
- Zhixin Cyrillus Tan
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles (UCLA), Los Angeles, CA, USA.
- Aaron S Meyer
  - Bioinformatics Interdepartmental Program, University of California, Los Angeles (UCLA), Los Angeles, CA, USA; Department of Bioengineering, UCLA, Los Angeles, CA, USA; Jonsson Comprehensive Cancer Center, UCLA, Los Angeles, CA, USA; Eli and Edythe Broad Center of Regenerative Medicine and Stem Cell Research, UCLA, Los Angeles, CA, USA.
6
Min A, Schreiber J, Kundaje A, Noble WS. Predicting chromatin conformation contact maps. bioRxiv 2024:2024.04.12.589240. [PMID: 38645064 PMCID: PMC11030330 DOI: 10.1101/2024.04.12.589240] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/23/2024]
Abstract
Over the past 15 years, a variety of next-generation sequencing assays have been developed for measuring the 3D conformation of DNA in the nucleus. Each of these assays gives, for a particular cell or tissue type, a distinct picture of 3D chromatin architecture. Accordingly, making sense of the relationship between genome structure and function requires teasing apart two closely related questions: how does chromatin 3D structure change from one cell type to the next, and how do different measurements of that structure differ from one another, even when the two assays are carried out in the same cell type? In this work, we assemble a collection of chromatin 3D datasets, each represented as a 2D contact map, spanning multiple assay types and cell types. We then build a machine learning model that predicts missing contact maps in this collection. We use the model to systematically explore how genome 3D architecture changes, at the level of compartments, domains, and loops, between cell types and between assay types.
Affiliation(s)
- Alan Min
  - Department of Statistics, University of Washington
- William Stafford Noble
  - Department of Genome Sciences, University of Washington
  - Paul G. Allen School of Computer Science and Engineering, University of Washington
7
Xiang G, Guo Y, Bumcrot D, Sigova A. JMnorm: a novel joint multi-feature normalization method for integrative and comparative epigenomics. Nucleic Acids Res 2024; 52:e11. [PMID: 38055833 PMCID: PMC10810286 DOI: 10.1093/nar/gkad1146] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2023] [Revised: 10/25/2023] [Accepted: 11/14/2023] [Indexed: 12/08/2023] Open
Abstract
Combinatorial patterns of epigenetic features reflect transcriptional states and functions of genomic regions. While many epigenetic features have correlated relationships, most existing data normalization approaches analyze each feature independently. Such strategies may distort relationships between functionally correlated epigenetic features and hinder biological interpretation. We present a novel approach named JMnorm that simultaneously normalizes multiple epigenetic features across cell types, species, and experimental conditions by leveraging information from partially correlated epigenetic features. We demonstrate that JMnorm-normalized data can better preserve cross-epigenetic-feature correlations across different cell types and enhance consistency between biological replicates than data normalized by other methods. Additionally, we show that JMnorm-normalized data can consistently improve the performance of various downstream analyses, which include candidate cis-regulatory element clustering, cross-cell-type gene expression prediction, detection of transcription factor binding and changes upon perturbations. These findings suggest that JMnorm effectively minimizes technical noise while preserving true biologically significant relationships between epigenetic datasets. We anticipate that JMnorm will enhance integrative and comparative epigenomics.
Affiliation(s)
- Guanjue Xiang
  - CAMP4 Therapeutics Corp., One Kendall Square, Building 1400 West, Cambridge, MA 02139, USA
- Yuchun Guo
  - CAMP4 Therapeutics Corp., One Kendall Square, Building 1400 West, Cambridge, MA 02139, USA
- David Bumcrot
  - CAMP4 Therapeutics Corp., One Kendall Square, Building 1400 West, Cambridge, MA 02139, USA
- Alla Sigova
  - CAMP4 Therapeutics Corp., One Kendall Square, Building 1400 West, Cambridge, MA 02139, USA
8
Yu X, Zhao H, Wang R, Chen Y, Ouyang X, Li W, Sun Y, Peng A. Cancer epigenetics: from laboratory studies and clinical trials to precision medicine. Cell Death Discov 2024; 10:28. [PMID: 38225241 PMCID: PMC10789753 DOI: 10.1038/s41420-024-01803-z] [Citation(s) in RCA: 72] [Impact Index Per Article: 72.0] [Received: 10/13/2023] [Revised: 12/23/2023] [Accepted: 01/04/2024] [Indexed: 01/17/2024] Open
Abstract
Epigenetic dysregulation is a common feature of a myriad of human diseases, particularly cancer. Defining the epigenetic defects associated with malignant tumors has become a focus of cancer research resulting in the gradual elucidation of cancer cell epigenetic regulation. In fact, most stages of tumor progression, including tumorigenesis, promotion, progression, and recurrence are accompanied by epigenetic alterations, some of which can be reversed by epigenetic drugs. The main objective of epigenetic therapy in the era of personalized precision medicine is to detect cancer biomarkers to improve risk assessment, diagnosis, and targeted treatment interventions. Rapid technological advancements streamlining the characterization of molecular epigenetic changes associated with cancers have propelled epigenetic drug research and development. This review summarizes the main mechanisms of epigenetic dysregulation and discusses past and present examples of epigenetic inhibitors in cancer diagnosis and treatment, with an emphasis on the development of epigenetic enzyme inhibitors or drugs. In the final part, the prospect of precise diagnosis and treatment is considered based on a better understanding of epigenetic abnormalities in cancer.
Affiliation(s)
- Xinyang Yu
  - Guangdong Provincial Key Laboratory of Tumor Interventional Diagnosis and Treatment, Zhuhai Institute of Translational Medicine, (Zhuhai People's Hospital Zhuhai Clinical Medical College of Jinan University), Zhuhai, 519000, China
- Hao Zhao
  - Department of Spinal Surgery, Yichang Central People's Hospital Affiliated with China Three Gorges University, Yichang, Hubei, 443000, China
- Ruiqi Wang
  - Department of Pharmacy, Zhuhai People's Hospital, Zhuhai People's Hospital (Zhuhai Clinical Medical College of Jinan University), Zhuhai, Guangdong, 519000, China
- Yingyin Chen
  - Guangdong Provincial Key Laboratory of Tumor Interventional Diagnosis and Treatment, Zhuhai Institute of Translational Medicine, (Zhuhai People's Hospital Zhuhai Clinical Medical College of Jinan University), Zhuhai, 519000, China
- Xumei Ouyang
  - Guangdong Provincial Key Laboratory of Tumor Interventional Diagnosis and Treatment, Zhuhai Institute of Translational Medicine, (Zhuhai People's Hospital Zhuhai Clinical Medical College of Jinan University), Zhuhai, 519000, China
- Wenting Li
  - Guangdong Provincial Key Laboratory of Tumor Interventional Diagnosis and Treatment, Zhuhai Institute of Translational Medicine, (Zhuhai People's Hospital Zhuhai Clinical Medical College of Jinan University), Zhuhai, 519000, China
- Yihao Sun
  - Guangdong Provincial Key Laboratory of Tumor Interventional Diagnosis and Treatment, Zhuhai Institute of Translational Medicine, (Zhuhai People's Hospital Zhuhai Clinical Medical College of Jinan University), Zhuhai, 519000, China.
- Anghui Peng
  - Guangdong Provincial Key Laboratory of Tumor Interventional Diagnosis and Treatment, Zhuhai Institute of Translational Medicine, (Zhuhai People's Hospital Zhuhai Clinical Medical College of Jinan University), Zhuhai, 519000, China.
9
Fadri MTM, Lee JB, Keung AJ. Summary of ChIP-Seq Methods and Description of an Optimized ChIP-Seq Protocol. Methods Mol Biol 2024; 2842:419-447. [PMID: 39012609 DOI: 10.1007/978-1-0716-4051-7_22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/17/2024]
Abstract
Chromatin immunoprecipitation (ChIP) is an invaluable method to characterize interactions between proteins and genomic DNA, such as the genomic localization of transcription factors and post-translational modification of histones. DNA and proteins are reversibly and covalently crosslinked using formaldehyde. Then the cells are lysed to release the chromatin. The chromatin is fragmented into smaller sizes either by micrococcal nuclease (MN) or sonication and then purified from other cellular components. The protein-DNA complexes are enriched by immunoprecipitation (IP) with antibodies that target the epitope of interest. The DNA is released from the proteins by heat and protease treatment, followed by degradation of contaminating RNAs with RNase. The resulting DNA is analyzed using various methods, including polymerase chain reaction (PCR), quantitative PCR (qPCR), or sequencing. This protocol outlines each of these steps for both yeast and human cells. This chapter includes a contextual discussion of the combination of ChIP with DNA analysis methods such as ChIP-on-Chip, ChIP-qPCR, and ChIP-Seq, recent updates on ChIP-Seq data analysis pipelines, complementary methods for identification of binding sites of DNA binding proteins, and additional protocol information about ChIP-qPCR and ChIP-Seq.
Affiliation(s)
- Maria Theresa M Fadri
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC, USA.
- Jessica B Lee
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC, USA
- Albert J Keung
  - Department of Chemical and Biomolecular Engineering, North Carolina State University, Raleigh, NC, USA.
10
Hawkins-Hooker A, Visonà G, Narendra T, Rojas-Carulla M, Schölkopf B, Schweikert G. Getting personal with epigenetics: towards individual-specific epigenomic imputation with machine learning. Nat Commun 2023; 14:4750. [PMID: 37550323 PMCID: PMC10406842 DOI: 10.1038/s41467-023-40211-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/09/2022] [Accepted: 07/18/2023] [Indexed: 08/09/2023] Open
Abstract
Epigenetic modifications are dynamic mechanisms involved in the regulation of gene expression. Unlike the DNA sequence, epigenetic patterns vary not only between individuals, but also between different cell types within an individual. Environmental factors, somatic mutations and ageing contribute to epigenetic changes that may constitute early hallmarks or causal factors of disease. Epigenetic modifications are reversible and thus promising therapeutic targets for precision medicine. However, mapping efforts to determine an individual's cell-type-specific epigenome are constrained by experimental costs and tissue accessibility. To address these challenges, we developed eDICE, an attention-based deep learning model that is trained to impute missing epigenomic tracks by conditioning on observed tracks. Using a recently published set of epigenomes from four individual donors, we show that transfer learning across individuals allows eDICE to successfully predict individual-specific epigenetic variation even in tissues that are unmapped in a given donor. These results highlight the potential of machine learning-based imputation methods to advance personalized epigenomics.
Affiliation(s)
- Alex Hawkins-Hooker
  - School of Life Sciences, University of Dundee, Dow Street, Dundee, DD1 5EH, UK.
  - Empirical Inference Department, Max-Planck Institute for Intelligent Systems, Max-Planck-Ring 4, Tübingen, 72076, Germany.
  - Centre for Artificial Intelligence, University College London, London, UK.
- Giovanni Visonà
  - Empirical Inference Department, Max-Planck Institute for Intelligent Systems, Max-Planck-Ring 4, Tübingen, 72076, Germany
- Tanmayee Narendra
  - School of Life Sciences, University of Dundee, Dow Street, Dundee, DD1 5EH, UK
  - Interfaculty Institute for Biomedical Informatics, University of Tübingen, Sand 13, Tübingen, 72076, Germany
- Mateo Rojas-Carulla
  - Empirical Inference Department, Max-Planck Institute for Intelligent Systems, Max-Planck-Ring 4, Tübingen, 72076, Germany
- Bernhard Schölkopf
  - Empirical Inference Department, Max-Planck Institute for Intelligent Systems, Max-Planck-Ring 4, Tübingen, 72076, Germany
- Gabriele Schweikert
  - School of Life Sciences, University of Dundee, Dow Street, Dundee, DD1 5EH, UK.
  - Interfaculty Institute for Biomedical Informatics, University of Tübingen, Sand 13, Tübingen, 72076, Germany.
11
Schreiber JM, Boix CA, Wook Lee J, Li H, Guan Y, Chang CC, Chang JC, Hawkins-Hooker A, Schölkopf B, Schweikert G, Carulla MR, Canakoglu A, Guzzo F, Nanni L, Masseroli M, Carman MJ, Pinoli P, Hong C, Yip KY, Spence JP, Batra SS, Song YS, Mahony S, Zhang Z, Tan W, Shen Y, Sun Y, Shi M, Adrian J, Sandstrom RS, Farrell NP, Halow JM, Lee K, Jiang L, Yang X, Epstein CB, Strattan JS, Bernstein BE, Snyder MP, Kellis M, Noble WS, Kundaje AB. The ENCODE Imputation Challenge: a critical assessment of methods for cross-cell type imputation of epigenomic profiles. Genome Biol 2023; 24:79. [PMID: 37072822 PMCID: PMC10111747 DOI: 10.1186/s13059-023-02915-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 07/30/2022] [Accepted: 03/24/2023] [Indexed: 04/20/2023] Open
Abstract
A promising alternative to comprehensively performing genomics experiments is to, instead, perform a subset of experiments and use computational methods to impute the remainder. However, identifying the best imputation methods and what measures meaningfully evaluate performance are open questions. We address these questions by comprehensively analyzing 23 methods from the ENCODE Imputation Challenge. We find that imputation evaluations are challenging and confounded by distributional shifts from differences in data collection and processing over time, the amount of available data, and redundancy among performance measures. Our analyses suggest simple steps for overcoming these issues and promising directions for more robust research.
Affiliation(s)
- Carles A Boix
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Jin Wook Lee
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Hongyang Li
  - Department of computational medicine and bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Yuanfang Guan
  - Department of computational medicine and bioinformatics, University of Michigan, Ann Arbor, MI, USA
- Chun-Chieh Chang
  - Department of Research and Development, DeepSeq.AI, San Francisco, CA, USA
- Jen-Chien Chang
  - RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Alex Hawkins-Hooker
  - Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Stuttgart, Germany
- Bernhard Schölkopf
  - Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Stuttgart, Germany
- Mateo Rojas Carulla
  - Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Stuttgart, Germany
- Arif Canakoglu
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milano, Italy
- Francesco Guzzo
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milano, Italy
- Luca Nanni
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
- Marco Masseroli
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milano, Italy
- Mark James Carman
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milano, Italy
- Pietro Pinoli
  - Department of Electronics, Information and Bioengineering, Politecnico di Milano, Milano, Italy
- Chenyang Hong
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, Sha Tin, Hong Kong
- Kevin Y Yip
  - Sanford Burnham Prebys Medical Discovery Institute, San Diego, CA, USA
- Jeffrey P Spence
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Sanjit Singh Batra
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
- Yun S Song
  - Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA, USA
  - Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
- Shaun Mahony
  - Department of Biochemistry & Molecular Biology, Center for Eukaryotic Gene Regulation, Pennsylvania State University, University Park, PA, USA
- Zheng Zhang
  - Department of Statistics, Pennsylvania State University, University Park, PA, USA
- Wuwei Tan
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA
- Yang Shen
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA
- Yuanfei Sun
  - Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA
- Minyi Shi
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Jessika Adrian
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Nina P Farrell
  - Epigenomics Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Lixia Jiang
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Xinqiong Yang
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Charles B Epstein
  - Epigenomics Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- J Seth Strattan
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Bradley E Bernstein
  - Epigenomics Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Manolis Kellis
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- William S Noble
  - Department of Genome Sciences, University of Washington, Seattle, WA, USA
- Anshul Bharat Kundaje
  - Department of Genetics, Stanford University, Stanford, CA, USA
  - Department of Computer Science, Stanford University, Stanford, CA, USA
12
Llera A, Brammer M, Oakley B, Tillmann J, Zabihi M, Amelink JS, Mei T, Charman T, Ecker C, Dell'Acqua F, Banaschewski T, Moessnang C, Baron-Cohen S, Holt R, Durston S, Murphy D, Loth E, Buitelaar JK, Floris DL, Beckmann CF. Evaluation of data imputation strategies in complex, deeply-phenotyped data sets: the case of the EU-AIMS Longitudinal European Autism Project. BMC Med Res Methodol 2022; 22:229. [PMID: 35971088 PMCID: PMC9380301 DOI: 10.1186/s12874-022-01656-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/10/2022] [Accepted: 06/02/2022] [Indexed: 12/19/2022] Open
Abstract
An increasing number of large-scale multi-modal research initiatives has been conducted in the typically developing population, e.g. Dev. Cogn. Neur. 32:43-54, 2018; PLoS Med. 12(3):e1001779, 2015; Elam and Van Essen, Enc. Comp. Neur., 2013, as well as in psychiatric cohorts, e.g. Trans. Psych. 10(1):100, 2020; Mol. Psych. 19:659–667, 2014; Mol. Aut. 8:24, 2017; Eur. Child and Adol. Psych. 24(3):265–281, 2015. Missing data is a common problem in such datasets due to the difficulty of assessing multiple measures on a large number of participants. The consequences of missing data accumulate when researchers aim to integrate relationships across multiple measures. Here we aim to evaluate different imputation strategies to fill in missing values in clinical data from a large (total N = 764) and deeply phenotyped (i.e. range of clinical and cognitive instruments administered) sample of N = 453 autistic individuals and N = 311 control individuals recruited as part of the EU-AIMS Longitudinal European Autism Project (LEAP) consortium. In particular, we consider a total of 160 clinical measures divided into 15 overlapping subsets of participants. We use two simple but common univariate strategies (mean and median imputation) as well as a Round Robin regression approach involving four independent multivariate regression models including Bayesian Ridge regression, as well as several non-linear models: Decision Trees (Extra Trees), and Nearest Neighbours regression. We evaluate the models using the traditional mean square error on removed available data, and also consider the Kullback–Leibler divergence between the observed and the imputed distributions. We show that all of the multivariate approaches tested provide a substantial improvement compared to typical univariate approaches. Further, our analyses reveal that across all 15 data-subsets tested, an Extra Trees regression approach provided the best global results. This not only allows the selection of a unique model to impute missing data for the LEAP project and delivers a fixed set of imputed clinical data to be used by researchers working with the LEAP dataset in the future, but also provides more general guidelines for data imputation in large scale epidemiological studies.
Affiliation(s)
- A Llera
- Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging, Nijmegen, The Netherlands.
- Department of Cognitive Neuroscience, Radboud University Medical Centre, Nijmegen, The Netherlands.
- LIS Data Solutions, Machine Learning Group, Santander, Spain.
| | - M Brammer
- Institute of Psychiatry, Psychology, and Neuroscience, Sackler Institute for Translational Neurodevelopment, King's College London, London, UK
| | - B Oakley
- Department of Forensic and Neurodevelopmental Sciences, Institute of Psychiatry, Psychology, and Neuroscience, King's College London, London, UK
| | - J Tillmann
- Roche Pharma Research and Early Development, Neuroscience and Rare Diseases, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd., Basel, Switzerland
| | - M Zabihi
- Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging, Nijmegen, The Netherlands
- Department of Cognitive Neuroscience, Radboud University Medical Centre, Nijmegen, The Netherlands
| | - J S Amelink
- Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging, Nijmegen, The Netherlands
- Max Planck Institute for Psycholinguistics, Language & Genetics Department, Nijmegen, The Netherlands
| | - T Mei
- Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging, Nijmegen, The Netherlands
- Department of Cognitive Neuroscience, Radboud University Medical Centre, Nijmegen, The Netherlands
| | - T Charman
- Department of Psychology, Institute of Psychiatry, Psychology, and Neuroscience, King's College London, London, UK
| | - C Ecker
- Institute of Psychiatry, Psychology, and Neuroscience, Sackler Institute for Translational Neurodevelopment, King's College London, London, UK
- Department of Child and Adolescent Psychiatry, Psychosomatics and Psychotherapy, University Hospital Frankfurt Am Main, Goethe University, Frankfurt, Germany
| | - F Dell'Acqua
- Institute of Psychiatry, Psychology, and Neuroscience, Sackler Institute for Translational Neurodevelopment, King's College London, London, UK
| | - T Banaschewski
- Child and Adolescent Psychiatry, Central Institute of Mental Health, University of Heidelberg, Mannheim, Germany
| | - C Moessnang
- Department of Child and Adolescent Psychiatry, Psychosomatics and Psychotherapy, University Hospital Frankfurt Am Main, Goethe University, Frankfurt, Germany
- Department of Applied Psychology, SRH University, Heidelberg, Germany
| | - S Baron-Cohen
- Autism Research Centre, Department of Psychiatry, University of Cambridge, Cambridge, UK
| | - R Holt
- Autism Research Centre, Department of Psychiatry, University of Cambridge, Cambridge, UK
| | - S Durston
- Department of Psychiatry, Brain Center Rudolf Magnus, University Medical Center Utrecht, Utrecht, The Netherlands
| | - D Murphy
- Institute of Psychiatry, Psychology, and Neuroscience, Sackler Institute for Translational Neurodevelopment, King's College London, London, UK
- Department of Forensic and Neurodevelopmental Sciences, Institute of Psychiatry, Psychology, and Neuroscience, King's College London, London, UK
| | - E Loth
- Institute of Psychiatry, Psychology, and Neuroscience, Sackler Institute for Translational Neurodevelopment, King's College London, London, UK
- Department of Forensic and Neurodevelopmental Sciences, Institute of Psychiatry, Psychology, and Neuroscience, King's College London, London, UK
| | - J K Buitelaar
- Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging, Nijmegen, The Netherlands
- Department of Cognitive Neuroscience, Radboud University Medical Centre, Nijmegen, The Netherlands
- Karakter Child and Adolescent Psychiatry University Centre, Nijmegen, The Netherlands
| | - D L Floris
- Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging, Nijmegen, The Netherlands
- Department of Cognitive Neuroscience, Radboud University Medical Centre, Nijmegen, The Netherlands
- Methods of Plasticity Research, Department of Psychology, University of Zurich, Zurich, Switzerland
| | - C F Beckmann
- Donders Institute for Brain, Cognition and Behaviour, Centre for Cognitive Neuroimaging, Nijmegen, The Netherlands
- Department of Cognitive Neuroscience, Radboud University Medical Centre, Nijmegen, The Netherlands
- Wellcome Centre for Integrative Neuroimaging - Centre for Functional MRI of the Brain (WIN FMRIB), University of Oxford, Oxford, UK
|
13
|
Dsouza KB, Li AY, Bhargava VK, Libbrecht MW. Latent Representation of the Human Pan-Celltype Epigenome Through a Deep Recurrent Neural Network. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:2313-2323. [PMID: 34043510 DOI: 10.1109/tcbb.2021.3084147] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/12/2023]
Abstract
The availability of thousands of assays of epigenetic activity necessitates compressed representations of these data sets that summarize the epigenetic landscape of the genome. Until recently, most such representations were cell type-specific, applying to a single tissue or cell state. Recently, neural networks have made it possible to summarize data across tissues to produce a pan-cell type representation. In this work, we propose Epi-LSTM, a deep long short-term memory (LSTM) recurrent neural network autoencoder to capture the long-term dependencies in the epigenomic data. The latent representations from Epi-LSTM capture a variety of genomic phenomena, including gene-expression, promoter-enhancer interactions, replication timing, frequently interacting regions, and evolutionary conservation. These representations outperform existing methods in a majority of cell types while yielding smoother representations along the genomic axis due to their sequential nature.
|
14
|
Albrecht S, Andreani T, Andrade-Navarro MA, Fontaine JF. Single-cell specific and interpretable machine learning models for sparse scChIP-seq data imputation. PLoS One 2022; 17:e0270043. [PMID: 35776722 PMCID: PMC9249201 DOI: 10.1371/journal.pone.0270043] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/17/2021] [Accepted: 06/02/2022] [Indexed: 11/19/2022] Open
Abstract
MOTIVATION: Single-cell Chromatin ImmunoPrecipitation DNA-Sequencing (scChIP-seq) analysis is challenging due to data sparsity. High degree of sparsity in biological high-throughput single-cell data is generally handled with imputation methods that complete the data, but specific methods for scChIP-seq are lacking. We present SIMPA, a scChIP-seq data imputation method leveraging predictive information within bulk data from the ENCODE project to impute missing protein-DNA interacting regions of target histone marks or transcription factors. RESULTS: Imputations using machine learning models trained for each single cell, each ChIP protein target, and each genomic region accurately preserve cell type clustering and improve pathway-related gene identification on real human data. Results on bulk data simulating single cells show that the imputations are single-cell specific as the imputed profiles are closer to the simulated cell than to other cells related to the same ChIP protein target and the same cell type. Simulations also show that 100 input genomic regions are already enough to train single-cell specific models for the imputation of thousands of undetected regions. Furthermore, SIMPA enables the interpretation of machine learning models by revealing interaction sites of a given single cell that are most important for the imputation model trained for a specific genomic region. The corresponding feature importance values derived from promoter-interaction profiles of H3K4me3, an activating histone mark, highly correlate with co-expression of genes that are present within the cell-type specific pathways in 2 real human and mouse datasets. SIMPA's interpretable imputation method allows users to gain a deep understanding of individual cells and, consequently, of sparse scChIP-seq datasets. AVAILABILITY AND IMPLEMENTATION: Our interpretable imputation algorithm was implemented in Python and is available at https://github.com/salbrec/SIMPA.
Affiliation(s)
- Steffen Albrecht
- Institute of Organismic and Molecular Evolution (iOME), Faculty of Biology, Johannes Gutenberg University Mainz, Mainz, Germany
| | - Tommaso Andreani
- Institute of Organismic and Molecular Evolution (iOME), Faculty of Biology, Johannes Gutenberg University Mainz, Mainz, Germany
- Institute of Molecular Biology, Mainz, Germany
| | - Miguel A. Andrade-Navarro
- Institute of Organismic and Molecular Evolution (iOME), Faculty of Biology, Johannes Gutenberg University Mainz, Mainz, Germany
| | - Jean Fred Fontaine
- Institute of Organismic and Molecular Evolution (iOME), Faculty of Biology, Johannes Gutenberg University Mainz, Mainz, Germany
|
15
|
Luo K, Zhong J, Safi A, Hong LK, Tewari AK, Song L, Reddy TE, Ma L, Crawford GE, Hartemink AJ. Profiling the quantitative occupancy of myriad transcription factors across conditions by modeling chromatin accessibility data. Genome Res 2022; 32:1183-1198. [PMID: 35609992 PMCID: PMC9248881 DOI: 10.1101/gr.272203.120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/30/2020] [Accepted: 05/06/2022] [Indexed: 11/24/2022]
Abstract
Over a thousand different transcription factors (TFs) bind with varying occupancy across the human genome. Chromatin immunoprecipitation (ChIP) can assay occupancy genome-wide, but only one TF at a time, limiting our ability to comprehensively observe the TF occupancy landscape, let alone quantify how it changes across conditions. We developed TF occupancy profiler (TOP), a Bayesian hierarchical regression framework, to profile genome-wide quantitative occupancy of numerous TFs using data from a single chromatin accessibility experiment (DNase- or ATAC-seq). TOP is supervised, and its hierarchical structure allows it to predict the occupancy of any sequence-specific TF, even those never assayed with ChIP. We used TOP to profile the quantitative occupancy of hundreds of sequence-specific TFs at sites throughout the genome and examined how their occupancies changed in multiple contexts: in approximately 200 human cell types, through 12 h of exposure to different hormones, and across the genetic backgrounds of 70 individuals. TOP enables cost-effective exploration of quantitative changes in the landscape of TF binding.
Affiliation(s)
- Kaixuan Luo
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
- Department of Human Genetics, The University of Chicago, Chicago, Illinois 60637, USA
| | - Jianling Zhong
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
| | - Alexias Safi
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
| | - Linda K Hong
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
| | - Alok K Tewari
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, Massachusetts 02215, USA
| | - Lingyun Song
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
| | - Timothy E Reddy
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Biostatistics and Bioinformatics, Durham, North Carolina 27710, USA
- Department of Molecular Genetics and Microbiology, Duke University Medical Center, Durham, North Carolina 27710, USA
- Department of Biomedical Engineering, Duke University, Durham, North Carolina 27708, USA
| | - Li Ma
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Department of Statistical Science, Duke University, Durham, North Carolina 27708, USA
| | - Gregory E Crawford
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Pediatrics, Duke University Medical Center, Durham, North Carolina 27710, USA
| | - Alexander J Hartemink
- Computational Biology & Bioinformatics Graduate Program, Duke University, Durham, North Carolina 27708, USA
- Center for Genomic and Computational Biology, Duke University, Durham, North Carolina 27708, USA
- Department of Computer Science, Duke University, Durham, North Carolina 27708, USA
- Department of Biology, Duke University, Durham, North Carolina 27708, USA
|
16
|
Hesami M, Alizadeh M, Jones AMP, Torkamaneh D. Machine learning: its challenges and opportunities in plant system biology. Appl Microbiol Biotechnol 2022; 106:3507-3530. [PMID: 35575915 DOI: 10.1007/s00253-022-11963-6] [Citation(s) in RCA: 24] [Impact Index Per Article: 8.0] [Received: 03/14/2022] [Revised: 03/14/2022] [Accepted: 05/07/2022] [Indexed: 12/25/2022]
Abstract
Sequencing technologies are evolving at a rapid pace, enabling the generation of massive amounts of data in multiple dimensions (e.g., genomics, epigenomics, transcriptomics, metabolomics, proteomics, and single-cell omics) in plants. To provide comprehensive insights into the complexity of plant biological systems, it is important to integrate different omics datasets. Although recent advances in computational analytical pipelines have enabled efficient and high-quality exploration and exploitation of single omics data, the integration of multidimensional, heterogenous, and large datasets (i.e., multi-omics) remains a challenge. In this regard, machine learning (ML) offers promising approaches to integrate large datasets and to recognize fine-grained patterns and relationships. Nevertheless, they require rigorous optimizations to process multi-omics-derived datasets. In this review, we discuss the main concepts of machine learning as well as the key challenges and solutions related to the big data derived from plant system biology. We also provide in-depth insight into the principles of data integration using ML, as well as challenges and opportunities in different contexts including multi-omics, single-cell omics, protein function, and protein-protein interaction. KEY POINTS: • The key challenges and solutions related to the big data derived from plant system biology have been highlighted. • Different methods of data integration have been discussed. • Challenges and opportunities of the application of machine learning in plant system biology have been highlighted and discussed.
Affiliation(s)
- Mohsen Hesami
- Department of Plant Agriculture, University of Guelph, Guelph, ON, N1G 2W1, Canada
| | - Milad Alizadeh
- Department of Botany, University of British Columbia, Vancouver, BC, V6T 1Z4, Canada
| | | | - Davoud Torkamaneh
- Département de Phytologie, Université Laval, Québec City, QC, G1V 0A6, Canada.
- Institut de Biologie Intégrative Et Des Systèmes (IBIS), Université Laval, Québec City, QC, G1V 0A6, Canada.
|
17
|
Daneshpajouh H, Chen B, Shokraneh N, Masoumi S, Wiese KC, Libbrecht MW. Continuous chromatin state feature annotation of the human epigenome. Bioinformatics 2022; 38:3029-3036. [PMID: 35451453 PMCID: PMC9154241 DOI: 10.1093/bioinformatics/btac283] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/22/2021] [Revised: 02/18/2022] [Accepted: 04/18/2022] [Indexed: 12/02/2022] Open
Abstract
Motivation: Segmentation and genome annotation (SAGA) algorithms are widely used to understand genome activity and gene regulation. These methods take as input a set of sequencing-based assays of epigenomic activity, such as ChIP-seq measurements of histone modification and transcription factor binding. They output an annotation of the genome that assigns a chromatin state label to each genomic position. Existing SAGA methods have several limitations caused by the discrete annotation framework: such annotations cannot easily represent varying strengths of genomic elements, and they cannot easily represent combinatorial elements that simultaneously exhibit multiple types of activity. To remedy these limitations, we propose an annotation strategy that instead outputs a vector of chromatin state features at each position rather than a single discrete label. Continuous modeling is common in other fields, such as in topic modeling of text documents. We propose a method, epigenome-ssm-nonneg, that uses a non-negative state space model to efficiently annotate the genome with chromatin state features. We also propose several measures of the quality of a chromatin state feature annotation and we compare the performance of several alternative methods according to these quality measures. Results: We show that chromatin state features from epigenome-ssm-nonneg are more useful for several downstream applications than both continuous and discrete alternatives, including their ability to identify expressed genes and enhancers. Therefore, we expect that these continuous chromatin state features will be valuable reference annotations to be used in visualization and downstream analysis. Availability and implementation: Source code for epigenome-ssm is available at https://github.com/habibdanesh/epigenome-ssm and Zenodo (DOI: 10.5281/zenodo.6507585). Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Habib Daneshpajouh
- School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
| | - Bowen Chen
- School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
| | - Neda Shokraneh
- School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
| | - Shohre Masoumi
- School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
| | - Kay C Wiese
- School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
| | - Maxwell W Libbrecht
- School of Computing Science, Simon Fraser University, Burnaby, BC, V5A 1S6, Canada
|
18
|
Li H, Guan Y. Asymmetric predictive relationships across histone modifications. Nat Mach Intell 2022; 4:288-299. [DOI: 10.1038/s42256-022-00455-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
|
19
|
Wang Z, Chivu AG, Choate LA, Rice EJ, Miller DC, Chu T, Chou SP, Kingsley NB, Petersen JL, Finno CJ, Bellone RR, Antczak DF, Lis JT, Danko CG. Prediction of histone post-translational modification patterns based on nascent transcription data. Nat Genet 2022; 54:295-305. [PMID: 35273399 PMCID: PMC9444190 DOI: 10.1038/s41588-022-01026-x] [Citation(s) in RCA: 64] [Impact Index Per Article: 21.3] [Received: 01/16/2021] [Accepted: 01/24/2022] [Indexed: 01/01/2023]
Abstract
The role of histone modifications in transcription remains incompletely understood. Here, we examine the relationship between histone modifications and transcription using experimental perturbations combined with sensitive machine-learning tools. Transcription predicted the variation in active histone marks and complex chromatin states, like bivalent promoters, down to single-nucleosome resolution and at an accuracy that rivaled the correspondence between independent ChIP-seq experiments. Blocking transcription rapidly removed two punctate marks, H3K4me3 and H3K27ac, from chromatin indicating that transcription is required for active histone modifications. Transcription was also required for maintenance of H3K27me3, consistent with a role for RNA in recruiting PRC2. A subset of DNase-I-hypersensitive sites were refractory to prediction, precluding models where transcription initiates pervasively at any open chromatin. Our results, in combination with past literature, support a model in which active histone modifications serve a supportive, rather than an essential regulatory, role in transcription.
Affiliation(s)
- Zhong Wang
- Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA
- School of Software Technology, Dalian University of Technology, Dalian, China
| | - Alexandra G Chivu
- Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
| | - Lauren A Choate
- Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA
| | - Edward J Rice
- Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA
| | - Donald C Miller
- Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA
| | - Tinyi Chu
- Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA
| | - Shao-Pei Chou
- Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA
| | - Nicole B Kingsley
- Veterinary Genetics Laboratory, School of Veterinary Medicine, University of California, Davis, Davis, CA, USA
| | - Jessica L Petersen
- Department of Animal Science, University of Nebraska-Lincoln, Lincoln, NE, USA
| | - Carrie J Finno
- Department of Population Health and Reproduction, University of California, Davis, Davis, CA, USA
| | - Rebecca R Bellone
- Veterinary Genetics Laboratory, School of Veterinary Medicine, University of California, Davis, Davis, CA, USA
| | - Douglas F Antczak
- Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA
| | - John T Lis
- Department of Molecular Biology and Genetics, Cornell University, Ithaca, NY, USA
| | - Charles G Danko
- Baker Institute for Animal Health, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA.
- Department of Biomedical Sciences, College of Veterinary Medicine, Cornell University, Ithaca, NY, USA.
|
20
|
Ni Z, Zheng X, Zheng X, Zou X. scLRTD: A Novel Low Rank Tensor Decomposition Method for Imputing Missing Values in Single-Cell Multi-Omics Sequencing Data. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:1144-1153. [PMID: 32960767 DOI: 10.1109/tcbb.2020.3025804] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 06/11/2023]
Abstract
With the successful application of single-cell sequencing technology, a large number of single-cell multi-omics sequencing (scMO-seq) data have been generated, which enables researchers to study heterogeneity between individual cells. One prominent problem in single-cell data analysis is the prevalence of dropouts, caused by failures in amplification during the experiments. It is necessary to develop effective approaches for imputing the missing values. Unlike general methods that impute a single type of single-cell data, we propose an imputation method called scLRTD, using low-rank tensor decomposition based on nuclear norm to impute scMO-seq data and single-cell RNA-sequencing (scRNA-seq) data with different stages, tissues or conditions. Furthermore, four sets of simulated and two sets of real scRNA-seq data from mouse embryonic stem cells and hepatocellular carcinoma, respectively, are used to carry out numerical experiments and compared with six other published methods. Error accuracy and clustering results demonstrate the effectiveness of the proposed method. Moreover, we clearly identify two cell subpopulations after imputing the real scMO-seq data from hepatocellular carcinoma. Further, Gene Ontology identifies 7 genes in the Bile secretion pathway, which is related to metabolism in hepatocellular carcinoma. The survival analysis using the TCGA database also shows that the two cell subpopulations identified after imputation have distinct survival rates.
|
21
|
Arslan E, Schulz J, Rai K. Machine Learning in Epigenomics: Insights into Cancer Biology and Medicine. Biochim Biophys Acta Rev Cancer 2021; 1876:188588. [PMID: 34245839 PMCID: PMC8595561 DOI: 10.1016/j.bbcan.2021.188588] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/09/2021] [Revised: 05/29/2021] [Accepted: 07/02/2021] [Indexed: 02/01/2023]
Abstract
The recent deluge of genome-wide technologies for the mapping of the epigenome and resulting data in cancer samples has provided the opportunity for gaining insights into and understanding the roles of epigenetic processes in cancer. However, the complexity, high-dimensionality, sparsity, and noise associated with these data pose challenges for extensive integrative analyses. Machine Learning (ML) algorithms are particularly suited for epigenomic data analyses due to their flexibility and ability to learn underlying hidden structures. We will discuss four overlapping but distinct major categories under ML: dimensionality reduction, unsupervised methods, supervised methods, and deep learning (DL). We review the preferred use cases of these algorithms in analyses of cancer epigenomics data with the hope to provide an overview of how ML approaches can be used to explore fundamental questions on the roles of epigenome in cancer biology and medicine.
Affiliation(s)
- Emre Arslan
- Department of Genomic Medicine, MD Anderson Cancer Center, Houston, TX 77030, United States of America
| | - Jonathan Schulz
- Department of Genomic Medicine, MD Anderson Cancer Center, Houston, TX 77030, United States of America
| | - Kunal Rai
- Department of Genomic Medicine, MD Anderson Cancer Center, Houston, TX 77030, United States of America.
|
22
|
Morrow A, Hughes J, Singh J, Joseph A, Yosef N. Epitome: predicting epigenetic events in novel cell types with multi-cell deep ensemble learning. Nucleic Acids Res 2021; 49:e110. [PMID: 34379786 PMCID: PMC8565335 DOI: 10.1093/nar/gkab676] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/11/2021] [Revised: 07/19/2021] [Accepted: 07/25/2021] [Indexed: 01/04/2023] Open
Abstract
The accumulation of large epigenomics data consortiums provides us with the opportunity to extrapolate existing knowledge to new cell types and conditions. We propose Epitome, a deep neural network that learns similarities of chromatin accessibility between well characterized reference cell types and a query cellular context, and copies over signal of transcription factor binding and modification of histones from reference cell types when chromatin profiles are similar to the query. Epitome achieves state-of-the-art accuracy when predicting transcription factor binding sites on novel cellular contexts and can further improve predictions as more epigenetic signals are collected from both reference cell types and the query cellular context of interest.
Affiliation(s)
- Alyssa Kramer Morrow
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
| | - John Weston Hughes
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
- Computer Science Department, Stanford University, 353 Serra Mall, Stanford, CA 94305, USA
| | - Jahnavi Singh
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
| | - Anthony Douglas Joseph
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
- Center for Computational Biology, University of California-Berkeley 108 Stanley Hall, Berkeley, CA 94720-3220, USA
- Unite Genomics, Inc., 1301 Marina Village Pkwy, Suite 320, Alameda, CA 94501, USA
| | - Nir Yosef
- Electrical Engineering and Computer Science Department, University of California-Berkeley 465 Soda Hall, Berkeley, CA 94720-1776, USA
- Center for Computational Biology, University of California-Berkeley 108 Stanley Hall, Berkeley, CA 94720-3220, USA
- Ragon Institute of Massachusetts General Hospital, Massachusetts Institute of Technology, and Harvard University, Boston, MA, 02139, USA
- Chan Zuckerberg Biohub, San Francisco, CA, 94158, USA
|
23
|
Libbrecht MW, Chan RCW, Hoffman MM. Segmentation and genome annotation algorithms for identifying chromatin state and other genomic patterns. PLoS Comput Biol 2021; 17:e1009423. [PMID: 34648491 PMCID: PMC8516206 DOI: 10.1371/journal.pcbi.1009423] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Indexed: 11/19/2022] Open
Abstract
Segmentation and genome annotation (SAGA) algorithms are widely used to understand genome activity and gene regulation. These algorithms take as input epigenomic datasets, such as chromatin immunoprecipitation-sequencing (ChIP-seq) measurements of histone modifications or transcription factor binding. They partition the genome and assign a label to each segment such that positions with the same label exhibit similar patterns of input data. SAGA algorithms discover categories of activity such as promoters, enhancers, or parts of genes without prior knowledge of known genomic elements. In this sense, they generally act in an unsupervised fashion like clustering algorithms, but with the additional simultaneous function of segmenting the genome. Here, we review the common methodological framework that underlies these methods, review variants of and improvements upon this basic framework, and discuss the outlook for future work. This review is intended for those interested in applying SAGA methods and for computational researchers interested in improving upon them.
Affiliation(s)
- Rachel C. W. Chan
  - Department of Computer Science, University of Toronto, Toronto, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
- Michael M. Hoffman
  - Department of Computer Science, University of Toronto, Toronto, Canada
  - Princess Margaret Cancer Centre, University Health Network, Toronto, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, Canada
  - Vector Institute for Artificial Intelligence, Toronto, Canada

24
Zhou W, Ji H. Genome-wide Prediction of Chromatin Accessibility Based on Gene Expression. Wiley Interdiscip Rev Comput Stat 2021; 13:e1544. [PMID: 39391743 PMCID: PMC11466374 DOI: 10.1002/wics.1544] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 11/19/2019] [Accepted: 11/28/2020] [Indexed: 10/12/2024]
Abstract
Decoding gene regulation in a biological system requires information from both transcriptome and regulome. While multiple high-throughput transcriptome and regulome mapping technologies are available, transcriptome profiling is more widely used. Today, over a million bulk and single-cell gene expression samples are stored in public databases. This number is orders of magnitude larger than the number of available regulome samples. Most of the gene expression samples do not have corresponding regulome data. However, it is possible to obtain regulome information via prediction. Open chromatin is a hallmark of active regulatory elements. This mini-review discusses recent advances in predicting chromatin accessibility using gene expression data, including both the development of prediction methods and their applications in expanding the regulome catalog, improving regulome analysis, integrating transcriptome and regulome data, and facilitating single-cell analysis of gene regulation.
Affiliation(s)
- Weiqiang Zhou
  - Department of Biostatistics, Johns Hopkins University Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205, USA
- Hongkai Ji
  - Department of Biostatistics, Johns Hopkins University Bloomberg School of Public Health, 615 North Wolfe Street, Baltimore, MD 21205, USA

25
Nieboer MM, Nguyen L, de Ridder J. Predicting pathogenic non-coding SVs disrupting the 3D genome in 1646 whole cancer genomes using multiple instance learning. Sci Rep 2021; 11:14411. [PMID: 34257393 PMCID: PMC8277903 DOI: 10.1038/s41598-021-93917-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/02/2021] [Accepted: 07/01/2021] [Indexed: 11/21/2022] Open
Abstract
Over the past years, large consortia have been established to fuel the sequencing of whole genomes of many cancer patients. Despite the increased abundance in tools to study the impact of SNVs, non-coding SVs have been largely ignored in these data. Here, we introduce svMIL2, an improved version of our Multiple Instance Learning-based method to study the effect of somatic non-coding SVs disrupting boundaries of TADs and CTCF loops in 1646 cancer genomes. We demonstrate that svMIL2 predicts pathogenic non-coding SVs with an average AUC of 0.86 across 12 cancer types, and identifies non-coding SVs affecting well-known driver genes. The disruption of active (super) enhancers in open chromatin regions appears to be a common mechanism by which non-coding SVs exert their pathogenicity. Finally, our results reveal that the contribution of pathogenic non-coding SVs as opposed to driver SNVs may highly vary between cancers, with notably high numbers of genes being disrupted by pathogenic non-coding SVs in ovarian and pancreatic cancer. Taken together, our machine learning method offers a potent way to prioritize putatively pathogenic non-coding SVs and leverage non-coding SVs to identify driver genes. Moreover, our analysis of 1646 cancer genomes demonstrates the importance of including non-coding SVs in cancer diagnostics.
Affiliation(s)
- Marleen M Nieboer
  - Center for Molecular Medicine, University Medical Center Utrecht, 3584 CG, Utrecht, The Netherlands
  - Oncode Institute, Utrecht, The Netherlands
- Luan Nguyen
  - Center for Molecular Medicine, University Medical Center Utrecht, 3584 CG, Utrecht, The Netherlands
  - Oncode Institute, Utrecht, The Netherlands
- Jeroen de Ridder
  - Center for Molecular Medicine, University Medical Center Utrecht, 3584 CG, Utrecht, The Netherlands
  - Oncode Institute, Utrecht, The Netherlands

26
Bayat F, Libbrecht M. VSS: Variance-stabilized signals for sequencing-based genomic signals. Bioinformatics 2021; 37:4383-4391. [PMID: 34165492 PMCID: PMC8652025 DOI: 10.1093/bioinformatics/btab457] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/22/2020] [Revised: 04/28/2021] [Accepted: 06/17/2021] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: A sequencing-based genomic assay such as ChIP-seq outputs a real-valued signal for each position in the genome that measures the strength of activity at that position. Most genomic signals lack the property of variance stabilization. That is, a difference between 0 and 100 reads usually has a very different statistical importance from a difference between 1,000 and 1,100 reads. A statistical model such as a negative binomial distribution can account for this pattern, but learning these models is computationally challenging. Therefore, many applications - including imputation and segmentation and genome annotation (SAGA) - instead use Gaussian models and use a transformation such as log or inverse hyperbolic sine (asinh) to stabilize variance.
RESULTS: We show here that existing transformations do not fully stabilize variance in genomic data sets. To solve this issue, we propose VSS, a method that produces variance-stabilized signals for sequencing-based genomic signals. VSS learns the empirical relationship between the mean and variance of a given signal data set and produces transformed signals that normalize for this dependence. We show that VSS successfully stabilizes variance and that doing so improves downstream applications such as SAGA. VSS will eliminate the need for downstream methods to implement complex mean-variance relationship models, and will enable genomic signals to be easily understood by eye.
AVAILABILITY: https://github.com/faezeh-bayat/VSS.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
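The read-count example in the abstract above (0 vs. 100 reads compared with 1,000 vs. 1,100 reads) can be made concrete with the two standard transformations it names. The following is a minimal illustrative sketch, not the VSS method itself: it only shows how log and asinh compress the same raw difference depending on signal level. The helper name `transformed_gap` is hypothetical, and `log1p` (log of 1 + x) is an assumed stand-in for "log" so that a count of 0 is defined.

```python
import math

def transformed_gap(low, high, transform):
    """Return transform(high) - transform(low) for two read counts."""
    return transform(high) - transform(low)

# The two pairs of read counts from the abstract; both differ by 100 reads.
pairs = [(0, 100), (1000, 1100)]

for low, high in pairs:
    raw_gap = high - low
    asinh_gap = transformed_gap(low, high, math.asinh)
    # log1p computes log(1 + x), which keeps a count of 0 well defined.
    log_gap = transformed_gap(low, high, math.log1p)
    print(f"{low:>5} -> {high:>5}: raw={raw_gap}, "
          f"asinh={asinh_gap:.3f}, log1p={log_gap:.3f}")
```

After either transformation, the low-signal pair ends up much further apart than the high-signal pair, even though the raw differences are identical. Whether that compression matches the true mean-variance relationship of a given data set is the question the abstract raises.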
Affiliation(s)
- Faezeh Bayat
  - Department of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada
- Maxwell Libbrecht
  - Department of Computing Science, Simon Fraser University, Burnaby, BC V5A 1S6, Canada

27
Schreiber J, Singh R. Machine learning for profile prediction in genomics. Curr Opin Chem Biol 2021; 65:35-41. [PMID: 34107341 DOI: 10.1016/j.cbpa.2021.04.008] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 03/31/2021] [Revised: 04/21/2021] [Accepted: 04/24/2021] [Indexed: 02/08/2023]
Abstract
A recent deluge of publicly available multi-omics data has fueled the development of machine learning methods aimed at investigating important questions in genomics. Although the motivations for these methods vary, a task that is commonly adopted is that of profile prediction, where predictions are made for one or more forms of biochemical activity along the genome, for example, histone modification, chromatin accessibility, or protein binding. In this review, we give an overview of the research works performing profile prediction, define two broad categories of profile prediction tasks, and discuss the types of scientific questions that can be answered in each.
Affiliation(s)
- Ritambhara Singh
  - Department of Computer Science, Center for Computational Molecular Biology, Brown University, United States

28
Schreiber J, Bilmes J, Noble WS. Prioritizing transcriptomic and epigenomic experiments using an optimization strategy that leverages imputed data. Bioinformatics 2021; 37:439-447. [PMID: 32966546 PMCID: PMC8088321 DOI: 10.1093/bioinformatics/btaa830] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/17/2019] [Revised: 07/28/2020] [Accepted: 09/09/2020] [Indexed: 12/03/2022] Open
Abstract
Motivation: Successful science often involves not only performing experiments well, but also choosing well among many possible experiments. In a hypothesis generation setting, choosing an experiment well means choosing an experiment whose results are interesting or novel. In this work, we formalize this selection procedure in the context of genomics and epigenomics data generation. Specifically, we consider the task faced by a scientific consortium such as the National Institutes of Health ENCODE Consortium, whose goal is to characterize all of the functional elements in the human genome. Given a list of possible cell types or tissue types (‘biosamples’) and a list of possible high-throughput sequencing assays, where at least one experiment has been performed in each biosample and for each assay, we ask ‘Which experiments should ENCODE perform next?’
Results: We demonstrate how to represent this task as a submodular optimization problem, where the goal is to choose a panel of experiments that maximize the facility location function. A key aspect of our approach is that we use imputed data, rather than experimental data, to directly answer the posed question. We find that, across several evaluations, our method chooses a panel of experiments that span a diversity of biochemical activity. Finally, we propose two modifications of the facility location function, including a novel submodular–supermodular function, that allow incorporation of domain knowledge or constraints into the optimization procedure.
Availability and implementation: Our method is available as a Python package at https://github.com/jmschrei/kiwano and can be installed using the command pip install kiwano. The source code used here and the similarity matrix can be found at http://doi.org/10.5281/zenodo.3708538.
Supplementary information: Supplementary data are available at Bioinformatics online.
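The facility location objective mentioned in the abstract above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the kiwano package's API: the function names `facility_location` and `greedy_select` and the 4-by-4 similarity matrix are hypothetical, and similarities are assumed to be non-negative. The objective scores a panel by summing, over every candidate experiment, its similarity to the closest selected experiment. Greedy selection repeatedly adds the experiment with the largest gain.

```python
def facility_location(similarity, selected):
    """Sum over all items of their best similarity to any selected item."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in similarity)

def greedy_select(similarity, k):
    """Greedily pick up to k items that maximize the facility location value.

    Assumes a square matrix of non-negative similarities.
    """
    n = len(similarity)
    selected = []
    for _ in range(min(k, n)):
        current = facility_location(similarity, selected)
        best_item, best_gain = None, -1.0
        for j in range(n):
            if j in selected:
                continue
            gain = facility_location(similarity, selected + [j]) - current
            if gain > best_gain:
                best_item, best_gain = j, gain
        selected.append(best_item)
    return selected

# Hypothetical similarities: experiments 0 and 1 resemble each other,
# as do experiments 2 and 3.
toy_similarity = [
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.1],
    [0.1, 0.2, 1.0, 0.5],
    [0.0, 0.1, 0.5, 1.0],
]

panel = greedy_select(toy_similarity, k=2)
print(panel, facility_location(toy_similarity, panel))
```

On this toy matrix the greedy panel takes one experiment from each similar pair, because adding a second experiment from an already covered pair yields little gain. That diminishing-returns behavior is what makes the objective submodular.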

29
Nakato R, Sakata T. Methods for ChIP-seq analysis: A practical workflow and advanced applications. Methods 2021; 187:44-53. [PMID: 32240773 DOI: 10.1016/j.ymeth.2020.03.005] [Citation(s) in RCA: 120] [Impact Index Per Article: 30.0] [Received: 02/18/2020] [Revised: 03/17/2020] [Accepted: 03/18/2020] [Indexed: 12/13/2022] Open
Abstract
Chromatin immunoprecipitation followed by sequencing (ChIP-seq) is a central method in epigenomic research. Genome-wide analysis of histone modifications, such as enhancer analysis and genome-wide chromatin state annotation, enables systematic analysis of how the epigenomic landscape contributes to cell identity, development, lineage specification, and disease. In this review, we first present a typical ChIP-seq analysis workflow, from quality assessment to chromatin-state annotation. We focus on practical, rather than theoretical, approaches for biological studies. Next, we outline various advanced ChIP-seq applications and introduce several state-of-the-art methods, including prediction of gene expression level and chromatin loops from epigenome data and data imputation. Finally, we discuss recently developed single-cell ChIP-seq analysis methodologies that elucidate the cellular diversity within complex tissues and cancers.
Affiliation(s)
- Ryuichiro Nakato
  - Laboratory of Computational Genomics, Institute for Quantitative Biosciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan
- Toyonori Sakata
  - Laboratory of Genome Structure and Function, Institute for Quantitative Biosciences, The University of Tokyo, 1-1-1 Yayoi, Bunkyo-ku, Tokyo 113-0032, Japan

30
Pei G, Hu R, Dai Y, Manuel AM, Zhao Z, Jia P. Predicting regulatory variants using a dense epigenomic mapped CNN model elucidated the molecular basis of trait-tissue associations. Nucleic Acids Res 2021; 49:53-66. [PMID: 33300042 PMCID: PMC7797043 DOI: 10.1093/nar/gkaa1137] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 07/30/2020] [Revised: 10/22/2020] [Accepted: 12/08/2020] [Indexed: 02/06/2023] Open
Abstract
Assessing the causal tissues of human complex diseases is important for the prioritization of trait-associated genetic variants. Yet, the biological underpinnings of trait-associated variants are extremely difficult to infer due to statistical noise in genome-wide association studies (GWAS), and because >90% of genetic variants from GWAS are located in non-coding regions. Here, we collected the largest human epigenomic map from ENCODE and Roadmap consortia and implemented a deep-learning-based convolutional neural network (CNN) model to predict the regulatory roles of genetic variants across a comprehensive list of epigenomic modifications. Our model, called DeepFun, was built on DNA accessibility maps, histone modification marks, and transcription factors. DeepFun can systematically assess the impact of non-coding variants in the most functional elements with tissue or cell-type specificity, even for rare variants or de novo mutations. By applying this model, we prioritized trait-associated loci for 51 publicly-available GWAS studies. We demonstrated that CNN-based analyses on dense and high-resolution epigenomic annotations can refine important GWAS associations in order to identify regulatory loci from background signals, which yield novel insights for better understanding the molecular basis of human complex disease. We anticipate our approaches will become routine in GWAS downstream analysis and non-coding variant evaluation.
Affiliation(s)
- Guangsheng Pei
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Ruifeng Hu
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Yulin Dai
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Astrid Marilyn Manuel
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
  - MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Peilin Jia
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA

31
Pei G, Wang YY, Simon LM, Dai Y, Zhao Z, Jia P. Gene expression imputation and cell-type deconvolution in human brain with spatiotemporal precision and its implications for brain-related disorders. Genome Res 2020; 31:146-158. [PMID: 33272935 PMCID: PMC7849392 DOI: 10.1101/gr.265769.120] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 05/07/2020] [Accepted: 11/25/2020] [Indexed: 12/30/2022]
Abstract
As the most complex organ of the human body, the brain is composed of diverse regions, each consisting of distinct cell types and their respective cellular interactions. Human brain development involves a finely tuned cascade of interactive events. These include spatiotemporal gene expression changes and dynamic alterations in cell-type composition. However, our understanding of this process is still largely incomplete owing to the difficulty of brain spatiotemporal transcriptome collection. In this study, we developed a tensor-based approach to impute gene expression on a transcriptome-wide level. After rigorous computational benchmarking, we applied our approach to infer missing data points in the widely used BrainSpan resource and completed the entire grid of spatiotemporal transcriptomics. Next, we conducted deconvolutional analyses to comprehensively characterize major cell-type dynamics across the entire BrainSpan resource to estimate the cellular temporal changes and distinct neocortical areas across development. Moreover, integration of these results with GWAS summary statistics for 13 brain-associated traits revealed multiple novel trait–cell-type associations and trait-spatiotemporal relationships. In summary, our imputed BrainSpan transcriptomic data provide a valuable resource for the research community and our findings help further studies of the transcriptional and cellular dynamics of the human brain and related diseases.
Affiliation(s)
- Guangsheng Pei
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA
- Yin-Ying Wang
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA
- Lukas M Simon
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA
- Yulin Dai
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA
- Zhongming Zhao
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA
  - Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA
  - MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, Texas 77030, USA
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee 37203, USA
- Peilin Jia
  - Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas 77030, USA

32
Schreiber J, Singh R, Bilmes J, Noble WS. A pitfall for machine learning methods aiming to predict across cell types. Genome Biol 2020; 21:282. [PMID: 33213499 PMCID: PMC7678316 DOI: 10.1186/s13059-020-02177-y] [Citation(s) in RCA: 30] [Impact Index Per Article: 6.0] [Received: 04/18/2020] [Accepted: 10/07/2020] [Indexed: 01/19/2023] Open
Abstract
Machine learning models that predict genomic activity are most useful when they make accurate predictions across cell types. Here, we show that when the training and test sets contain the same genomic loci, the resulting model may falsely appear to perform well by effectively memorizing the average activity associated with each locus across the training cell types. We demonstrate this phenomenon in the context of predicting gene expression and chromatin domain boundaries, and we suggest methods to diagnose and avoid the pitfall. We anticipate that, as more data becomes available, future projects will increasingly risk suffering from this issue.
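One way to check for the effect described in the abstract above is to compare any model against a baseline that simply predicts, for each locus, the average activity across the training cell types. The sketch below is an illustration under assumptions, not necessarily the authors' diagnostic procedure: the function names and the toy activity values are hypothetical.

```python
from statistics import mean

def average_activity_baseline(training_profiles):
    """Predict each locus as its mean activity across training cell types.

    training_profiles maps a cell type name to a list of per-locus activity
    values; all lists are assumed to cover the same loci in the same order.
    """
    profiles = list(training_profiles.values())
    return [mean(values) for values in zip(*profiles)]

def mean_squared_error(predicted, observed):
    """Mean squared difference between two equal-length lists."""
    return mean((p - o) ** 2 for p, o in zip(predicted, observed))

# Hypothetical activity at three loci in two training cell types.
training = {
    "cell_type_A": [1.0, 5.0, 0.0],
    "cell_type_B": [3.0, 7.0, 0.0],
}
# Hypothetical held-out cell type measured at the same three loci.
held_out = [2.0, 6.5, 0.5]

baseline = average_activity_baseline(training)
print(baseline, mean_squared_error(baseline, held_out))
```

If a trained model's error on the held-out cell type is no better than this baseline's, its apparent accuracy may reflect memorized per-locus averages rather than cell-type-specific signal.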
Affiliation(s)
- Jacob Schreiber
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA
- Ritambhara Singh
  - Department of Genome Science, University of Washington, Seattle, USA
  - Current Affiliation: Department of Computer Science, and Center for Computational Molecular Biology, Brown University, Providence, 02906, RI, United States
- Jeffrey Bilmes
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA
  - Department of Electrical & Computer Engineering, University of Washington, Seattle, USA
- William Stafford Noble
  - Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, USA
  - Department of Genome Science, University of Washington, Seattle, USA

33
Song M, Greenbaum J, Luttrell J, Zhou W, Wu C, Shen H, Gong P, Zhang C, Deng HW. A Review of Integrative Imputation for Multi-Omics Datasets. Front Genet 2020; 11:570255. [PMID: 33193667 PMCID: PMC7594632 DOI: 10.3389/fgene.2020.570255] [Citation(s) in RCA: 60] [Impact Index Per Article: 12.0] [Received: 06/07/2020] [Accepted: 09/16/2020] [Indexed: 01/05/2023] Open
Abstract
Multi-omics studies, which explore the interactions between multiple types of biological factors, have significant advantages over single-omics analysis for their ability to provide a more holistic view of biological processes, uncover the causal and functional mechanisms for complex diseases, and facilitate new discoveries in precision medicine. However, omics datasets often contain missing values, and in multi-omics study designs it is common for individuals to be represented for some omics layers but not all. Since most statistical analyses cannot be applied directly to the incomplete datasets, imputation is typically performed to infer the missing values. Integrative imputation techniques which make use of the correlations and shared information among multi-omics datasets are expected to outperform approaches that rely on single-omics information alone, resulting in more accurate results for the subsequent downstream analyses. In this review, we provide an overview of the currently available imputation methods for handling missing values in bioinformatics data with an emphasis on multi-omics imputation. In addition, we also provide a perspective on how deep learning methods might be developed for the integrative imputation of multi-omics datasets.
Affiliation(s)
- Meng Song
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, United States
- Jonathan Greenbaum
  - Tulane Center of Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA, United States
- Joseph Luttrell
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, United States
- Weihua Zhou
  - College of Computing, Michigan Technological University, Houghton, MI, United States
- Chong Wu
  - Department of Statistics, Florida State University, Tallahassee, FL, United States
- Hui Shen
  - Tulane Center of Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA, United States
- Ping Gong
  - Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS, United States
- Chaoyang Zhang
  - School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS, United States
- Hong-Wen Deng
  - Tulane Center of Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA, United States

34
Sahu A, Li N, Dunkel I, Chung HR. EPIGENE: genome-wide transcription unit annotation using a multivariate probabilistic model of histone modifications. Epigenetics Chromatin 2020; 13:20. [PMID: 32264931 PMCID: PMC7137282 DOI: 10.1186/s13072-020-00341-z] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 11/25/2019] [Accepted: 03/28/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Understanding the transcriptome is critical for explaining the functional as well as regulatory roles of genomic regions. Current methods for the identification of transcription units (TUs) use RNA-seq that, however, require large quantities of mRNA rendering the identification of inherently unstable TUs, e.g. miRNA precursors, difficult. This problem can be alleviated by chromatin-based approaches due to a correlation between histone modifications and transcription.
RESULTS: Here, we introduce EPIGENE, a novel chromatin segmentation method for the identification of active TUs using transcription-associated histone modifications. Unlike the existing chromatin segmentation approaches, EPIGENE uses a constrained, semi-supervised multivariate hidden Markov model (HMM) that models the observed combination of histone modifications using a product of independent Bernoulli random variables, to identify active TUs. Our results show that EPIGENE can identify genome-wide TUs in an unbiased manner. EPIGENE-predicted TUs show an enrichment of RNA Polymerase II at the transcription start site and in gene body indicating that they are indeed transcribed. Comprehensive validation using existing annotations revealed that 93% of EPIGENE TUs can be explained by existing gene annotations and 5% of EPIGENE TUs in HepG2 can be explained by microRNA annotations. EPIGENE outperformed the existing RNA-seq-based approaches in TU prediction precision across human cell lines. Finally, we identified 232 novel TUs in K562 and 43 novel cell-specific TUs all of which were supported by RNA Polymerase II ChIP-seq and Nascent RNA-seq data.
CONCLUSION: We demonstrate the applicability of EPIGENE to identify genome-wide active TUs and to provide valuable information about unannotated TUs. EPIGENE is an open-source method and is freely available at: https://github.com/imbbLab/EPIGENE.
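The emission model named in the abstract above, a product of independent Bernoulli random variables over histone modifications, can be written out directly. The sketch below covers only that emission term for a single hidden state and is not EPIGENE's implementation: the function name, the two-mark example, and the probability values are hypothetical, and each per-mark probability is assumed to lie strictly between 0 and 1.

```python
import math
from itertools import product

def bernoulli_emission_log_prob(observed, mark_probs):
    """Log-probability of a binary mark vector under independent Bernoullis.

    observed   -- sequence of 0/1 values, one per histone modification
    mark_probs -- per-mark probability of observing a 1 in this state,
                  each assumed to be strictly between 0 and 1
    """
    log_prob = 0.0
    for present, prob in zip(observed, mark_probs):
        log_prob += math.log(prob if present else 1.0 - prob)
    return log_prob

# Hypothetical state in which the first mark is usually present (0.9)
# and the second mark is usually absent (0.2).
state_probs = [0.9, 0.2]

for pattern in product([0, 1], repeat=len(state_probs)):
    prob = math.exp(bernoulli_emission_log_prob(pattern, state_probs))
    print(pattern, round(prob, 4))
```

Because the marks are treated as independent given the state, the probabilities over all presence/absence patterns sum to one. A full HMM would combine these emission terms with transition probabilities between states.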
Affiliation(s)
- Anshupa Sahu
  - Institute for Medical Bioinformatics and Biostatistics, Philipps University of Marburg, 35037, Marburg, Germany
  - Otto-Warburg-Laboratory, Max Planck Institute for Molecular Genetics, 14195, Berlin, Germany
- Na Li
  - Otto-Warburg-Laboratory, Max Planck Institute for Molecular Genetics, 14195, Berlin, Germany
  - Guangzhou Institute of Pediatrics, Guangzhou Women and Children's Medical Center, Guangzhou, 510623, China
- Ilona Dunkel
  - Otto-Warburg-Laboratory, Max Planck Institute for Molecular Genetics, 14195, Berlin, Germany
- Ho-Ryun Chung
  - Institute for Medical Bioinformatics and Biostatistics, Philipps University of Marburg, 35037, Marburg, Germany
  - Otto-Warburg-Laboratory, Max Planck Institute for Molecular Genetics, 14195, Berlin, Germany

35
Schreiber J, Bilmes J, Noble WS. Completing the ENCODE3 compendium yields accurate imputations across a variety of assays and human biosamples. Genome Biol 2020; 21:82. [PMID: 32228713 PMCID: PMC7104481 DOI: 10.1186/s13059-020-01978-5] [Citation(s) in RCA: 16] [Impact Index Per Article: 3.2] [Received: 05/28/2019] [Accepted: 02/26/2020] [Indexed: 12/16/2022] Open
Abstract
Recent efforts to describe the human epigenome have yielded thousands of epigenomic and transcriptomic datasets. However, due primarily to cost, the total number of such assays that can be performed is limited. Accordingly, we applied an imputation approach, Avocado, to a dataset of 3814 tracks of data derived from the ENCODE compendium, including measurements of chromatin accessibility, histone modification, transcription, and protein binding. Avocado shows significant improvements in imputing protein binding compared to the top models in the ENCODE-DREAM challenge. Additionally, we show that the Avocado model allows for efficient addition of new assays and biosamples to a pre-trained model.
Affiliation(s)
- Jacob Schreiber
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
- Jeffrey Bilmes
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
  - Department of Electrical Engineering, University of Washington, Seattle, USA
- William Stafford Noble
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
  - Department of Genome Sciences, University of Washington, Seattle, USA

36
Schreiber J, Durham T, Bilmes J, Noble WS. Avocado: a multi-scale deep tensor factorization method learns a latent representation of the human epigenome. Genome Biol 2020; 21:81. [PMID: 32228704 PMCID: PMC7104480 DOI: 10.1186/s13059-020-01977-6] [Citation(s) in RCA: 63] [Impact Index Per Article: 12.6] [Received: 05/14/2019] [Accepted: 02/26/2020] [Indexed: 02/08/2023] Open
Abstract
The human epigenome has been experimentally characterized by thousands of measurements for every basepair in the human genome. We propose a deep neural network tensor factorization method, Avocado, that compresses this epigenomic data into a dense, information-rich representation. We use this learned representation to impute epigenomic data more accurately than previous methods, and we show that machine learning models that exploit this representation outperform those trained directly on epigenomic data on a variety of genomics tasks. These tasks include predicting gene expression, promoter-enhancer interactions, replication timing, and an element of 3D chromatin architecture.
Affiliation(s)
- Jacob Schreiber
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
- Timothy Durham
  - Department of Genome Sciences, University of Washington, Seattle, USA
- Jeffrey Bilmes
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
  - Department of Electrical Engineering, University of Washington, Seattle, USA
- William Stafford Noble
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, USA
  - Department of Genome Sciences, University of Washington, Seattle, USA

37
Zhang S, Chasman D, Knaack S, Roy S. In silico prediction of high-resolution Hi-C interaction matrices. Nat Commun 2019; 10:5449. [PMID: 31811132 PMCID: PMC6898380 DOI: 10.1038/s41467-019-13423-8] [Citation(s) in RCA: 48] [Impact Index Per Article: 8.0] [Received: 08/31/2018] [Accepted: 11/07/2019] [Indexed: 11/28/2022] Open
Abstract
The three-dimensional (3D) organization of the genome plays an important role in gene regulation bringing distal sequence elements in 3D proximity to genes hundreds of kilobases away. Hi-C is a powerful genome-wide technique to study 3D genome organization. Owing to experimental costs, high resolution Hi-C datasets are limited to a few cell lines. Computational prediction of Hi-C counts can offer a scalable and inexpensive approach to examine 3D genome organization across multiple cellular contexts. Here we present HiC-Reg, an approach to predict contact counts from one-dimensional regulatory signals. HiC-Reg predictions identify topologically associating domains and significant interactions that are enriched for CCCTC-binding factor (CTCF) bidirectional motifs and interactions identified from complementary sources. CTCF and chromatin marks, especially repressive and elongation marks, are most important for HiC-Reg’s predictive performance. Taken together, HiC-Reg provides a powerful framework to generate high-resolution profiles of contact counts that can be used to study individual locus level interactions and higher-order organizational units of the genome. Existing computational approaches to predict long-range regulatory interactions do not fully exploit high-resolution Hi-C datasets. Here the authors present a Random Forests regression-based approach to predict high-resolution Hi-C counts using one-dimensional regulatory genomic signals.
Affiliation(s)
- Shilu Zhang
- Wisconsin Institute for Discovery, 330 North Orchard Street, Madison, WI, 53715, USA
| | - Deborah Chasman
- Wisconsin Institute for Discovery, 330 North Orchard Street, Madison, WI, 53715, USA
| | - Sara Knaack
- Wisconsin Institute for Discovery, 330 North Orchard Street, Madison, WI, 53715, USA
| | - Sushmita Roy
- Wisconsin Institute for Discovery, 330 North Orchard Street, Madison, WI, 53715, USA.
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, 53715, USA.
| |
|
38
|
Zhang Y, Mahony S. Direct prediction of regulatory elements from partial data without imputation. PLoS Comput Biol 2019; 15:e1007399. [PMID: 31682602 PMCID: PMC6855516 DOI: 10.1371/journal.pcbi.1007399] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 05/23/2019] [Revised: 11/14/2019] [Accepted: 09/12/2019] [Indexed: 01/07/2023] Open
Abstract
Genome segmentation approaches allow us to characterize regulatory states in a given cell type using combinatorial patterns of histone modifications and other regulatory signals. In order to analyze regulatory state differences across cell types, current genome segmentation approaches typically require that the same regulatory genomics assays have been performed in all analyzed cell types. This necessarily limits both the numbers of cell types that can be analyzed and the complexity of the resulting regulatory states, as only a small number of histone modifications have been profiled across many cell types. Data imputation approaches that aim to estimate missing regulatory signals have been applied before genome segmentation. However, this approach is computationally costly and propagates any errors in imputation to produce incorrect genome segmentation results downstream. We present an extension to the IDEAS genome segmentation platform which can perform genome segmentation on incomplete regulatory genomics dataset collections without using imputation. Instead of relying on imputed data, we use an expectation-maximization approach to estimate marginal density functions within each regulatory state. We demonstrate that our genome segmentation results compare favorably with approaches based on imputation or other strategies for handling missing data. We further show that our approach can accurately impute missing data after genome segmentation, reversing the typical order of imputation/genome segmentation pipelines. Finally, we present a new 2D genome segmentation analysis of 127 human cell types studied by the Roadmap Epigenomics Consortium. By using an expanded set of chromatin marks that have been profiled in subsets of these cell types, our new segmentation results capture a more complex picture of combinatorial regulatory patterns that appear on the human genome.
Affiliation(s)
- Yu Zhang
- Department of Statistics, Penn State University, University Park, Pennsylvania, United States of America
| | - Shaun Mahony
- Department of Biochemistry & Molecular Biology and Center for Eukaryotic Gene Regulation, Penn State University, University Park, Pennsylvania, United States of America
| |
|
39
|
Zitnik M, Nguyen F, Wang B, Leskovec J, Goldenberg A, Hoffman MM. Machine Learning for Integrating Data in Biology and Medicine: Principles, Practice, and Opportunities. Inf Fusion 2019; 50:71-91. [PMID: 30467459 PMCID: PMC6242341 DOI: 10.1016/j.inffus.2018.09.012] [Citation(s) in RCA: 262] [Impact Index Per Article: 43.7] [Indexed: 05/10/2023]
Abstract
New technologies have enabled the investigation of biology and human health at an unprecedented scale and in multiple dimensions. These dimensions include myriad properties describing genome, epigenome, transcriptome, microbiome, phenotype, and lifestyle. No single data type, however, can capture the complexity of all the factors relevant to understanding a phenomenon such as a disease. Integrative methods that combine data from multiple technologies have thus emerged as critical statistical and computational approaches. The key challenge in developing such approaches is the identification of effective models to provide a comprehensive and relevant systems view. An ideal method can answer a biological or medical question, identifying important features and predicting outcomes, by harnessing heterogeneous data across several dimensions of biological variation. In this Review, we describe the principles of data integration and discuss current methods and available implementations. We provide examples of successful data integration in biology and medicine. Finally, we discuss current challenges in biomedical integrative methods and our perspective on the future development of the field.
Affiliation(s)
- Marinka Zitnik
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Francis Nguyen
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Princess Margaret Cancer Centre, Toronto, ON, Canada
| | - Bo Wang
- Hikvision Research Institute, Santa Clara, CA, USA
| | - Jure Leskovec
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
| | - Anna Goldenberg
- Genetics & Genome Biology, SickKids Research Institute, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
| | - Michael M. Hoffman
- Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Princess Margaret Cancer Centre, Toronto, ON, Canada
- Department of Computer Science, University of Toronto, Toronto, ON, Canada
- Vector Institute, Toronto, ON, Canada
| |
|
40
|
Nair S, Kim DS, Perricone J, Kundaje A. Integrating regulatory DNA sequence and gene expression to predict genome-wide chromatin accessibility across cellular contexts. Bioinformatics 2019; 35:i108-i116. [PMID: 31510655 PMCID: PMC6612838 DOI: 10.1093/bioinformatics/btz352] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Indexed: 01/30/2023] Open
Abstract
MOTIVATION: Genome-wide profiles of chromatin accessibility and gene expression in diverse cellular contexts are critical to decipher the dynamics of transcriptional regulation. Recently, convolutional neural networks have been used to learn predictive cis-regulatory DNA sequence models of context-specific chromatin accessibility landscapes. However, these context-specific regulatory sequence models cannot generalize predictions across cell types. RESULTS: We introduce multi-modal, residual neural network architectures that integrate cis-regulatory sequence and context-specific expression of trans-regulators to predict genome-wide chromatin accessibility profiles across cellular contexts. We show that the average accessibility of a genomic region across training contexts can be a surprisingly powerful predictor. We leverage this feature and employ novel strategies for training models to enhance genome-wide prediction of shared and context-specific chromatin accessible sites across cell types. We interpret the models to reveal insights into cis- and trans-regulation of chromatin dynamics across 123 diverse cellular contexts. AVAILABILITY AND IMPLEMENTATION: The code is available at https://github.com/kundajelab/ChromDragoNN. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Surag Nair
- Department of Computer Science, Stanford University, Stanford, CA, USA
| | - Daniel S Kim
- Department of Biomedical Data Sciences, Stanford University, Stanford, CA, USA
| | - Jacob Perricone
- Institute for Computational and Mathematical Engineering, Stanford University, Stanford, CA, USA
| | - Anshul Kundaje
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
| |
|
41
|
Keilwagen J, Posch S, Grau J. Accurate prediction of cell type-specific transcription factor binding. Genome Biol 2019; 20:9. [PMID: 30630522 PMCID: PMC6327544 DOI: 10.1186/s13059-018-1614-y] [Citation(s) in RCA: 56] [Impact Index Per Article: 9.3] [Received: 07/25/2018] [Accepted: 12/18/2018] [Indexed: 01/11/2023] Open
Abstract
Prediction of cell type-specific, in vivo transcription factor binding sites is one of the central challenges in regulatory genomics. Here, we present our approach that earned a shared first rank in the "ENCODE-DREAM in vivo Transcription Factor Binding Site Prediction Challenge" in 2017. In post-challenge analyses, we benchmark the influence of different feature sets and find that chromatin accessibility and binding motifs are sufficient to yield state-of-the-art performance. Finally, we provide 682 lists of predicted peaks for a total of 31 transcription factors in 22 primary cell types and tissues and a user-friendly version of our approach, Catchitt, for download.
Affiliation(s)
- Jens Keilwagen
- Institute for Biosafety in Plant Biotechnology, Julius Kühn-Institut (JKI) - Federal Research Centre for Cultivated Plants, Erwin-Baur-Straße 27, Quedlinburg, 06484 Germany
| | - Stefan Posch
- Institute of Computer Science, Martin Luther University Halle–Wittenberg, Von-Seckendorff-Platz 1, Halle (Saale), 06120 Germany
| | - Jan Grau
- Institute of Computer Science, Martin Luther University Halle–Wittenberg, Von-Seckendorff-Platz 1, Halle (Saale), 06120 Germany
| |
|
42
|
Stein-O'Brien GL, Arora R, Culhane AC, Favorov AV, Garmire LX, Greene CS, Goff LA, Li Y, Ngom A, Ochs MF, Xu Y, Fertig EJ. Enter the Matrix: Factorization Uncovers Knowledge from Omics. Trends Genet 2018; 34:790-805. [PMID: 30143323 PMCID: PMC6309559 DOI: 10.1016/j.tig.2018.07.003] [Citation(s) in RCA: 132] [Impact Index Per Article: 18.9] [Received: 03/23/2018] [Revised: 06/01/2018] [Accepted: 07/16/2018] [Indexed: 12/20/2022]
Abstract
Omics data contain signals from the molecular, physical, and kinetic inter- and intracellular interactions that control biological systems. Matrix factorization (MF) techniques can reveal low-dimensional structure from high-dimensional data that reflect these interactions. These techniques can uncover new biological knowledge from diverse high-throughput omics data in applications ranging from pathway discovery to timecourse analysis. We review exemplary applications of MF for systems-level analyses. We discuss appropriate applications of these methods, their limitations, and focus on the analysis of results to facilitate optimal biological interpretation. The inference of biologically relevant features with MF enables discovery from high-throughput data beyond the limits of current biological knowledge - answering questions from high-dimensional data that we have not yet thought to ask.
Affiliation(s)
- Genevieve L Stein-O'Brien
- Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA; Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, USA; McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
| | - Raman Arora
- Department of Computer Science, Institute for Data Intensive Engineering and Science, Johns Hopkins University, Baltimore, MD, USA
| | - Aedin C Culhane
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA; Department of Biostatistics, Harvard TH Chan School of Public Health, Boston, MA, USA
| | - Alexander V Favorov
- Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA; Vavilov Institute of General Genetics, Moscow, Russia
| | | | - Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, PA, USA; Childhood Cancer Data Lab, Alex's Lemonade Stand Foundation, PA, USA
| | - Loyal A Goff
- Department of Neuroscience, Johns Hopkins School of Medicine, Baltimore, MD, USA; McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, USA
| | - Yifeng Li
- Digital Technologies Research Centre, National Research Council of Canada, Ottawa, ON, Canada
| | - Aloune Ngom
- School of Computer Science, University of Windsor, Windsor, ON, Canada
| | - Michael F Ochs
- Department of Mathematics and Statistics, The College of New Jersey, Ewing, NJ, USA
| | - Yanxun Xu
- Department of Applied Mathematics and Statistics, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Elana J Fertig
- Department of Oncology, Division of Biostatistics and Bioinformatics, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD, USA.
| |
|