1
Kotlov N, Shaposhnikov K, Tazearslan C, Chasse M, Baisangurov A, Podsvirova S, Fernandez D, Abdou M, Kaneunyenye L, Morgan K, Cheremushkin I, Zemskiy P, Chelushkin M, Sorokina M, Belova E, Khorkova S, Lozinsky Y, Nuzhdina K, Vasileva E, Kravchenko D, Suryamohan K, Nomie K, Curran J, Fowler N, Bagaev A. Procrustes is a machine-learning approach that removes cross-platform batch effects from clinical RNA sequencing data. Commun Biol 2024; 7:392. [PMID: 38555407 PMCID: PMC10981711 DOI: 10.1038/s42003-024-06020-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/24/2023] [Accepted: 03/06/2024] [Indexed: 04/02/2024] Open
Abstract
With the increased use of gene expression profiling for personalized oncology, optimized RNA sequencing (RNA-seq) protocols and algorithms are necessary to provide comparable expression measurements between exome capture (EC)-based and poly-A RNA-seq. Here, we developed and optimized an EC-based protocol for processing formalin-fixed, paraffin-embedded samples and a machine-learning algorithm, Procrustes, to overcome batch effects across RNA-seq data obtained using different sample preparation protocols like EC-based or poly-A RNA-seq protocols. Applying Procrustes to samples processed using EC and poly-A RNA-seq protocols showed the expression of 61% of genes (N = 20,062) to correlate across both protocols (concordance correlation coefficient > 0.8, versus 26% before transformation by Procrustes), including 84% of cancer-specific and cancer microenvironment-related genes (versus 36% before applying Procrustes; N = 1,438). Benchmarking analyses also showed Procrustes to outperform other batch correction methods. Finally, we showed that Procrustes can project RNA-seq data for a single sample to a larger cohort of RNA-seq data. Future application of Procrustes will enable direct gene expression analysis for single tumor samples to support gene expression-based treatment decisions.
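The agreement metric named in this abstract, the concordance correlation coefficient (CCC), can be illustrated with a minimal sketch. It assumes Lin's standard definition with population (1/n) variances; the function name `concordance_correlation` and the toy values are hypothetical and are not taken from the Procrustes code or data.

```python
from statistics import fmean


def concordance_correlation(x, y):
    """Lin's concordance correlation coefficient for two paired samples.

    Assumption: population (1/n) variance and covariance, as in Lin's
    original definition. Illustrative helper, not the paper's code.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must be paired and have at least 2 values")
    n = len(x)
    mean_x, mean_y = fmean(x), fmean(y)
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    cov_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    denom = var_x + var_y + (mean_x - mean_y) ** 2
    if denom == 0:
        raise ValueError("CCC is undefined for two identical constant samples")
    # Unlike Pearson correlation, the squared mean difference in the
    # denominator penalizes systematic shifts between the two protocols.
    return 2 * cov_xy / denom


# Hypothetical expression values for one gene measured with two protocols.
ec_based = [1.0, 2.0, 3.0, 4.0]
poly_a = [2.0, 3.0, 4.0, 5.0]  # perfectly correlated, but shifted by 1
ccc = concordance_correlation(ec_based, poly_a)
print(f"CCC = {ccc:.3f}, passes 0.8 threshold: {ccc > 0.8}")
```

In this toy case the two profiles have a Pearson correlation of 1, yet the constant offset pulls the CCC below the 0.8 threshold used in the abstract, which is the kind of platform shift a batch-correction method aims to remove.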
Affiliation(s)
- Mary Abdou
- BostonGene, Corp., Waltham, MA, 02453, USA
2
Hou L, Xiong X, Park Y, Boix C, James B, Sun N, He L, Patel A, Zhang Z, Molinie B, Van Wittenberghe N, Steelman S, Nusbaum C, Aguet F, Ardlie KG, Kellis M. Multitissue H3K27ac profiling of GTEx samples links epigenomic variation to disease. Nat Genet 2023; 55:1665-1676. [PMID: 37770633 PMCID: PMC10562256 DOI: 10.1038/s41588-023-01509-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/25/2022] [Accepted: 08/22/2023] [Indexed: 09/30/2023]
Abstract
Genetic variants associated with complex traits are primarily noncoding, and their effects on gene-regulatory activity remain largely uncharacterized. To address this, we profile epigenomic variation of histone mark H3K27ac across 387 brain, heart, muscle and lung samples from Genotype-Tissue Expression (GTEx). We annotate 282 k active regulatory elements (AREs) with tissue-specific activity patterns. We identify 2,436 sex-biased AREs and 5,397 genetically influenced AREs associated with 130 k genetic variants (haQTLs) across tissues. We integrate genetic and epigenomic variation to provide mechanistic insights for disease-associated loci from 55 genome-wide association studies (GWAS), by revealing candidate tissues of action, driver SNPs and impacted AREs. Lastly, we build ARE-gene linking scores based on genetics (gLink scores) and demonstrate their unique ability to prioritize SNP-ARE-gene circuits. Overall, our epigenomic datasets, computational integration and mechanistic predictions provide valuable resources and important insights for understanding the molecular basis of human diseases/traits such as schizophrenia.
Affiliation(s)
- Lei Hou
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Xushen Xiong
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Liangzhu Laboratory, Zhejiang University, Hangzhou, China
- Yongjin Park
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Carles Boix
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Benjamin James
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Na Sun
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Liang He
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Aman Patel
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Zhizhuo Zhang
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Benoit Molinie
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Scott Steelman
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Chad Nusbaum
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- François Aguet
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Manolis Kellis
- Computer Science and Artificial Intelligence Lab, Massachusetts Institute of Technology, Cambridge, MA, USA
- The Broad Institute of Harvard and MIT, Cambridge, MA, USA
3
Anzawa H, Kinoshita K. C4S DB: Comprehensive Collection and Comparison for ChIP-Seq Database. J Mol Biol 2023:168157. [PMID: 37244568 DOI: 10.1016/j.jmb.2023.168157] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/30/2022] [Revised: 05/15/2023] [Accepted: 05/19/2023] [Indexed: 05/29/2023]
Abstract
Combining multiple binding profiles, such as transcription factors and histone modifications, is a crucial step in revealing the functions of complex biological systems. Although a massive amount of chromatin immunoprecipitation followed by sequencing (ChIP-seq) data is available, existing ChIP-seq databases or repositories focus on individual experiments, and it is difficult to elucidate orchestrated regulation by DNA-binding elements. We developed the Comprehensive Collection and Comparison for ChIP-Seq Database (C4S DB) to provide researchers with insights into the combination of DNA binding elements based on quality-assessed public ChIP-seq data. The C4S DB is based on > 16,000 human ChIP-seq experiments and provides two main web interfaces to discover the relationships between ChIP-seq data. "Gene browser" illustrates the landscape of distributions of binding elements around a specified gene, and "global similarity," a hierarchical clustering heatmap based on a similarity between two ChIP-seq experiments, gives an overview of genome-wide relations of regulatory elements. These functions promote the identification or evaluation of both gene-specific and genome-wide colocalization or mutually exclusive localization. Modern web technologies allow users to search for and aggregate large-scale experimental data through interactive web interfaces with quick responses. The C4S DB is available at https://c4s.site.
Affiliation(s)
- Hayato Anzawa
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, 980-8573, Japan; Department of System Bioinformatics, Graduate School of Information Sciences, Tohoku University, 6-3-09, Aramaki-Aza-Aoba, Aoba-ku, Sendai, 980-8579, Japan
- Kengo Kinoshita
- Department of Integrative Genomics, Tohoku Medical Megabank Organization, Tohoku University, 2-1 Seiryo-machi, Aoba-ku, Sendai, 980-8573, Japan; Department of System Bioinformatics, Graduate School of Information Sciences, Tohoku University, 6-3-09, Aramaki-Aza-Aoba, Aoba-ku, Sendai, 980-8579, Japan; Department of in Silico, Institute of Development, Aging, and Cancer, Tohoku University, 4-1 Seiryo-machi, Aoba-ku, Sendai, 980-8575, Japan
4
Wang C, Liu X, Liang J, Narita Y, Ding W, Li D, Zhang L, Wang H, Leong MML, Hou I, Gerdt C, Jiang C, Zhong Q, Tang Z, Forney C, Kottyan L, Weirauch MT, Gewurz BE, Zeng MS, Jiang S, Teng M, Zhao B. A DNA tumor virus globally reprograms host 3D genome architecture to achieve immortal growth. Nat Commun 2023; 14:1598. [PMID: 36949074 PMCID: PMC10033825 DOI: 10.1038/s41467-023-37347-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 08/28/2019] [Accepted: 03/13/2023] [Indexed: 03/24/2023] Open
Abstract
Epstein-Barr virus (EBV) immortalization of resting B lymphocytes (RBLs) to lymphoblastoid cell lines (LCLs) models human DNA tumor virus oncogenesis. RBL and LCL chromatin interaction maps are compared to identify the spatial and temporal genome architectural changes during EBV B cell transformation. EBV induces global genome reorganization where contact domains frequently merge or subdivide during transformation. Repressed B compartments in RBLs frequently switch to active A compartments in LCLs. LCLs gain 40% new contact domain boundaries. Newly gained LCL boundaries have strong CTCF binding at their borders while in RBLs, the same sites have much less CTCF binding. Some LCL CTCF sites also have EBV nuclear antigen (EBNA) leader protein EBNALP binding. LCLs have more local interactions than RBLs at LCL dependency factors and super-enhancer targets. RNA Pol II HiChIP and FISH of RBL and LCL further validate the Hi-C results. EBNA3A inactivation globally alters LCL genome interactions. EBNA3A inactivation reduces CTCF and RAD21 DNA binding. EBNA3C inactivation rewires the looping at the CDKN2A/B and AICDA loci. Disruption of a CTCF site at AICDA locus increases AICDA expression. These data suggest that EBV controls lymphocyte growth by globally reorganizing host genome architecture to facilitate the expression of key oncogenes.
Affiliation(s)
- Chong Wang
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
- Department of Diagnostic and Biological Sciences, School of Dentistry, University of Minnesota, Minneapolis, MN, 55455, USA
- Xiang Liu
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
- Jun Liang
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
- Yohei Narita
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
- Weiyue Ding
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
- Difei Li
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
- Luyao Zhang
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
- Hongbo Wang
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
- Merrin Man Long Leong
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
- Isabella Hou
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
- Catherine Gerdt
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
- Chang Jiang
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
- Department of Cancer Physiology, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
- Qian Zhong
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Sun Yat-sen University Cancer Center, Guangzhou, 510060, China
- Zhonghui Tang
- Zhongshan School of Medicine, Sun Yat-sen University, Guangzhou, 510060, China
- Carmy Forney
- Center for Autoimmune Genomics and Etiology, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, 45229, USA
- Leah Kottyan
- Center for Autoimmune Genomics and Etiology, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, 45229, USA
- Matthew T Weirauch
- Center for Autoimmune Genomics and Etiology, Cincinnati Children's Hospital Medical Center, Cincinnati, OH, 45229, USA
- Benjamin E Gewurz
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
- Mu-Sheng Zeng
- State Key Laboratory of Oncology in South China, Collaborative Innovation Center for Cancer Medicine, Guangdong Key Laboratory of Nasopharyngeal Carcinoma Diagnosis and Therapy, Sun Yat-sen University Cancer Center, Guangzhou, 510060, China
- Sizun Jiang
- Center for Virology and Vaccine Research, Beth Israel Deaconess Medical Center and Harvard Medical School, Boston, MA, 02115, USA
- Mingxiang Teng
- Department of Biostatistics and Bioinformatics, H. Lee Moffitt Cancer Center and Research Institute, Tampa, FL, 33612, USA
- Bo Zhao
- Division of Infectious Disease, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, 181 Longwood Avenue, Boston, MA, 02115, USA
5
Teng M. Statistical Analysis in ChIP-seq-Related Applications. Methods Mol Biol 2023; 2629:169-181. [PMID: 36929078 DOI: 10.1007/978-1-0716-2986-4_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/27/2023]
Abstract
Chromatin immunoprecipitation sequencing (ChIP-seq) has been widely performed to identify protein binding information along the genome. The sequencing protocol is quite flexible and mature to measure different types of protein binding as long as sequencing parameters are properly tailored to accommodate protein features. Two distinct types of protein binding are point-source-like binding by transcription factors and diffused-distribution binding by histone modifications. Consequently, statistical approaches have been proposed to address ChIP-seq-related questions according to different protein features. In this chapter, we briefly summarize statistical principles, approaches, and tools that are widely implemented in modeling ChIP-seq data, from raw data quality control to final result reporting. We discuss the key solutions in addressing eight routine questions in ChIP-seq applications. We also include discussion on approaches fitting unique data features in different ChIP-seq types. We hope this chapter will serve as a brief guide, especially for ChIP-seq beginners, to provide them with a high-level overview to understand and design processing plans for their ChIP-seq experiments.
Affiliation(s)
- Mingxiang Teng
- Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL, USA
6
Zhou X, Zheng H, Fu H, Dillehay McKillip KL, Pinney SM, Liu Y. CRAG: de novo characterization of cell-free DNA fragmentation hotspots in plasma whole-genome sequencing. Genome Med 2022; 14:138. [PMID: 36482487 PMCID: PMC9733064 DOI: 10.1186/s13073-022-01141-8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/18/2021] [Accepted: 11/14/2022] [Indexed: 12/13/2022] Open
Abstract
The fine-scale cell-free DNA fragmentation patterns in early-stage cancers are poorly understood. We developed a de novo approach to characterize the cell-free DNA fragmentation hotspots from plasma whole-genome sequencing. Hotspots are enriched in open chromatin regions and, interestingly, at the 3' end of transposons. Hotspots showed global hypo-fragmentation in early-stage liver cancers and are associated with genes involved in the initiation of hepatocellular carcinoma and with cancer stem cells. The hotspots varied across multiple early-stage cancers and demonstrated high performance for the diagnosis and identification of tissue-of-origin in early-stage cancers. We further validated the performance with a small number of independent case-control-matched early-stage cancer samples.
Affiliation(s)
- Xionghui Zhou
- Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA; Present address: Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, China
- Haizi Zheng
- Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA
- Hailu Fu
- Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA
- Kelsey L. Dillehay McKillip
- University of Cincinnati Cancer Center, Cincinnati, OH 45229, USA; Department of Pathology & Laboratory Medicine, University of Cincinnati College of Medicine, Cincinnati, OH 45229, USA
- Susan M. Pinney
- University of Cincinnati Cancer Center, Cincinnati, OH 45229, USA; Department of Environmental and Public Health Sciences, University of Cincinnati College of Medicine, Cincinnati, OH 45229, USA
- Yaping Liu
- Division of Human Genetics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA; University of Cincinnati Cancer Center, Cincinnati, OH 45229, USA; Division of Biomedical Informatics, Cincinnati Children’s Hospital Medical Center, Cincinnati, OH 45229, USA; Department of Pediatrics, University of Cincinnati College of Medicine, Cincinnati, OH 45229, USA; Department of Electrical Engineering and Computing Sciences, University of Cincinnati College of Engineering and Applied Science, Cincinnati, OH 45229, USA
7
Van den Berge K, Chou HJ, Roux de Bézieux H, Street K, Risso D, Ngai J, Dudoit S. Normalization benchmark of ATAC-seq datasets shows the importance of accounting for GC-content effects. Cell Rep Methods 2022; 2:100321. [PMID: 36452861 PMCID: PMC9701614 DOI: 10.1016/j.crmeth.2022.100321] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/01/2021] [Revised: 02/23/2022] [Accepted: 10/06/2022] [Indexed: 06/17/2023]
Abstract
The assay for transposase-accessible chromatin using sequencing (ATAC-seq) allows the study of epigenetic regulation of gene expression by assessing chromatin configuration for an entire genome. Despite its popularity, there have been limited studies investigating the analytical challenges related to ATAC-seq data, with most studies leveraging tools developed for bulk transcriptome sequencing. Here, we show that GC-content effects are omnipresent in ATAC-seq datasets. Since the GC-content effects are sample specific, they can bias downstream analyses such as clustering and differential accessibility analysis. We introduce a normalization method based on smooth-quantile normalization within GC-content bins and evaluate it together with 11 different normalization procedures on 8 public ATAC-seq datasets. Accounting for GC-content effects in the normalization is crucial for common downstream ATAC-seq data analyses, improving accuracy and interpretability. Through case studies, we show that exploratory data analysis is essential to guide the choice of an appropriate normalization method for a given dataset.
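The idea of normalizing within GC-content bins can be sketched as follows. This is a deliberately simplified illustration: it applies plain full-quantile normalization inside equal-width GC bins, not the smooth-quantile variant the abstract describes, and the function names, the binning scheme, and the toy counts are all assumptions made for the example.

```python
def quantile_normalize(columns):
    """Full-quantile normalize a list of equal-length sample columns."""
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across samples.
    reference = [sum(col[k] for col in sorted_cols) / len(columns) for k in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # indices by ascending value
        out = [0.0] * n
        for rank, i in enumerate(order):
            out[i] = reference[rank]
        normalized.append(out)
    return normalized


def normalize_within_gc_bins(columns, gc, n_bins=2):
    """Quantile-normalize samples separately inside equal-width GC bins.

    columns: one list of per-region counts per sample.
    gc: GC fraction of each region, same length as each column.
    """
    lo, hi = min(gc), max(gc)
    width = (hi - lo) / n_bins or 1.0  # avoid zero width when all GC values match
    bin_of = [min(int((g - lo) / width), n_bins - 1) for g in gc]
    result = [[0.0] * len(gc) for _ in columns]
    for b in range(n_bins):
        members = [i for i, assigned in enumerate(bin_of) if assigned == b]
        if not members:
            continue
        sub = quantile_normalize([[col[i] for i in members] for col in columns])
        for s, sub_col in enumerate(sub):
            for j, i in enumerate(members):
                result[s][i] = sub_col[j]
    return result


# Toy data: two samples, four regions (two low-GC, two high-GC).
counts = [[1, 5, 2, 8], [10, 50, 20, 80]]
gc_fraction = [0.30, 0.35, 0.60, 0.70]
normalized = normalize_within_gc_bins(counts, gc_fraction, n_bins=2)
```

Because each bin is normalized on its own, a sample-specific GC effect is equalized across samples within that bin rather than being averaged over the whole genome.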
Affiliation(s)
- Koen Van den Berge
- Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
- Department of Applied Mathematics, Computer Science and Statistics, Ghent University, Ghent, Belgium
- Bioinformatics Institute Ghent, Ghent University, Ghent, Belgium
- Hsin-Jung Chou
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
- Hector Roux de Bézieux
- Division of Biostatistics, School of Public Health, University of California, Berkeley, Berkeley, CA, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
- Kelly Street
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Davide Risso
- Department of Statistical Sciences, University of Padova, Padova, Italy
- John Ngai
- Department of Molecular and Cell Biology, University of California, Berkeley, Berkeley, CA, USA
- Helen Wills Neuroscience Institute, University of California, Berkeley, Berkeley, CA, USA
- Sandrine Dudoit
- Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
- Division of Biostatistics, School of Public Health, University of California, Berkeley, Berkeley, CA, USA
- Center for Computational Biology, University of California, Berkeley, Berkeley, CA, USA
8
Teng M, Du D, Chen D, Irizarry RA. Characterizing batch effects and binding site-specific variability in ChIP-seq data. NAR Genom Bioinform 2021; 3:lqab098. [PMID: 34661103 PMCID: PMC8515842 DOI: 10.1093/nargab/lqab098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2021] [Revised: 09/15/2021] [Accepted: 10/05/2021] [Indexed: 11/12/2022] Open
Abstract
Multiple sources of variability can bias ChIP-seq data toward inferring transcription factor (TF) binding profiles. As ChIP-seq datasets increase in public repositories, it is now possible and necessary to account for complex sources of variability in ChIP-seq data analysis. We find that two types of variability, the batch effects by sequencing laboratories and differences between biological replicates, not associated with changes in condition or state, vary across genomic sites. This implies that observed differences between samples from different conditions or states, such as cell-type, must be assessed statistically, with an understanding of the distribution of obscuring noise. We present a statistical approach that characterizes both differences of interest and these sources of variability through the parameters of a mixed effects model. We demonstrate the utility of our approach on a CTCF binding dataset composed of 211 samples representing 90 different cell-types measured across three different laboratories. The results revealed that sites exhibiting large variability were associated with sequence characteristics such as GC-content and low complexity. Finally, we identified TFs associated with high-variance CTCF sites using TF motifs documented in public databases, pointing to the possibility of these being false positives if the sources of variability are not properly accounted for.
Affiliation(s)
- Mingxiang Teng
- Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL 33612, USA
- Dongliang Du
- Department of Biostatistics and Bioinformatics, Moffitt Cancer Center, Tampa, FL 33612, USA
- Danfeng Chen
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
- Rafael A Irizarry
- Department of Data Science, Dana-Farber Cancer Institute, Boston, MA 02215, USA
9
Althouse AD, Below JE, Claggett BL, Cox NJ, de Lemos JA, Deo RC, Duval S, Hachamovitch R, Kaul S, Keith SW, Secemsky E, Teixeira-Pinto A, Roger VL. Recommendations for Statistical Reporting in Cardiovascular Medicine: A Special Report From the American Heart Association. Circulation 2021; 144:e70-e91. [PMID: 34032474 DOI: 10.1161/circulationaha.121.055393] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Indexed: 01/01/2023]
Abstract
Statistical analyses are a crucial component of the biomedical research process and are necessary to draw inferences from biomedical research data. The application of sound statistical methodology is a prerequisite for publication in the American Heart Association (AHA) journal portfolio. The objective of this document is to summarize key aspects of statistical reporting that might be most relevant to the authors, reviewers, and readership of AHA journals. The AHA Scientific Publication Committee convened a task force to inventory existing statistical standards for publication in biomedical journals and to identify approaches suitable for the AHA journal portfolio. The experts on the task force were selected by the AHA Scientific Publication Committee, who identified 12 key topics that serve as the section headers for this document. For each topic, the members of the writing group identified relevant references and evaluated them as a resource to make the standards summarized herein. Each section was independently reviewed by an expert reviewer who was not part of the task force. Expert reviewers were also permitted to comment on other sections if they chose. Differences of opinion were adjudicated by consensus. The standards presented in this report are intended to serve as a guide for high-quality reporting of statistical analyses methods and results.
Affiliation(s)
- Andrew D Althouse
- Center for Research on Health Care Data Center, Division of General Internal Medicine, University of Pittsburgh, PA (A.D.A.)
- Jennifer E Below
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN (J.E.B., N.J.C.)
- Brian L Claggett
- Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, MA (B.L.C., R.C.D.)
- Nancy J Cox
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN (J.E.B., N.J.C.)
- James A de Lemos
- Division of Cardiology, University of Texas Southwestern Medical Center, Dallas (J.A.d.L.)
- Rahul C Deo
- Division of Cardiovascular Medicine, Brigham and Women's Hospital, Boston, MA (B.L.C., R.C.D.)
- Sue Duval
- Cardiovascular Division, University of Minnesota Medical School, Minneapolis (S.D.)
- Rory Hachamovitch
- Department of Cardiovascular Medicine, Heart and Vascular Institute, Cleveland Clinic Foundation, OH (R.H.)
- Sanjay Kaul
- Department of Cardiology, Cedars-Sinai Medical Center, and the David Geffen School of Medicine, University of California, Los Angeles (S.K.)
- Scott W Keith
- Division of Biostatistics, Department of Pharmacology and Experimental Therapeutics, Sidney Kimmel Medical College of Thomas Jefferson University, Philadelphia, PA (S.W.K.)
- Eric Secemsky
- Richard A. and Susan F. Smith Center for Outcomes Research in Cardiology, Division of Cardiology, Department of Medicine, Beth Israel Deaconess Medical Center, Boston, MA (E.S.)
- Armando Teixeira-Pinto
- School of Public Health, Faculty of Medicine and Health, University of Sydney, Australia (A.T.-P.)
- Veronique L Roger
- Department of Cardiovascular Diseases Medicine, Mayo Clinic College of Medicine, Rochester, MN (V.L.R.); now with Epidemiology and Community Health Branch, National Heart, Lung and Blood Institute, National Institutes of Health, Bethesda, MD (V.L.R.)
10
Baldoni PL, Rashid NU, Ibrahim JG. Efficient detection and classification of epigenomic changes under multiple conditions. Biometrics 2021; 78:1141-1154. [PMID: 33860525 DOI: 10.1111/biom.13477] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2020] [Revised: 04/02/2021] [Accepted: 04/08/2021] [Indexed: 11/28/2022]
Abstract
Epigenomics, the study of the human genome and its interactions with proteins and other cellular elements, has become of significant interest in recent years. Such interactions have been shown to regulate essential cellular functions and are associated with multiple complex diseases. Therefore, understanding how these interactions may change across conditions is central in biomedical research. Chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq) is one of several techniques to detect local changes in epigenomic activity (peaks). However, existing methods for differential peak calling are not optimized for the diversity in ChIP-seq signal profiles, are limited to the analysis of two conditions, or cannot classify specific patterns of differential change when multiple patterns exist. To address these limitations, we present a flexible and efficient method for the detection of differential epigenomic activity across multiple conditions. We utilize data from the ENCODE Consortium and show that the presented method, epigraHMM, exhibits superior performance to current tools and it is among the fastest algorithms available, while allowing the classification of combinatorial patterns of differential epigenomic activity and the characterization of chromatin regulatory states.
Affiliation(s)
- Pedro L Baldoni
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Naim U Rashid
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
- Joseph G Ibrahim
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
11
Kim YS, Johnson GD, Seo J, Barrera A, Cowart TN, Majoros WH, Ochoa A, Allen AS, Reddy TE. Correcting signal biases and detecting regulatory elements in STARR-seq data. Genome Res 2021; 31:877-889. [PMID: 33722938 PMCID: PMC8092017 DOI: 10.1101/gr.269209.120] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 07/21/2020] [Accepted: 03/09/2021] [Indexed: 12/13/2022]
Abstract
High-throughput reporter assays such as self-transcribing active regulatory region sequencing (STARR-seq) have made it possible to measure regulatory element activity across the entire human genome at once. The resulting data, however, present substantial analytical challenges. Here, we identify technical biases that explain most of the variance in STARR-seq data. We then develop a statistical model to correct those biases and to improve detection of regulatory elements. This approach substantially improves precision and recall over current methods, improves detection of both activating and repressive regulatory elements, and controls for false discoveries despite strong local correlations in signal.
Affiliation(s)
- Young-Sook Kim
  - Department of Biostatistics and Bioinformatics, Division of Integrative Genomics, Duke University Medical School, Durham, North Carolina 27710, USA; Center for Genomic and Computational Biology, Duke University Medical School, Durham, North Carolina 27710, USA; Center for Advanced Genomic Technologies, Duke University, Durham, North Carolina 27710, USA; Duke Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina 27710, USA; Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27710, USA
- Graham D Johnson
  - Department of Biostatistics and Bioinformatics, Division of Integrative Genomics, Duke University Medical School, Durham, North Carolina 27710, USA; Center for Genomic and Computational Biology, Duke University Medical School, Durham, North Carolina 27710, USA; Center for Advanced Genomic Technologies, Duke University, Durham, North Carolina 27710, USA; Duke Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina 27710, USA
- Jungkyun Seo
  - Department of Biostatistics and Bioinformatics, Division of Integrative Genomics, Duke University Medical School, Durham, North Carolina 27710, USA; Center for Genomic and Computational Biology, Duke University Medical School, Durham, North Carolina 27710, USA; Center for Advanced Genomic Technologies, Duke University, Durham, North Carolina 27710, USA; Duke Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina 27710, USA; Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27710, USA
- Alejandro Barrera
  - Department of Biostatistics and Bioinformatics, Division of Integrative Genomics, Duke University Medical School, Durham, North Carolina 27710, USA; Center for Genomic and Computational Biology, Duke University Medical School, Durham, North Carolina 27710, USA; Center for Advanced Genomic Technologies, Duke University, Durham, North Carolina 27710, USA; Duke Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina 27710, USA
- Thomas N Cowart
  - Department of Biostatistics and Bioinformatics, Division of Integrative Genomics, Duke University Medical School, Durham, North Carolina 27710, USA; Duke Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina 27710, USA
- William H Majoros
  - Department of Biostatistics and Bioinformatics, Division of Integrative Genomics, Duke University Medical School, Durham, North Carolina 27710, USA; Center for Advanced Genomic Technologies, Duke University, Durham, North Carolina 27710, USA; Duke Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina 27710, USA; Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27710, USA
- Alejandro Ochoa
  - Department of Biostatistics and Bioinformatics, Division of Integrative Genomics, Duke University Medical School, Durham, North Carolina 27710, USA; Duke Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina 27710, USA; Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27710, USA
- Andrew S Allen
  - Department of Biostatistics and Bioinformatics, Division of Integrative Genomics, Duke University Medical School, Durham, North Carolina 27710, USA; Center for Genomic and Computational Biology, Duke University Medical School, Durham, North Carolina 27710, USA; Duke Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina 27710, USA; Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27710, USA
- Timothy E Reddy
  - Department of Biostatistics and Bioinformatics, Division of Integrative Genomics, Duke University Medical School, Durham, North Carolina 27710, USA; Center for Genomic and Computational Biology, Duke University Medical School, Durham, North Carolina 27710, USA; Center for Advanced Genomic Technologies, Duke University, Durham, North Carolina 27710, USA; Duke Center for Statistical Genetics and Genomics, Duke University, Durham, North Carolina 27710, USA; Program in Computational Biology and Bioinformatics, Duke University, Durham, North Carolina 27710, USA
12
Awdeh A, Turcotte M, Perkins TJ. WACS: improving ChIP-seq peak calling by optimally weighting controls. BMC Bioinformatics 2021; 22:69. [PMID: 33588754 PMCID: PMC7885521 DOI: 10.1186/s12859-020-03927-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/04/2019] [Accepted: 12/09/2020] [Indexed: 01/21/2023] Open
Abstract
Background:
Chromatin immunoprecipitation followed by high throughput sequencing (ChIP-seq), initially introduced more than a decade ago, is widely used by the scientific community to detect protein/DNA binding and histone modifications across the genome. Every experiment is prone to noise and bias, and ChIP-seq experiments are no exception. To alleviate bias, the incorporation of control datasets in ChIP-seq analysis is an essential step. The controls are used to account for the background signal, while the remainder of the ChIP-seq signal captures true binding or histone modification. However, a recurrent issue is different types of bias in different ChIP-seq experiments. Depending on which controls are used, different aspects of ChIP-seq bias are better or worse accounted for, and peak calling can produce different results for the same ChIP-seq experiment. Consequently, generating “smart” controls, which model the non-signal effect for a specific ChIP-seq experiment, could enhance contrast and increase the reliability and reproducibility of the results.
Results:
We propose a peak calling algorithm, Weighted Analysis of ChIP-seq (WACS), which is an extension of the well-known peak caller MACS2. There are two main steps in WACS: First, weights are estimated for each control using non-negative least squares regression. The goal is to customize controls to model the noise distribution for each ChIP-seq experiment. This is then followed by peak calling. We demonstrate that WACS significantly outperforms MACS2 and AIControl, another recent algorithm for generating smart controls, in the detection of enriched regions along the genome, in terms of motif enrichment and reproducibility analyses.
Conclusions:
This ultimately improves our understanding of ChIP-seq controls and their biases, and shows that WACS results in a better approximation of the noise distribution in controls.
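The weight-estimation step named in this abstract (one non-negative weight per control, fitted by non-negative least squares) can be illustrated with a small sketch. This is not the WACS implementation: the solver below is a plain coordinate-descent routine chosen to keep the example dependency-free, and the function name, the bin counts, and the two-control setup are all hypothetical.

```python
def nnls_coordinate_descent(controls, chip, n_iter=200):
    """Find weights w >= 0 minimising ||chip - sum_j w[j] * controls[j]||^2.

    `controls` is a list of per-bin count vectors (one per control track) and
    `chip` is the per-bin count vector of the ChIP track. Both are
    illustrative stand-ins, not the data layout of any real tool.
    """
    weights = [0.0] * len(controls)
    residual = list(chip)  # chip minus the current weighted sum of controls
    for _ in range(n_iter):
        for j, ctrl in enumerate(controls):
            norm_sq = sum(c * c for c in ctrl)
            if norm_sq == 0.0:
                continue  # an all-zero control carries no information
            # Optimum for weight j with the other weights held fixed,
            # clipped at zero so the weight stays non-negative.
            step = sum(c * r for c, r in zip(ctrl, residual)) / norm_sq
            new_w = max(0.0, weights[j] + step)
            delta = new_w - weights[j]
            if delta != 0.0:
                residual = [r - delta * c for r, c in zip(residual, ctrl)]
                weights[j] = new_w
    return weights


# Hypothetical read counts per genomic bin for two controls and one ChIP track.
controls = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
chip = [2.0, 4.0, 6.0, 8.0]  # constructed as exactly twice the first control
weights = nnls_coordinate_descent(controls, chip)

# The weighted combination of controls serves as the customised background.
weighted_control = [sum(w * c[i] for w, c in zip(weights, controls))
                    for i in range(len(chip))]
```

In practice a library solver such as `scipy.optimize.nnls` would be the usual choice; the hand-written loop is used here only so the sketch runs without extra packages.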
Affiliation(s)
- Aseel Awdeh
  - School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, K1N6N5, Canada; Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, K1H8L6, Canada
- Marcel Turcotte
  - School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, K1N6N5, Canada
- Theodore J Perkins
  - School of Electrical Engineering and Computer Science, University of Ottawa, Ottawa, K1N6N5, Canada; Regenerative Medicine Program, Ottawa Hospital Research Institute, Ottawa, K1H8L6, Canada; Department of Biochemistry, Microbiology and Immunology, University of Ottawa, Ottawa, K1H8M5, Canada
13
Zhang T, Wang R, Jiang Q, Wang Y. An Information Gain-based Method for Evaluating the Classification Power of Features Towards Identifying Enhancers. Curr Bioinform 2020. [DOI: 10.2174/1574893614666191120141032] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/22/2022]
Abstract
Background:
Enhancers are cis-regulatory elements that enhance gene expression on DNA sequences. Since most enhancers are located far from transcription start sites, it is difficult to identify them. As with other regulatory elements, the regions around enhancers contain a variety of features, which can help in enhancer recognition.
Objective:
The classification power of features differs significantly, and the performance of existing methods that use one or a few features for identifying enhancers varies greatly. Therefore, evaluating the classification power of each feature can improve the predictive performance for enhancers.
Methods:
We present an evaluation method based on Information Gain (IG) that captures the entropy change of enhancer recognition according to features. To validate the performance of our method, experiments using the Single Feature Prediction Accuracy (SFPA) were conducted on each feature.
Results:
The average IG values of the sequence feature, transcriptional feature and epigenetic feature are 0.068, 0.213, and 0.299, respectively. Through SFPA, the average AUC values of the sequence feature, transcriptional feature and epigenetic feature are 0.534, 0.605, and 0.647, respectively. The verification results are consistent with our evaluation results.
Conclusion:
This IG-based method can effectively evaluate the classification power of features for identifying enhancers. Compared with sequence features, epigenetic features are more effective for recognizing enhancers.
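As an illustration of the information gain measure named in the Methods, the sketch below computes IG as the entropy of the class labels minus their entropy after conditioning on one feature. It assumes the feature has already been discretised; the abstract does not say how the authors handle continuous features, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the label entropy within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data: 1 = enhancer, 0 = non-enhancer (hypothetical labels).
labels = [1, 1, 0, 0]
perfect_feature = ["high", "high", "low", "low"]   # separates the classes
useless_feature = ["high", "low", "high", "low"]   # independent of the class

ig_perfect = information_gain(perfect_feature, labels)
ig_useless = information_gain(useless_feature, labels)
```

With balanced binary labels, a feature that separates the classes perfectly reaches the maximum of 1 bit, while a feature independent of the class scores 0, which is the sense in which a higher IG indicates greater classification power.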
Affiliation(s)
- Tianjiao Zhang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Rongjie Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Qinghua Jiang
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
14
Johnstone SE, Reyes A, Qi Y, Adriaens C, Hegazi E, Pelka K, Chen JH, Zou LS, Drier Y, Hecht V, Shoresh N, Selig MK, Lareau CA, Iyer S, Nguyen SC, Joyce EF, Hacohen N, Irizarry RA, Zhang B, Aryee MJ, Bernstein BE. Large-Scale Topological Changes Restrain Malignant Progression in Colorectal Cancer. Cell 2020; 182:1474-1489.e23. [PMID: 32841603 PMCID: PMC7575124 DOI: 10.1016/j.cell.2020.07.030] [Citation(s) in RCA: 92] [Impact Index Per Article: 23.0] [Received: 07/08/2019] [Revised: 05/04/2020] [Accepted: 07/20/2020] [Indexed: 02/06/2023]
Abstract
Widespread changes to DNA methylation and chromatin are well documented in cancer, but the fate of higher-order chromosomal structure remains obscure. Here we integrated topological maps for colon tumors and normal colons with epigenetic, transcriptional, and imaging data to characterize alterations to chromatin loops, topologically associated domains, and large-scale compartments. We found that spatial partitioning of the open and closed genome compartments is profoundly compromised in tumors. This reorganization is accompanied by compartment-specific hypomethylation and chromatin changes. Additionally, we identify a compartment at the interface between the canonical A and B compartments that is reorganized in tumors. Remarkably, similar shifts were evident in non-malignant cells that have accumulated excess divisions. Our analyses suggest that these topological changes repress stemness and invasion programs while inducing anti-tumor immunity genes and may therefore restrain malignant progression. Our findings call into question the conventional view that tumor-associated epigenomic alterations are primarily oncogenic.
Affiliation(s)
- Sarah E Johnstone
  - Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Center for Cancer Research, Massachusetts General Hospital, Boston, MA 02129, USA
- Alejandro Reyes
  - Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Department of Data Sciences, Dana Farber Cancer Institute, Boston, MA 02215, USA; Department of Biostatistics, Harvard School of Public Health, Boston, MA 02215, USA
- Yifeng Qi
  - Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Carmen Adriaens
  - Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Center for Cancer Research, Massachusetts General Hospital, Boston, MA 02129, USA
- Esmat Hegazi
  - Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Center for Cancer Research, Massachusetts General Hospital, Boston, MA 02129, USA
- Karin Pelka
  - Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Center for Cancer Research, Massachusetts General Hospital, Boston, MA 02129, USA
- Jonathan H Chen
  - Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Center for Cancer Research, Massachusetts General Hospital, Boston, MA 02129, USA
- Luli S Zou
  - Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Department of Data Sciences, Dana Farber Cancer Institute, Boston, MA 02215, USA; Department of Biostatistics, Harvard School of Public Health, Boston, MA 02215, USA
- Yotam Drier
  - The Lautenberg Center for Immunology and Cancer Research, The Hebrew University, Jerusalem, Israel
- Vivian Hecht
  - Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA
- Noam Shoresh
  - Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA
- Martin K Selig
  - Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA
- Caleb A Lareau
  - Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Program in Biological and Biomedical Sciences, Harvard Medical School, Boston, MA 02215, USA
- Sowmya Iyer
  - Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA
- Son C Nguyen
  - Department of Genetics, Penn Epigenetics Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Eric F Joyce
  - Department of Genetics, Penn Epigenetics Institute, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Nir Hacohen
  - Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Center for Cancer Research, Massachusetts General Hospital, Boston, MA 02129, USA
- Rafael A Irizarry
  - Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Department of Data Sciences, Dana Farber Cancer Institute, Boston, MA 02215, USA; Department of Biostatistics, Harvard School of Public Health, Boston, MA 02215, USA
- Bin Zhang
  - Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Department of Chemistry, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Martin J Aryee
  - Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Center for Cancer Research, Massachusetts General Hospital, Boston, MA 02129, USA; Department of Biostatistics, Harvard School of Public Health, Boston, MA 02215, USA
- Bradley E Bernstein
  - Department of Pathology, Massachusetts General Hospital and Harvard Medical School, Boston, MA 02114, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02139, USA; Center for Cancer Research, Massachusetts General Hospital, Boston, MA 02129, USA
15
Partridge EC, Chhetri SB, Prokop JW, Ramaker RC, Jansen CS, Goh ST, Mackiewicz M, Newberry KM, Brandsmeier LA, Meadows SK, Messer CL, Hardigan AA, Coppola CJ, Dean EC, Jiang S, Savic D, Mortazavi A, Wold BJ, Myers RM, Mendenhall EM. Occupancy maps of 208 chromatin-associated proteins in one human cell type. Nature 2020; 583:720-728. [PMID: 32728244 PMCID: PMC7398277 DOI: 10.1038/s41586-020-2023-4] [Citation(s) in RCA: 70] [Impact Index Per Article: 17.5] [Received: 10/04/2017] [Accepted: 01/09/2020] [Indexed: 01/02/2023]
Abstract
Transcription factors are DNA-binding proteins that have key roles in gene regulation [1,2]. Genome-wide occupancy maps of transcriptional regulators are important for understanding gene regulation and its effects on diverse biological processes [3–6]. However, only a minority of the more than 1,600 transcription factors encoded in the human genome has been assayed. Here we present, as part of the ENCODE (Encyclopedia of DNA Elements) project, data and analyses from chromatin immunoprecipitation followed by high-throughput sequencing (ChIP–seq) experiments using the human HepG2 cell line for 208 chromatin-associated proteins (CAPs). These comprise 171 transcription factors and 37 transcriptional cofactors and chromatin regulator proteins, and represent nearly one-quarter of CAPs expressed in HepG2 cells. The binding profiles of these CAPs form major groups associated predominantly with promoters or enhancers, or with both. We confirm and expand the current catalogue of DNA sequence motifs for transcription factors, and describe motifs that correspond to other transcription factors that are co-enriched with the primary ChIP target. For example, FOX family motifs are enriched in ChIP–seq peaks of 37 other CAPs. We show that motif content and occupancy patterns can distinguish between promoters and enhancers. This catalogue reveals high-occupancy target regions at which many CAPs associate, although each contains motifs for only a minority of the numerous associated transcription factors. These analyses provide a more complete overview of the gene regulatory networks that define this cell type, and demonstrate the usefulness of the large-scale production efforts of the ENCODE Consortium.
ChIP–seq and CETCh–seq data are used to analyse binding maps for 208 transcription factors and other chromatin-associated proteins in a single human cell type, providing a comprehensive catalogue of the transcription factor landscape and gene regulatory networks in these cells.
Affiliation(s)
- Surya B Chhetri
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; Department of Biological Sciences, The University of Alabama in Huntsville, Huntsville, AL, USA; Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Jeremy W Prokop
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; Department of Pediatrics and Human Development, College of Human Medicine, Michigan State University, Grand Rapids, MI, USA
- Ryne C Ramaker
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; Department of Genetics, University of Alabama at Birmingham, Birmingham, AL, USA
- Camden S Jansen
  - Department of Developmental and Cell Biology, University of California Irvine, Irvine, CA, USA
- Say-Tar Goh
  - Division of Biology, California Institute of Technology, Pasadena, CA, USA
- Mark Mackiewicz
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
- Sarah K Meadows
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
- C Luke Messer
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
- Andrew A Hardigan
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; Department of Genetics, University of Alabama at Birmingham, Birmingham, AL, USA
- Candice J Coppola
  - Department of Biological Sciences, The University of Alabama in Huntsville, Huntsville, AL, USA
- Emma C Dean
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; Department of Pathology, University of Alabama at Birmingham, Birmingham, AL, USA
- Shan Jiang
  - Department of Developmental and Cell Biology, University of California Irvine, Irvine, CA, USA
- Daniel Savic
  - Pharmaceutical Sciences Department, St Jude Children's Research Hospital, Memphis, TN, USA
- Ali Mortazavi
  - Department of Developmental and Cell Biology, University of California Irvine, Irvine, CA, USA
- Barbara J Wold
  - Division of Biology, California Institute of Technology, Pasadena, CA, USA
- Richard M Myers
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
- Eric M Mendenhall
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA; Department of Biological Sciences, The University of Alabama in Huntsville, Huntsville, AL, USA
16
Michael AK, Grand RS, Isbel L, Cavadini S, Kozicka Z, Kempf G, Bunker RD, Schenk AD, Graff-Meyer A, Pathare GR, Weiss J, Matsumoto S, Burger L, Schübeler D, Thomä NH. Mechanisms of OCT4-SOX2 motif readout on nucleosomes. Science 2020; 368:1460-1465. [PMID: 32327602 DOI: 10.1126/science.abb0074] [Citation(s) in RCA: 127] [Impact Index Per Article: 31.8] [Received: 01/22/2020] [Accepted: 04/16/2020] [Indexed: 12/12/2022]
Abstract
Transcription factors (TFs) regulate gene expression through chromatin where nucleosomes restrict DNA access. To study how TFs bind nucleosome-occupied motifs, we focused on the reprogramming factors OCT4 and SOX2 in mouse embryonic stem cells. We determined TF engagement throughout a nucleosome at base-pair resolution in vitro, enabling structure determination by cryo-electron microscopy at two preferred positions. Depending on motif location, OCT4 and SOX2 differentially distort nucleosomal DNA. At one position, OCT4-SOX2 removes DNA from histone H2A and histone H3; however, at an inverted motif, the TFs only induce local DNA distortions. OCT4 uses one of its two DNA-binding domains to engage DNA in both structures, reading out a partial motif. These findings explain site-specific nucleosome engagement by the pluripotency factors OCT4 and SOX2, and they reveal how TFs distort nucleosomes to access chromatinized motifs.
Affiliation(s)
- Alicia K Michael
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
- Ralph S Grand
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
- Luke Isbel
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
- Simone Cavadini
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
- Zuzanna Kozicka
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland; Faculty of Science, University of Basel, Petersplatz 1, 4003 Basel, Switzerland
- Georg Kempf
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
- Richard D Bunker
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
- Andreas D Schenk
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
- Alexandra Graff-Meyer
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
- Ganesh R Pathare
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
- Joscha Weiss
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
- Syota Matsumoto
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
- Lukas Burger
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland; Swiss Institute of Bioinformatics, 4058 Basel, Switzerland
- Dirk Schübeler
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland; Faculty of Science, University of Basel, Petersplatz 1, 4003 Basel, Switzerland
- Nicolas H Thomä
  - Friedrich Miescher Institute for Biomedical Research, Maulbeerstrasse 66, 4058 Basel, Switzerland
17
Ma T, Ye Z, Wang L. Genome Wide Approaches to Identify Protein-DNA Interactions. Curr Med Chem 2020; 26:7641-7654. [PMID: 29848263 DOI: 10.2174/0929867325666180530115711] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 12/30/2017] [Revised: 02/27/2018] [Accepted: 05/11/2018] [Indexed: 12/15/2022]
Abstract
Background:
Transcription factors are DNA-binding proteins that play key roles in many fundamental biological processes. Unraveling their interactions with DNA is essential to identify their target genes and understand the regulatory network. Genome-wide identification of their binding sites became feasible thanks to recent progress in experimental and computational approaches. ChIP-chip, ChIP-seq, and ChIP-exo are three widely used techniques to demarcate genome-wide transcription factor binding sites.
Objective:
This review aims to provide an overview of these three techniques, including their experimental procedures, computational approaches, and popular analytic tools.
Conclusion:
ChIP-chip, ChIP-seq, and ChIP-exo have been the major techniques to study genome-wide in vivo protein-DNA interaction. Due to the rapid development of next-generation sequencing technology, array-based ChIP-chip is deprecated and ChIP-seq has become the most widely used technique to identify transcription factor binding sites genome-wide. The newly developed ChIP-exo further improves the spatial resolution to a single nucleotide. Numerous tools have been developed to analyze ChIP-chip, ChIP-seq and ChIP-exo data. However, different programs may employ different mechanisms or underlying algorithms, so each inherently carries its own set of statistical assumptions and biases. Choosing the most appropriate analytic program for a given experiment therefore needs careful consideration. Moreover, most programs only have a command-line interface, so their installation and usage require basic computational expertise in Unix/Linux.
Affiliation(s)
- Tao Ma
  - Division of Biomedical Statistics and Informatics, Mayo Clinic College of Medicine, Rochester, MN 55905, United States
- Zhenqing Ye
  - Division of Biomedical Statistics and Informatics, Mayo Clinic College of Medicine, Rochester, MN 55905, United States
- Liguo Wang
  - Division of Biomedical Statistics and Informatics, Mayo Clinic College of Medicine, Rochester, MN 55905, United States
18
Extensive sex differences at the initiation of genetic recombination. Nature 2018; 561:338-342. [PMID: 30185906 PMCID: PMC6364566 DOI: 10.1038/s41586-018-0492-5] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 12/15/2017] [Accepted: 07/18/2018] [Indexed: 12/28/2022]
Abstract
Meiotic recombination differs between males and females; however, when and how these differences are established is unknown. Here we identify extensive sex differences at recombination initiation by mapping hotspots of meiotic DNA double-strand breaks in male and female mice. Contrary to past findings in humans, few hotspots are used uniquely in either sex. Instead, grossly different recombination landscapes result from up to 15-fold differences in hotspot usage between males and females. Indeed, most recombination occurs at sex-biased hotspots. Sex-biased hotspots appear to be partly determined by chromosome structure, and DNA methylation, absent in females at the onset of meiosis, plays a substantial role. Sex differences are also evident later in meiosis as the repair frequency of distal meiotic breaks as crossovers diverges in males and females. Suppression of distal crossovers may help to minimize age-related aneuploidy that arises due to cohesion loss during dictyate arrest in females.
19
Kelley DR, Reshef YA, Bileschi M, Belanger D, McLean CY, Snoek J. Sequential regulatory activity prediction across chromosomes with convolutional neural networks. Genome Res 2018; 28:739-750. [PMID: 29588361 PMCID: PMC5932613 DOI: 10.1101/gr.227819.117] [Citation(s) in RCA: 214] [Impact Index Per Article: 35.7] [Received: 07/17/2017] [Accepted: 03/23/2018] [Indexed: 01/10/2023]
Abstract
Models for predicting phenotypic outcomes from genotypes have important applications to understanding genomic function and improving human health. Here, we develop a machine-learning system to predict cell-type-specific epigenetic and transcriptional profiles in large mammalian genomes from DNA sequence alone. By use of convolutional neural networks, this system identifies promoters and distal regulatory elements and synthesizes their content to make effective gene expression predictions. We show that model predictions for the influence of genomic variants on gene expression align well to causal variants underlying eQTLs in human populations and can be useful for generating mechanistic hypotheses to enable fine mapping of disease loci.
Affiliation(s)
- Yakir A Reshef
  - Department of Computer Science, Harvard University, Cambridge, Massachusetts 02138, USA
- Jasper Snoek
  - Google Brain, Cambridge, Massachusetts 02142, USA